vponomaryov opened 1 year ago
We wait for readiness; I wonder if this is a case where the node becomes ready without having a quorum.
@vponomaryov do we have a run of this with the latest scylla-core master?
- 5.1.0-rc3
I didn't run it with master. Only with the specified versions in the bug report.
Can we run it at least once with the master version?
And then we should try to have a dtest reproducer of it. Our rolling restart process shouldn't be that brittle.
> Can we run it at least once with master version ?
> And then we should try having a dtest reproducer of it. our rolling restart process shouldn't be that brittle.
The error is expected, because a Scylla node always becomes UN for itself first and only later for the other nodes.
So what we really need to preserve is the slowness we get on the GKE platform.
For example, this bug is not reproduced using all-in-one K8S (kind).
So a dtest reproducer will need some tricks to cause slow status updates.
We can slow down traffic between two nodes.
The probability of hitting this bug is very high.
Our `nodetool flush` and reshard K8S test, which triggers a rollout restart, constantly suffers from it.
Latest failure is here: https://jenkins.scylladb.com/job/scylla-operator/job/operator-master/job/functional/job/functional-k8s-local-kind-aws/25/consoleFull
The corresponding nemesis is already being skipped constantly.
@tnozicka - I'm unsure whether our logic shouldn't be converted to 'ask other nodes what node X's status is' instead of 'ask node X what its status is'...
Yeah, currently readiness projects the ability to serve CONSISTENCY_ONE, and we are talking about migrating it to project CONSISTENCY_QUORUM.
@rzetelskik can you summarize the update of your discoveries we talked about today?
> Yeah, currently readiness project ability to talk CONSISTENCY_ONE and we are talking about migrating it project CONSISTENCY_QUORUM.
> @rzetelskik can you summarize the update of your discoveries we talked about today?
@rzetelskik @tnozicka can one of you update on this one, and shed some light on what "project CONSISTENCY_QUORUM" means in this context, because I don't understand what it means...
> @rzetelskik can you summarize the update of your discoveries we talked about today?
Sure. In an effort to reproduce this I've added a simple coroutine to our E2E test suite that'd just send read queries periodically with quorum consistency and it caused the suite to fail quite consistently, so I haven't yet followed up on why this only kept happening on GKE for you guys, since the issue seems more general anyway.
Our readiness probe only checks whether the node considers itself UN. We've hit issues related to it and discussed it many times so that's nothing new. By the time a node declares readiness, the gossip should've settled, although nothing stops it from settling with a split brain, so it's quite easy to hit the following sequence:
```
{1: {1: UN, 2: UN, 3: UN}, 2: {1: UN, 2: UN, 3: UN}, 3: {1: UN, 2: UN, 3: UN}}
{1: {1: UN, 2: UN, 3: DN}, 2: {1: UN, 2: UN, 3: DN}, 3: {1: DN, 2: DN, 3: UN}}
{1: {1: UN, 2: DN, 3: DN}, 3: {1: DN, 2: DN, 3: UN}}
```
At this point you'll get errors from queries with quorum consistency regardless of which node you hit.
So we've discussed different approaches to how we could modify the readiness probe or the operator to get closer to what the documentation advises, i.e. all nodes UP before performing a topology change. Essentially we want our readiness probes' semantics to no longer just be "I'm ready to serve traffic" but "I'm ready for a topology change". However we weren't able to just declare readiness when, and only when, a node considers all of its peers UN, since that caused some issues with rollouts.
The current idea suggested by @tnozicka is that each node should ask all of its peers about the statuses of their peers and only declare readiness if all of them consider at least N-1 of their peers UN, where N is the total number of peers.
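A toy sketch of that heuristic (hypothetical function and data shapes, not the operator's actual API) applied to the gossip views from the sequence above:

```python
# Sketch of the proposed readiness condition: collect every peer's gossip
# view and declare readiness only if each peer sees at least N-1 of the
# N nodes as UN. Names and data shapes here are illustrative only.

def ready_for_topology_change(views):
    """views: {node_id: {peer_id: status}} - each node's gossip view."""
    n = len(views)
    for view in views.values():
        un = sum(1 for status in view.values() if status == "UN")
        if un < n - 1:
            return False
    return True

# Healthy cluster: everyone sees everyone as UN.
healthy = {i: {j: "UN" for j in (1, 2, 3)} for i in (1, 2, 3)}

# Split brain from the sequence above: node 3 disagrees with nodes 1 and 2.
split = {
    1: {1: "UN", 2: "UN", 3: "DN"},
    2: {1: "UN", 2: "UN", 3: "DN"},
    3: {1: "DN", 2: "DN", 3: "UN"},
}

print(ready_for_topology_change(healthy))  # True
print(ready_for_topology_change(split))    # False
```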
> @rzetelskik @tnozicka can one of you update on this one, and shed some light on what's "project CONSISTENCY_QUORUM" in this context, cause I don't understand what does it mean...
I think what Tomas's message meant is that currently the readiness probe only declares that a given node is able to serve queries with consistency ONE (which is only true if that node replicates the data in question) and at that point we were discussing implementing a heuristic where a node would check whether a quorum of its peers is UN.
It happens not only when changing the number of cores; the same situation happened on GKE during a typical rolling restart. node-2 logs:
```
INFO 2023-01-29 02:30:46,965 [shard 0] init - Scylla version 5.3.0~dev-0.20230127.5eadea301ecc initialization completed.
WARN 2023-01-29 02:31:34,956 [shard 0] cdc - Could not update CDC description table with generation (2023/01/29 01:40:35, 970df7c8-82ce-403b-b4a9-4804eaa2cee8): exceptions::unavailable_exception (Cannot achieve consistency level for cl QUORUM. Requires 2, alive 1). Will try again.
```
Just before the cdc errors, node-1 restarted:
```
2023-01-29 02:30:57,175 INFO waiting for scylla to stop
INFO 2023-01-29 02:30:57,180 [shard 0] compaction_manager - Asked to stop
```
Cluster is not operable for CL=QUORUM when RF=3
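That follows from the quorum arithmetic, matching the `Requires 2, alive 1` in the CDC warning above: QUORUM needs floor(RF/2) + 1 live replicas, so with RF=3 it tolerates one replica down but not two. A quick illustration:

```python
# Quorum size for a replication factor: floor(RF / 2) + 1 replicas.
def quorum(rf: int) -> int:
    return rf // 2 + 1

print(quorum(3))       # 2: QUORUM with RF=3 needs 2 live replicas
print(1 >= quorum(3))  # False: 1 replica alive -> unavailable_exception
print(2 >= quorum(3))  # True: 2 replicas alive are enough
```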
During this run we had only one RollingRestart nemesis and it failed. Observed on GKE.
Kernel Version: 5.15.0-1023-gke
Scylla version (or git commit hash): 5.3.0~dev-20230127.5eadea301ecc
with build-id 057196c711d6c52ca6781558c576c858a3cf049b
Operator Image: scylladb/scylla-operator:latest Operator Helm Version: v1.9.0-alpha.0-41-g04c20d3 Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest Cluster size: 4 nodes (n1-standard-8)
Scylla Nodes used in this run: No resources left at the end of the run
OS / Image: N/A
(k8s-gke: us-east1)
Test: longevity-scylla-operator-3h-gke
Test id: c3397295-e68c-4ce5-bc39-3bd16e034133
Test name: scylla-operator/operator-master/gke/longevity-scylla-operator-3h-gke
Test config file(s):
@soyacz yes, we are aware this can happen at any rolling restart, and for some reason it happens more often on GKE (we don't know the exact reason; it's probably related to some network configuration)
anyhow, we are waiting for #1108 which supposedly fixes the issue
we still need to figure out how to fix it the right way in the operator, but this procedure is likely broken in Scylla Cloud and regular Scylla deployments as well, so I've filed https://github.com/scylladb/scylladb/issues/13156
@vponomaryov btw. does QA have a test that tries to write to scylla with CL=LOCAL_QUORUM during a rolling restart, as per https://docs.scylladb.com/stable/operating-scylla/procedures/config-change/rolling-restart.html ?
> @vponomaryov btw. does QA have a test that tries to write to scylla with CL=LOCAL_QUORUM during a rolling restart as for https://docs.scylladb.com/stable/operating-scylla/procedures/config-change/rolling-restart.html ?
When working with one DC, LOCAL_QUORUM = QUORUM, and all our tests are writing/reading with QUORUM.
good. I am asking because I heard an opinion that CL=LOCAL_QUORUM may not be an invariant of a rolling restart. (Although I don't share that view.)
Issue persists in v1.10.0-rc.0.
What I noticed: e.g. when issuing a decommission right after a node starts, the decommission fails with an error:
```
E0817 15:45:52.814551 1 sidecar/controller.go:143] syncing key 'scylla/sct-cluster-us-east1-b-us-east1-2' failed: can't decommision a node: can't decommission the node: agent [HTTP 500] std::runtime_error (decommission[3d0d13b7-d1e2-47de-9a24-20560dd6ec71]: Cannot start: nodes={10.48.24.248, 10.48.24.36} needed for decommission operation are down. It is highly recommended to fix the down nodes and try again.)
```
It only fails until the gossip data is updated on the node; when that happens, the decommission continues and passes.
More details can be found in: Argus- error during decommission
We still hit it. It gets caught too often.
The rolling restart disruption info:
sct-cluster-us-east1-b-us-east1-1 2023-09-03 01:21:04 - 2023-09-03 01:25:19
Trigger of the rolling restart:
```
2023-09-03 01:21:05,592 f:base.py l:142 c:LocalCmdRunner p:DEBUG > Command "kubectl --cache-dir=/home/ubuntu/sct-results/20230902-232320-224619/.kube/http-cache --namespace=scylla patch scyllacluster sct-cluster --type merge -p '{"spec": {"forceRedeploymentReason": "Triggered at 1693704065.4157088"}}'" finished with status 0
```
node-2 logs:
```
INFO 2023-09-03 01:17:47,705 [shard 3] compaction - [Compact keyspace1.standard1 af8d8020-49f7-11ee-a42b-c8cedf9ecb3d] Compacted 2 sstables to [/var/lib/scylla/data/keyspace1/standard1-7816a1d049ed11eebba5c5bd07980710/me-3g92_03lk_3nlnk2hvkncqu8cj0d-big-Data.db:level=0]. 75MB to 72MB (~96% of original) in 3089ms = 24MB/s. ~304384 total partitions merged to 290432.
2023-09-03 01:21:06,047 INFO waiting for scylla to stop
INFO 2023-09-03 01:21:06,047 [shard 0] compaction_manager - Asked to stop
...
INFO 2023-09-03 01:21:08,527 [shard 0] init - Scylla version 5.4.0~dev-0.20230901.3bdbe620aa1e shutdown complete.
...
I0903 01:21:27.813689 1 operator/sidecar.go:147] sidecar version "v1.11.0-alpha.0-58-g6b310ee"
...
I0903 01:21:28.502851 1 operator/sidecar.go:360] "Starting scylla"
...
INFO 2023-09-03 01:21:44,450 [shard 0] storage_service - entering NORMAL mode
...
WARN 2023-09-03 01:21:44,456 [shard 0] cdc - Could not update CDC description table with generation (2023/09/03 01:12:20, 6d6145f2-4783-464e-934c-c8aeabb49da0): exceptions::unavailable_exception (Cannot achieve consistency level for cl QUORUM. Requires 2, alive 1). Will retry in the background.
...
INFO 2023-09-03 01:21:56,466 [shard 0] init - Scylla version 5.4.0~dev-0.20230901.3bdbe620aa1e initialization completed.
WARN 2023-09-03 01:22:44,457 [shard 0] cdc - Could not update CDC description table with generation (2023/09/03 01:12:20, 6d6145f2-4783-464e-934c-c8aeabb49da0): exceptions::unavailable_exception (Cannot achieve consistency level for cl QUORUM. Requires 2, alive 1). Will try again.
INFO 2023-09-03 01:23:21,520 [shard 0] storage_service - Node 10.72.1.33 state jump to normal
INFO 2023-09-03 01:23:21,527 [shard 0] storage_service - Node 10.72.2.65 state jump to normal
```
Node-1 logs:
```
INFO 2023-09-03 01:19:54,367 [shard 3] compaction - [Compact keyspace1.standard1 f8aa3c80-49f7-11ee-afd5-cea0326f1042] Compacted 2 sstables to [/var/lib/scylla/data/keyspace1/standard1-7816a1d049ed11eebba5c5bd07980710/me-3g92_03oz_1mxww2o9ggflhfx9aa-big-Data.db:level=0]. 262MB to 177MB (~67% of original) in 7092ms = 37MB/s. ~1041152 total partitions merged to 699954.
INFO 2023-09-03 01:21:06,079 [shard 0] gossip - Got shutdown message from 10.72.7.246, received_generation=1693703423, local_generation=1693703423
INFO 2023-09-03 01:21:06,081 [shard 0] gossip - InetAddress 10.72.7.246 is now DOWN, status = shutdown
INFO 2023-09-03 01:21:08,929 [shard 0] raft_group_registry - marking Raft server 47b25347-bd20-45d8-83b5-5916710c7b2e as dead for raft groups
2023-09-03 01:22:03,669 INFO waiting for scylla to stop
INFO 2023-09-03 01:22:03,669 [shard 0] compaction_manager - Asked to stop
INFO 2023-09-03 01:22:03,669 [shard 0] init - Signal received; shutting down
INFO 2023-09-03 01:22:03,669 [shard 0] init - Shutting down view builder ops
INFO 2023-09-03 01:22:03,669 [shard 0] view - Draining view builder
INFO 2023-09-03 01:22:03,669 [shard 0] compaction_manager - Stopped
...
INFO 2023-09-03 01:23:24,650 [shard 0] init - Scylla version 5.4.0~dev-0.20230901.3bdbe620aa1e shutdown complete.
...
I0903 01:23:42.450382 1 operator/sidecar.go:147] sidecar version "v1.11.0-alpha.0-58-g6b310ee"
...
I0903 01:23:43.105871 1 operator/sidecar.go:360] "Starting scylla"
```
Loader failure:
```
2023-09-03 01:22:03,888 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:CRITICAL > java.io.IOException: Operation x10 on key(s) [3939333332504e503030]: Error executing: (WriteFailureException): Cassandra failure during write query at consistency QUORUM (2 responses were required but only 1 replica responded, 1 failed)
```
So, from the logs above we see that node-2 gets restarted first, and then node-1 gets restarted right after the `init - Scylla version 5.4.0~dev-0.20230901.3bdbe620aa1e initialization completed.` message for node-2.
But it should have been restarted only after the `INFO 2023-09-03 01:23:21,527 [shard 0] storage_service - Node 10.72.2.65 state jump to normal` message.
Kernel Version: 5.15.0-1038-gke
Scylla version (or git commit hash): 5.4.0~dev-20230901.3bdbe620aa1e
with build-id f44ecb6140c40cc466146304eb075258461d5dd7
Operator Image: scylladb/scylla-operator:latest Operator Helm Version: v1.11.0-alpha.0-58-g6b310ee Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest Cluster size: 4 nodes (n1-standard-8)
Scylla Nodes used in this run: No resources left at the end of the run
OS / Image: scylladb/scylladb:5.4.0-dev-0.20230901.3bdbe620aa1e
(k8s-gke: undefined_region)
Test: longevity-scylla-operator-3h-gke
Test id: 47c37575-94c9-4b01-a99d-c41b87360927
Test name: scylla-operator/operator-master/gke/longevity-scylla-operator-3h-gke
Test config file(s):
During testing of operator 1.11, I've hit it in multiple cases, both on GKE, where we see it quite often, and in EKS longevity runs (the 12h run and multi-tenancy).
thanks, we are having some architectural discussions about what we could do for v1.12, so this is not forgotten but we have limited bandwidth with everything else going on
I ran experiments on GKE with minReadySeconds on StatefulSets (in short, it's a sleep between updating consecutive nodes) on different setups.
Each setup consisted of 3 unbound loaders sending read requests continuously using the gocql driver with its default config. After the cluster was fully up and running, the test triggered a rolling restart. Traffic was stopped 5s after the restart completed.
setup | requests | failures | success ratio |
---|---|---|---|
ClusterIP TLS minReadySeconds=0 | 360008 | 278367 | 0.226 |
ClusterIP TLS minReadySeconds=60 | 161707 | 18085 | 0.888 |
ClusterIP TLS minReadySeconds=120 | 293230 | 12554 | 0.957 |
PodIP TLS minReadySeconds=0 | 80427 | 78 | 0.999 |
PodIP TLS minReadySeconds=60 | 118925 | 53 | 0.999 |
PodIP TLS minReadySeconds=120 | 244028 | 74 | 0.999 |
The discrepancy between the ClusterIP and PodIP results might suggest that kube-proxy, which provides the ClusterIP overlay, causes most of the failures. To verify whether that's the case, I repeated the tests on GKE with Dataplane V2 (Cilium managed by Google), where kube-proxy is not present.
setup | requests | failures | success ratio |
---|---|---|---|
ClusterIP TLS minReadySeconds=0 | 54367 | 979 | 0.982 |
ClusterIP TLS minReadySeconds=60 | 113112 | 441 | 0.996 |
ClusterIP TLS minReadySeconds=120 | 201340 | 789 | 0.996 |
PodIP TLS minReadySeconds=0 | 61446 | 1701 | 0.972 |
PodIP TLS minReadySeconds=60 | 114669 | 159 | 0.999 |
PodIP TLS minReadySeconds=120 | 198945 | 638 | 0.997 |
The results show the Operator is not able to provide a 100% success rate in any setup, even when minReadySeconds is high. Still, setting it to 60s would help a lot with the default ClusterIP configuration while not influencing bootstrap time that much.
Similar results were observed when traffic was non-TLS.
@zimnx good job with getting all these results. Were you able to find out the mode that kube-proxy is running in in the default scenario? Given the results I'd assume it's iptables but we need to check.
Also, coming back to what the original issue says:
> It happens on GKE and doesn't happen on EKS because in the latter one case a restarted pod gets into a cluster much faster.
It was over a year ago, so it might have changed, but @vponomaryov do you modify the networking stack in EKS anyhow, or is it vanilla?
> It was over a year ago, so it might have changed, but @vponomaryov do you modify the networking stack in EKS anyhow, or is it vanilla?
The `vpc-cni` addon, with the `WARM_ENI_TARGET=0`, `MINIMUM_IP_TARGET=8` and `WARM_IP_TARGET=2` env vars passed to the `aws-node` daemonset.

After my last experiment, I aggregated the same types of errors.
Most of the errors were unavailability errors, and I found out that in setups where the total number of errors is low, almost all errors show up while a node is tearing down.
I looked up the Scylla documentation about how nodes should be torn down, and we have a mismatch. Changing the PreStopHook to `nodetool drain` + `supervisorctl stop scylla` caused Scylla to start printing lots of messages about operations being aborted on other nodes. I found a related issue (https://github.com/scylladb/scylladb/issues/10447); unfortunately the fix is not easy to backport, so it wasn't backported to recent versions. Apparently master is fixed, but when I tried it I wasn't able to restart any node because Scylla got stuck. I tried older versions and found out that 5.0.13 doesn't print these abort messages, and it also fixed the failures happening during node teardown.
On the PodIP setup I was left with only 2-3 EOF failures, which are either a Scylla bug (not gracefully shutting down the connection) or the gocql driver misbehaving. I left these unresolved to proceed further; we can tackle them later.
This unfortunately didn't solve the traffic disruption on ClusterIP setups, where errors showed more than one node being down. That meant there was a split brain in the gossip state, and looking at `nodetool status` on all the nodes confirmed it.
I initially thought that maybe kube-proxy lags and the iptables rules are not updated fast enough, but I ruled that out, as experiments showed the Service ClusterIP mappings in iptables are updated right after the Pod is recreated with a new PodIP.
Scylla keeps 4 connections between each shard and node on port 7000, used for inter-node communication. One of them is used for gossip. These connections are lazily initialized, so there might be fewer than 4 if one was dropped and nothing needed a new one.
If we look at a stable 2-node ClusterIP cluster where each node has 1 shard, we can see the existing connections and their mapping using conntrack. Brief info about the cluster state:
```
pod-0: 10.85.0.28
pod-1: 10.85.0.31
svc-0: 10.106.100.33
svc-1: 10.99.34.233
```
```
tcp 6 431999 ESTABLISHED src=10.85.0.31 dst=10.106.100.33 sport=61507 dport=7000 src=10.85.0.28 dst=10.85.0.31 sport=7000 dport=61507 [ASSURED] mark=0 use=1
tcp 6 431999 ESTABLISHED src=10.85.0.31 dst=10.106.100.33 sport=58256 dport=7000 src=10.85.0.28 dst=10.85.0.31 sport=7000 dport=58256 [ASSURED] mark=0 use=1
tcp 6 431972 ESTABLISHED src=10.85.0.31 dst=10.106.100.33 sport=60959 dport=7000 src=10.85.0.28 dst=10.85.0.31 sport=7000 dport=60959 [ASSURED] mark=0 use=1
tcp 6 431972 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=54112 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=54112 [ASSURED] mark=0 use=1
tcp 6 431999 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=49551 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=49551 [ASSURED] mark=0 use=1
tcp 6 431999 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=55568 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=55568 [ASSURED] mark=0 use=1
tcp 6 431943 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=49506 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=49506 [ASSURED] mark=0 use=1
```
There are 3 connections from pod-1 to svc-0 with the correctly NATed address of pod-0, and 4 connections from pod-0 to svc-1 with the correctly NATed address of pod-1.
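As a sanity check, the per-pair tallies can be recomputed from the conntrack output itself. A throwaway parser (sample lines trimmed from the listing above; on a node the raw output would come from something like `conntrack -L -p tcp --dport 7000`):

```python
# Count ESTABLISHED conntrack entries per (source, ClusterIP) pair from
# conntrack-style output. Sample lines copied (trimmed to the original
# direction only) from the listing above.
from collections import Counter

sample = """\
tcp 6 431999 ESTABLISHED src=10.85.0.31 dst=10.106.100.33 sport=61507 dport=7000
tcp 6 431999 ESTABLISHED src=10.85.0.31 dst=10.106.100.33 sport=58256 dport=7000
tcp 6 431972 ESTABLISHED src=10.85.0.31 dst=10.106.100.33 sport=60959 dport=7000
tcp 6 431972 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=54112 dport=7000
tcp 6 431999 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=49551 dport=7000
tcp 6 431999 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=55568 dport=7000
tcp 6 431943 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=49506 dport=7000
"""

def count_sessions(output):
    """Tally ESTABLISHED sessions per (src, dst) pair."""
    counts = Counter()
    for line in output.splitlines():
        if "ESTABLISHED" not in line:
            continue
        fields = dict(f.split("=", 1) for f in line.split() if "=" in f)
        counts[(fields["src"], fields["dst"])] += 1
    return counts

print(count_sessions(sample))
# 3 sessions pod-1 -> svc-0, 4 sessions pod-0 -> svc-1
```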
When pod-1 is being deleted, conntrack shows multiple attempts where pod-0 tries to reconnect to svc-1 but fails, which is expected.
When pod-1 is recreated with the new IP 10.85.0.32, we can see that there are 3 ongoing attempts to connect to svc-1, but with the old pod-1 IP address, in SYN_SENT state, meaning they are awaiting a SYN-ACK:
```
tcp 6 75 SYN_SENT src=10.85.0.28 dst=10.99.34.233 sport=53454 dport=7000 [UNREPLIED] src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=53454 mark=0 use=1
tcp 6 74 SYN_SENT src=10.85.0.28 dst=10.99.34.233 sport=57715 dport=7000 [UNREPLIED] src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=57715 mark=0 use=1
tcp 6 75 SYN_SENT src=10.85.0.28 dst=10.99.34.233 sport=64317 dport=7000 [UNREPLIED] src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=64317 mark=0 use=1
```
Between the time pod-1 was deleted and the new one was recreated, pod-0 tried to reconnect, but the traffic was blackholed, meaning the SYN was lost and the session needs to time out. At the same time, we can see that the old sessions entered TIME_WAIT state, which is normal, and that pod-1 managed to connect to svc-0:
```
tcp 6 65 TIME_WAIT src=10.85.0.28 dst=10.99.34.233 sport=54112 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=54112 [ASSURED] mark=0 use=2
tcp 6 65 TIME_WAIT src=10.85.0.28 dst=10.99.34.233 sport=49551 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=49551 [ASSURED] mark=0 use=1
tcp 6 65 TIME_WAIT src=10.85.0.28 dst=10.99.34.233 sport=55568 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=55568 [ASSURED] mark=0 use=1
tcp 6 69 TIME_WAIT src=10.85.0.28 dst=10.99.34.233 sport=58760 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=58760 [ASSURED] mark=0 use=1
tcp 6 65 TIME_WAIT src=10.85.0.28 dst=10.99.34.233 sport=49506 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=49506 [ASSURED] mark=0 use=1
tcp 6 69 TIME_WAIT src=10.85.0.31 dst=10.106.100.33 sport=61507 dport=7000 src=10.85.0.28 dst=10.85.0.31 sport=7000 dport=61507 [ASSURED] mark=0 use=1
tcp 6 69 TIME_WAIT src=10.85.0.31 dst=10.106.100.33 sport=58256 dport=7000 src=10.85.0.28 dst=10.85.0.31 sport=7000 dport=58256 [ASSURED] mark=0 use=1
tcp 6 69 TIME_WAIT src=10.85.0.28 dst=10.99.34.233 sport=50906 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=50906 [ASSURED] mark=0 use=1
tcp 6 69 TIME_WAIT src=10.85.0.31 dst=10.106.100.33 sport=60959 dport=7000 src=10.85.0.28 dst=10.85.0.31 sport=7000 dport=60959 [ASSURED] mark=0 use=1
tcp 6 431999 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=62421 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=62421 [ASSURED] mark=0 use=1
tcp 6 431999 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=55614 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=55614 [ASSURED] mark=0 use=1
tcp 6 431998 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=60774 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=60774 [ASSURED] mark=0 use=1
```
After the SYN_SENT sessions expired, there were no sessions from pod-0 to svc-1, only from pod-1 to svc-0:
```
tcp 6 431999 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=62421 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=62421 [ASSURED] mark=0 use=1
tcp 6 431999 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=55614 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=55614 [ASSURED] mark=0 use=1
tcp 6 431999 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=60774 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=60774 [ASSURED] mark=0 use=1
```
Eventually pod-0 tried to reconnect and succeeded with the correct pod-1 IP:
```
tcp 6 431999 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=54055 dport=7000 src=10.85.0.32 dst=10.85.0.28 sport=7000 dport=54055 [ASSURED] mark=0 use=1
tcp 6 431991 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=55146 dport=7000 src=10.85.0.32 dst=10.85.0.28 sport=7000 dport=55146 [ASSURED] mark=0 use=1
tcp 6 431999 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=52300 dport=7000 src=10.85.0.32 dst=10.85.0.28 sport=7000 dport=52300 [ASSURED] mark=0 use=1
tcp 6 431999 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=55614 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=55614 [ASSURED] mark=0 use=1
tcp 6 431991 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=60774 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=60774 [ASSURED] mark=0 use=1
tcp 6 431999 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=62421 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=62421 [ASSURED] mark=0 use=1
```
Looks like the SYN_SENT sessions are the root cause of our gossip split brain, as nodes cannot establish a session to the node being restarted. They report UN, as they are able to talk to the others, while the others are stuck on the connection. Immediately after these stuck sessions expire, new sessions are established and the gossip state synchronizes.
On GKE, SYN packets are retransmitted at most 6 times, meaning it can take 127s from the first SYN to failure, and then Scylla needs to reconnect, which can take another couple of seconds, and then exchange gossip.
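For reference, the 127s follows from the kernel's exponential backoff of SYN retransmissions, assuming the default `tcp_syn_retries=6` and a 1s initial RTO:

```python
# Each SYN retransmission doubles the wait: 1 + 2 + 4 + 8 + 16 + 32 + 64.
# With tcp_syn_retries=6 that is 7 SYNs in total (initial + 6 retries),
# so a connection attempt only fails 127s after the first SYN.
retries = 6
initial_rto = 1  # seconds
total = sum(initial_rto * 2 ** i for i in range(retries + 1))
print(total)  # 127
```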
To solve it we have several options:
@zimnx thanks for diving into it. I wonder how many more times conntrack is going to bite us. :stuck_out_tongue:
> On GKE, maximum SYN packets are retransmissed 6 times, meaning from first SYN to failure it can take 127s
Just to understand it, could you explain the calculation here?
> - remove UNREPLIED conntrack entries for ScyllaCluster Pods. When the next SYN retransmission happens and there's no conntrack entry, the connection attempt fails and Scylla reconnects. This could be a good workaround until a timeout is introduced in Scylla, as we already have privileged node-setup Pods running in the host network on every Scylla node.
Remember that such a workaround would require running with additional Linux capabilities. Looking at the conntrack cleaner daemonset (https://github.com/scylladb/scylla-operator/pull/973/commits/9a63e0115955d3523e3da8faca2f64406db45cb0), which we used in our CI to resolve a similar issue in the past (https://github.com/scylladb/scylla-operator/issues/971), it would require NET_ADMIN.
I also wonder if that wouldn't affect stability of otherwise healthy clusters.
> - Set minReadySeconds to a value high enough to outlast the maximum SYN retransmission. This has a big downside of the cluster bootstrap time being increased.
I recall you mentioned 180s not being a value high enough to completely rid us of errors, which is quite surprising now given the 127s above. Was that caused by a higher syn_sent conntrack timeout?
> - Maybe something else, suggestions are welcome.
It's worth explicitly stating that the issue comes from running kube-proxy, which depends on netfilter's conntrack. From my understanding, at this point we're not sure if this is specific to running kube-proxy in iptables mode, or if it also occurs in ipvs mode. @zimnx have you tried it?
Anyway, an approach that gets rid of the issue without us implementing any workarounds is to advise our users to remove the netfilter dependency if they require exposing their nodes on ClusterIPs. Like you said above, you haven't hit the issue while running on GKE with Dataplane V2 (kube-proxy-less Cilium).
> On GKE, maximum SYN packets are retransmissed 6 times, meaning from first SYN to failure it can take 127s
> Just to understand it, could you explain the calculation here?
https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
```
tcp_syn_retries - INTEGER
    Number of times initial SYNs for an active TCP connection attempt
    will be retransmitted. Should not be higher than 127. Default value
    is 6, which corresponds to 63seconds till the last retransmission
    with the current initial RTO of 1second. With this the final timeout
    for an active TCP connection attempt will happen after 127seconds.
```
> Remember that such workaround would require running with additional linux capabilities
Our nodesetup pods are already running as root, so no extra permissions are required.
> I also wonder if that wouldn't affect stability of otherwise healthy clusters.
Removing only SYN_SENT entries for ClusterIPs and Scylla ports, after a configurable and reasonable timeout, should only cause more reconnection attempts.
> I recall you mentioned 180s not being a value high enough to completely rid us of errors, which is quite surprising now given the 127s above. Was that caused by a higher syn_sent conntrack timeout?
That's what I plan to look into next, maybe there's something else causing disruption.
> It's worth explicitly stating that the issue comes from running kube-proxy depending on netfilter's conntrack. From my understanding at this point we're not sure if this is specific to running kube-proxy in iptables mode, or if that also occurs in ipvs mode. @zimnx have you tried it?
Nope
> Like you said above, you haven't hit the issue while running in GKE with Dataplane V2 (kube-proxy-less Cillium).
I haven't tried it with the different node teardown logic. It's something I want to try out later as well.
> remove UNREPLIED conntrack entries for ScyllaCluster Pods. When next SYN retransmission happens and there's no conntrack entry, connection attempt fails and Scylla reconnects. This could be a good workaround until timeout is introduced in Scylla, as we already have nodesetup privileged Pods pods running in host network on every Scylla node.
Another option would be to lower `net.netfilter.nf_conntrack_tcp_timeout_syn_sent` from the default 120s to a lower value from our sysctls-buddy container. I haven't tested it, but with this the conntrack entry translating the ClusterIP into a stale Pod IP would vanish earlier, and the next TCP retransmission would fail, triggering a reconnection. Although it's a global setting.
Leaving it as a workaround we might consider.
Doesn't work; this setting needs to be changed on the node, not in the container network namespace.
I added setting the conntrack TCP timeout for SYN_SENT entries to our node setup DaemonSet; it solved the availability issues on both ClusterIP and PodIP setups without minReadySeconds.
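For reference, this is essentially the knob in question (a config sketch, not the exact DaemonSet code; it must be applied in the node's network namespace, e.g. from the privileged node setup Pods):

```shell
# Lower the conntrack SYN_SENT timeout so stale ClusterIP -> old Pod IP
# entries expire quickly and the next SYN retransmission triggers a
# fresh lookup and reconnect. Must run on the node, not inside the Pod
# network namespace; 20 is the value used in the experiments below.
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_syn_sent=20
```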
setup | requests | EOF/request timeout | availability failures | success ratio |
---|---|---|---|---|
ClusterIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=20s | 56621 | 4 | 0 | 0.999 |
PodIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=20s | 56411 | 4 | 0 | 0.999 |
We no longer see any availability errors, meaning Scylla rolls out without losing quorum. The EOFs might be a Scylla or gocql bug, not related to the rollout. Timeouts may happen, as the setups run with low, non-guaranteed resources.
Since we found configuration where we no longer observe any availability issues, I verified how different Scylla versions behave.
setup | requests | EOF/request timeout | availability failures | success ratio |
---|---|---|---|---|
ClusterIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=20s | 33926 | 3 | 0 | 0.999 |
PodIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=20s | 33750 | 4 | 0 | 0.999 |
PodIP minReadySeconds=0 Scylla=5.1.18 conntrack_timeout_syn_sent=20s | 31458 | 1 | 19 | 0.999 |
ClusterIP minReadySeconds=0 Scylla=5.1.18 conntrack_timeout_syn_sent=20s | 34919 | 0 | 15 | 0.999 |
ClusterIP minReadySeconds=0 Scylla=5.2.9 conntrack_timeout_syn_sent=20s | 30176 | 2 | 16 | 0.999 |
PodIP minReadySeconds=0 Scylla=5.2.9 conntrack_timeout_syn_sent=20s | 30512 | 3 | 13 | 0.999 |
ClusterIP minReadySeconds=0 Scylla=5.3.0-rc0 conntrack_timeout_syn_sent=20s | 32213 | 1 | 11 | 0.999 |
PodIP minReadySeconds=0 Scylla=5.3.0-rc0 conntrack_timeout_syn_sent=20s | 31847 | 2 | 20 | 0.999 |
PodIP minReadySeconds=0 Scylla=5.4.0-rc0 conntrack_timeout_syn_sent=20s | 40221 | 1 | 5 | 0.999 |
ClusterIP minReadySeconds=0 Scylla=5.4.0-rc0 conntrack_timeout_syn_sent=20s | 40892 | 2 | 7 | 0.999 |
Versions >=5.1 cause request failures while a node is shutting down.
Changing `net.netfilter.nf_conntrack_tcp_timeout_syn_sent` to 60s breaks the ClusterIP scenarios:
setup | requests | EOF/request timeout | availability failures | success ratio |
---|---|---|---|---|
ClusterIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=60s | 35792 | 9 | 0 | 0.999 |
PodIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=60s | 35688 | 2 | 0 | 0.999 |
PodIP minReadySeconds=0 Scylla=5.1.18 conntrack_timeout_syn_sent=60s | 36021 | 1 | 21 | 0.999 |
ClusterIP minReadySeconds=0 Scylla=5.1.18 conntrack_timeout_syn_sent=60s | 37997 | 9 | 2620 | 0.93 |
ClusterIP minReadySeconds=0 Scylla=5.2.9 conntrack_timeout_syn_sent=60s | 36494 | 10 | 3055 | 0.92 |
PodIP minReadySeconds=0 Scylla=5.2.9 conntrack_timeout_syn_sent=60s | 33809 | 3 | 10 | 0.999 |
ClusterIP minReadySeconds=0 Scylla=5.3.0-rc0 conntrack_timeout_syn_sent=60s | 38529 | 12 | 2816 | 0.93 |
PodIP minReadySeconds=0 Scylla=5.3.0-rc0 conntrack_timeout_syn_sent=60s | 40817 | 2 | 9 | 0.999 |
PodIP minReadySeconds=0 Scylla=5.4.0-rc0 conntrack_timeout_syn_sent=60s | 41197 | 1 | 3 | 0.999 |
ClusterIP minReadySeconds=0 Scylla=5.4.0-rc0 conntrack_timeout_syn_sent=60s | 42154 | 12 | 3599 | 0.91 |
Looks like setting `net.netfilter.nf_conntrack_tcp_timeout_syn_sent` to 20s would fix ClusterIP setups, as it would enforce a shorter timeout on rpc_client connection attempts.

Adding a configuration option to Scylla that would allow controlling this timeout would let us get rid of this workaround.

We also need to fix Scylla, as supported versions have a bug causing availability issues when a node is shutting down.
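To make the effect of the sysctl concrete, here is a small back-of-the-envelope model (my own illustration, not code from this thread; it assumes the standard Linux exponential SYN backoff with a 1 s initial RTO): once the conntrack SYN_SENT entry expires, the next retransmitted SYN is treated as a new flow and re-traverses iptables, so it can pick up the updated ClusterIP DNAT mapping.

```python
# Rough model (assumption: standard Linux exponential SYN backoff, 1 s initial
# RTO) of when SYN retransmissions happen, and which of them fall after the
# conntrack SYN_SENT entry has expired -- i.e. which ones re-traverse iptables
# and can pick up an updated ClusterIP DNAT mapping.

def syn_retransmit_times(tcp_syn_retries: int) -> list[int]:
    """Seconds after the initial SYN at which each retransmission is sent."""
    # Retransmissions land at t = 1, 3, 7, 15, ... = 2^k - 1 seconds.
    return [2 ** k - 1 for k in range(1, tcp_syn_retries + 1)]

def retries_after_conntrack_expiry(tcp_syn_retries: int,
                                   conntrack_timeout_s: int) -> list[int]:
    """Retransmissions sent after the SYN_SENT conntrack entry expired."""
    return [t for t in syn_retransmit_times(tcp_syn_retries)
            if t >= conntrack_timeout_s]

print(syn_retransmit_times(6))                 # [1, 3, 7, 15, 31, 63]
print(retries_after_conntrack_expiry(6, 20))   # [31, 63]
print(retries_after_conntrack_expiry(6, 60))   # [63]
```

Under these assumptions, with the 20s timeout two retransmissions still get a fresh DNAT lookup within the default SYN retry budget, while with 60s only the very last one does; that is consistent with the tables above, though the exact mechanism is an assumption on my part.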
@zimnx could we make it an optional feature in node setup? The users would then have a choice between running on PodIPs without any issues, using this workaround, or running a kube-proxy-less cluster, which sounds like reasonably many choices.
Issues filed in Scylla regarding the loss of availability during node shutdown on newer versions and the rpc_client connection timeout: https://github.com/scylladb/scylladb/issues/15899 https://github.com/scylladb/scylladb/issues/15901
Great job drilling down into this!
I don't think we want to get into managing conntrack settings for a subset of platforms in the node setup DS. It seems fragile, potentially affects other workloads, and is quite platform-dependent. A Scylla timeout seems like a more fitting and stable place, even if we have to wait a bit.
Are we sure shutdown is not hung, which causes RPC connections not to close properly, from both sides?
@xemul looked at multiple RPC issues recently; maybe he has some ideas.
Are we sure shutdown is not hung, which causes RPC connections not to close properly, from both sides?
Rollout proceeds; the process dies some time after we drain it, and we kill it without the need for an ungraceful shutdown.
Kostia suggested it might be related to a too short `shutdown-announce-in-ms`, but when I set it to 10 minutes it didn't help. Details and logs from the run are available under https://github.com/scylladb/scylladb/issues/15899
After a couple of days I have doubts whether a timeout on the rpc_client connection would solve the issue. I think it would only postpone fixing the real issue.

The reason why a timeout would help us is that it would give nodes another attempt to connect to the restarted node. A new attempt usually means a new source port, so the session would have a new ip/port tuple, which requires iptables traversal and effectively uses the updated ClusterIP mapping. And this is what fixes the issue we hit with ClusterIP clusters. The problem is that it's not reliable.
We can force a connection timeout via the namespaced `net.ipv4.tcp_syn_retries`, which would mimic the proposed timeout, but only inside the Scylla Pod; it wouldn't help in all these cases by itself. Additionally we could set the non-namespaced `net.netfilter.nf_conntrack_tcp_timeout_syn_sent`, but it would only help with case 3. And on setups where the ClusterIP is NATed not via netfilter but something else, a similar situation with a stale entry in the NAT map may happen. We won't be able to solve it once and for all.
Looks like even with a configurable timeout we wouldn't be able to survive rollouts without availability loss, because UN information from a single node is not enough. We need to be sure every required connection is up both ways. Gossip is exchanged over one connection, and depending on the timing, some connections may be established while others are stuck. If one of the established ones was the gossip connection, we could have UN state agreement across all nodes. Whether a gossip exchange alone is enough to call a node UN, I don't know; worth asking core.
Seems like one of the first ideas, asking multiple nodes about the status of ourselves, would be a good way to solve these availability issues. Together with the recommendation to set additional sysctls, it should make readiness pass quickly enough. I wonder if we will be able to ask the status of a given node using raft.
I wonder if we will be able to ask the status of a given node using raft.
We won't. The status of a node consists of two states: the node state (normal, bootstrapping, joining, leaving) and the failure detector state (up/down). The node state will be consistent across the cluster with raft, but the failure detector state will still be per node.
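A minimal sketch of the "ask other nodes what node X's status is" idea (hypothetical names, not operator code): combine the raft-consistent node state with each peer's local failure-detector verdict and require a quorum of UN answers before declaring the node ready.

```python
from dataclasses import dataclass

# Hypothetical sketch of quorum-based readiness: instead of trusting a node's
# own view of itself ("ask node X what its status is"), collect every peer's
# view of X and require a quorum of UN answers. Gathering the views (e.g. by
# querying each peer's status endpoint) is left out of this sketch.

@dataclass
class PeerView:
    peer: str
    node_state: str   # raft-consistent: "NORMAL", "JOINING", "LEAVING", ...
    is_up: bool       # this peer's local failure-detector verdict

def node_ready_by_quorum(views: list[PeerView], cluster_size: int) -> bool:
    quorum = cluster_size // 2 + 1
    un_votes = sum(1 for v in views if v.node_state == "NORMAL" and v.is_up)
    return un_votes >= quorum

views = [
    PeerView("pod-0", "NORMAL", True),
    PeerView("pod-1", "NORMAL", True),
    PeerView("pod-2", "NORMAL", False),  # this peer still sees the node as down
]
print(node_ready_by_quorum(views, cluster_size=3))  # True: 2 of 3 report UN
```

Note the split matches the comment above: the `node_state` field could be served consistently via raft, while `is_up` necessarily stays a per-peer opinion, which is exactly why aggregating several peers' views is needed.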
The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After a period of inactivity, `lifecycle/stale` is applied
- After a further period of inactivity once `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After a further period of inactivity once `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`

/lifecycle stale
Describe the bug

If we change the number of cores for Scylla (it will trigger a rollout pod restart) then we start facing the following race condition: the first pod gets restarted and, if it joins the cluster with some unexpected delay, the operator doesn't wait for it and restarts the second pod. In a cluster of 3 nodes this makes the cluster inoperable, without quorum, for several minutes. It happens on GKE and doesn't happen on EKS, because in the latter case a restarted pod joins the cluster much faster.
Below are the logs from a run on GKE cluster.
`pod-2` (the first restarted) logs:

`pod-1` logs (second restarted):

So, from the logs above we can see that at the time when `pod-1` (second restarted) got the `shutdown` message, the first one was not ready yet. The `quorum` error that appeared from CDC is the proof of it.

To Reproduce

Steps to reproduce the behavior:
1. Update the `scyllacluster` spec (e.g. change the number of cores) to trigger a rolling restart

Expected behavior

The operator should wait for the previous pod to return to the Scylla cluster before restarting the next one.
Logs

K8S logs: https://cloudius-jenkins-test.s3.amazonaws.com/0512c157-c4d7-4d3f-9b44-2bcce9d34de9/20221027_191640/kubernetes-0512c157.tar.gz
DB logs: https://cloudius-jenkins-test.s3.amazonaws.com/0512c157-c4d7-4d3f-9b44-2bcce9d34de9/20221027_191640/db-cluster-0512c157.tar.gz
Environment: