scylladb / scylla-operator

The Kubernetes Operator for ScyllaDB
https://operator.docs.scylladb.com/
Apache License 2.0

During rollout restart operator doesn't wait until previously restarted pod becomes part of a Scylla cluster #1077

Open vponomaryov opened 1 year ago

vponomaryov commented 1 year ago

Describe the bug
If we change the number of cores for Scylla (which triggers a rolling restart of the pods), we start hitting the following race condition: the first pod gets restarted, and if it rejoins the cluster with some unexpected delay, the operator doesn't wait for it and restarts the second pod. In a cluster of 3 nodes this leaves the cluster inoperable, without quorum, for several minutes. It happens on GKE and doesn't happen on EKS because in the latter case a restarted pod gets into the cluster much faster.

Below are the logs from a run on a GKE cluster. pod-2 (the first one restarted) logs:

2022-10-27 18:40:55,710 INFO waiting for scylla to stop
INFO  2022-10-27 18:40:55,710 [shard 0] init - Signal received; shutting down
...
INFO  2022-10-27 18:40:57,897 [shard 0] gossip - Gossip is now stopped
...
INFO  2022-10-27 18:40:59,866 [shard 0] init - Scylla version 5.1.0~rc3-0.20221009.9deeeb4db1cd shutdown complete.
2022-10-27 18:40:59,940 INFO stopped: scylla (exit status 0)
...
I1027 18:41:03.305900       1 operator/sidecar.go:438] "Scylla process finished"
rpc error: code = NotFound desc = an error occurred when try to find container "ef77bed95aa5838282bc5d55420a2718d6b111912fc2954dc645c35c7ce87d3f": not found
I1027 18:41:11.371885       1 operator/sidecar.go:147] sidecar version "v1.8.0-alpha.0-133-g97c831e"
...
I1027 18:41:12.363766       1 operator/sidecar.go:360] "Starting scylla"
...
INFO  2022-10-27 18:41:15,764 [shard 0] database - Resharded 7kB for system.compaction_history in 0.54 seconds, 13kB/s
...
INFO  2022-10-27 18:41:18,583 [shard 0] init - loading system_schema sstables
...
INFO  2022-10-27 18:41:19,311 [shard 0] init - setting up system keyspace
...
INFO  2022-10-27 18:41:21,558 [shard 0] database - Resharded 235MB for keyspace1.standard1 in 2.02 seconds, 116MB/s
...
INFO  2022-10-27 18:41:21,621 [shard 0] storage_service - entering STARTING mode
...
INFO  2022-10-27 18:41:21,700 [shard 0] storage_service - Starting up server gossip
...
E1027 18:41:24.931854       1 sidecar/probes.go:110] "readyz probe: can't get scylla native transport" err="giving up after 1 attempts: agent [HTTP 404] Not found" Service="scylla/sct-cluster-us-east1-b-us-east1-2" Node="10.80.11.2"
...
E1027 18:41:34.932431       1 sidecar/probes.go:110] "readyz probe: can't get scylla native transport" err="giving up after 1 attempts: agent [HTTP 404] Not found" Service="scylla/sct-cluster-us-east1-b-us-east1-2" Node="10.80.11.2"
...
INFO  2022-10-27 18:41:44,256 [shard 0] storage_service - entering JOINING mode
...
INFO  2022-10-27 18:41:44,828 [shard 0] storage_service - Node 10.80.11.2 state jump to normal
E1027 18:41:44.931320       1 sidecar/probes.go:110] "readyz probe: can't get scylla native transport" err="giving up after 1 attempts: agent [HTTP 404] Not found" Service="scylla/sct-cluster-us-east1-b-us-east1-2" Node="10.80.11.2"
INFO  2022-10-27 18:41:45,035 [shard 0] storage_service - entering NORMAL mode
...
INFO  2022-10-27 18:41:57,051 [shard 0] init - Scylla version 5.1.0~rc3-0.20221009.9deeeb4db1cd initialization completed.
WARN  2022-10-27 18:42:45,043 [shard 0] cdc - Could not update CDC description table with generation (2022/10/27 18:24:22, 69265668-5186-4a22-a2b1-6b6bad5a0f55): exceptions::unavailable_exception (Cannot achieve consistency level for cl QUORUM. Requires 2, alive 1). Will try again.
WARN  2022-10-27 18:43:14,796 [shard 0] gossip - Skip marking node 10.80.4.107 with status = shutdown as UP
INFO  2022-10-27 18:43:14,796 [shard 0] gossip - InetAddress 10.80.14.183 is now UP, status = NORMAL
INFO  2022-10-27 18:43:14,798 [shard 0] storage_service - Node 10.80.14.183 state jump to normal
INFO  2022-10-27 18:43:14,801 [shard 0] storage_service - Node 10.80.4.107 state jump to normal
INFO  2022-10-27 18:43:14,950 [shard 0] gossip - InetAddress 10.80.12.113 is now DOWN, status = LEFT
INFO  2022-10-27 18:43:14,952 [shard 0] gossip - Node 10.80.12.113 will be removed from gossip at [2022-10-30 18:31:53]: (expire = 1667154713831435293, now = 1666896194952348069, diff = 258518 seconds)
...
WARN  2022-10-27 18:43:36,752 [shard 0] gossip - Fail to send EchoMessage to 10.80.4.107: seastar::rpc::closed_error (connection is closed)
INFO  2022-10-27 18:43:36,892 [shard 0] gossip - InetAddress 10.80.4.107 is now UP, status = NORMAL

pod-1 logs (second restarted):

...
INFO  2022-10-27 18:32:55,041 [shard 0] gossip - 60000 ms elapsed, 10.80.12.113 gossip quarantine over
INFO  2022-10-27 18:40:55,897 [shard 0] gossip - Got shutdown message from 10.80.11.2, received_generation=1666893961, local_generation=1666893961
INFO  2022-10-27 18:40:55,898 [shard 0] gossip - InetAddress 10.80.11.2 is now DOWN, status = shutdown
2022-10-27 18:42:05,276 INFO waiting for scylla to stop
INFO  2022-10-27 18:42:05,291 [shard 0] compaction_manager - Asked to stop
...

So, from the logs above we can see that at the time pod-1 (the second one restarted) got the shutdown message, the first one was not ready yet. The quorum error from CDC proves it.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy operator
  2. Deploy 3-pod Scylla cluster on GKE
  3. Run some constant load
  4. Change the number of CPUs in the scyllacluster spec (see the sketch after this list)
  5. Wait for the restart of the second pod
  6. See quorum-errors on the loader side.
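
A minimal sketch of step 4, assuming a single-rack cluster named `sct-cluster` in the `scylla` namespace with a rack called `us-east1-b` (names and resource values are illustrative; a JSON merge patch replaces the whole `racks` array, so the complete rack entry has to be supplied):

```sh
# Bumping the rack's CPU count makes the operator roll-restart the pods one by one.
kubectl --namespace=scylla patch scyllacluster sct-cluster --type merge -p '{
  "spec": {
    "datacenter": {
      "racks": [
        {
          "name": "us-east1-b",
          "members": 3,
          "storage": {"capacity": "500Gi"},
          "resources": {
            "requests": {"cpu": "6", "memory": "24Gi"},
            "limits":   {"cpu": "6", "memory": "24Gi"}
          }
        }
      ]
    }
  }
}'
```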

Expected behavior
The operator should wait for the previously restarted pod to rejoin the Scylla cluster before restarting the next one.

Logs
K8S logs: https://cloudius-jenkins-test.s3.amazonaws.com/0512c157-c4d7-4d3f-9b44-2bcce9d34de9/20221027_191640/kubernetes-0512c157.tar.gz
DB logs: https://cloudius-jenkins-test.s3.amazonaws.com/0512c157-c4d7-4d3f-9b44-2bcce9d34de9/20221027_191640/db-cluster-0512c157.tar.gz

Environment:

tnozicka commented 1 year ago

We wait for readiness, I wonder if this is the case where it becomes ready without having a quorum.

fruch commented 1 year ago

@vponomaryov do we have a run of this with scylla-core latest master?

vponomaryov commented 1 year ago
  • 5.1.0-rc3

I didn't run it with master. Only with the specified versions in the bug report.

fruch commented 1 year ago

Can we run it at least once with master version ?

And then we should try having a dtest reproducer of it. our rolling restart process shouldn't be that brittle.

vponomaryov commented 1 year ago

Can we run it at least once with master version ?

And then we should try having a dtest reproducer of it. our rolling restart process shouldn't be that brittle.

Done here: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/valerii/job/vp-longevity-scylla-operator-3h-gke/17/consoleFull

The error is expected, because a Scylla node always becomes UN for itself first and only later for the others. So what we really need to reproduce is slowness like on the GKE platform. For example, this bug does not reproduce on an all-in-one K8S (kind) cluster. So a dtest reproducer will need some tricks to cause slow status updates.

fruch commented 1 year ago

We can slow down traffic between two nodes

vponomaryov commented 1 year ago

The probability of hitting this bug is very high. Our nodetool flush and reshard K8S test, which triggers a rolling restart, constantly suffers from it. The latest failure is here: https://jenkins.scylladb.com/job/scylla-operator/job/operator-master/job/functional/job/functional-k8s-local-kind-aws/25/consoleFull The corresponding nemesis is already being skipped constantly.

mykaul commented 1 year ago

@tnozicka - I'm unsure if our logic shouldn't be converted to 'ask other nodes what node X's status is' instead of 'ask node X what its status is' ...

tnozicka commented 1 year ago

Yeah, currently readiness project ability to talk CONSISTENCY_ONE and we are talking about migrating it project CONSISTENCY_QUORUM.

@rzetelskik can you summarize the update of your discoveries we talked about today?

fruch commented 1 year ago

Yeah, currently readiness project ability to talk CONSISTENCY_ONE and we are talking about migrating it project CONSISTENCY_QUORUM.

@rzetelskik can you summarize the update of your discoveries we talked about today?

@rzetelskik @tnozicka can one of you update on this one, and shed some light on what's "project CONSISTENCY_QUORUM" in this context, cause I don't understand what it means...

rzetelskik commented 1 year ago

@rzetelskik can you summarize the update of your discoveries we talked about today?

Sure. In an effort to reproduce this I've added a simple coroutine to our E2E test suite that just sends read queries periodically with quorum consistency, and it caused the suite to fail quite consistently. So I haven't yet followed up on why this only kept happening on GKE for you guys, since the issue seems more general anyway.

Our readiness probe only checks whether the node considers itself UN. We've hit issues related to it and discussed it many times so that's nothing new. By the time a node declares readiness, the gossip should've settled, although nothing stops it from settling with a split brain, so it's quite easy to hit the following sequence:

  1. Entire 3 node cluster up: {1: {1: UN, 2: UN, 3:UN}, 2: {1: UN, 2: UN, 3: UN}, 3: {1: UN, 2: UN, 3: UN}}
  2. Restart node 3 and wait for it to come back up and declare readiness.
  3. Node 3 is up but not a part of the cluster: {1: {1: UN, 2: UN, 3:DN}, 2: {1: UN, 2: UN, 3: DN}, 3: {1: DN, 2: DN, 3: UN}}
  4. Restart node 2: {1: {1: UN, 2: DN, 3:DN}, 3: {1: DN, 2: DN, 3: UN}}

At this point you'll get errors from queries with quorum consistency regardless of which node you hit.

So we've discussed different approaches to how we could modify the readiness probe or the operator to get closer to what the documentation advises, i.e. all nodes UP before performing a topology change. Essentially we want our readiness probes' semantics to no longer just be "I'm ready to serve traffic" but "I'm ready for a topology change". However we weren't able to just declare readiness when, and only when, a node considers all of its peers UN, since that caused some issues with rollouts.

The current idea suggested by @tnozicka is that each node should ask all of its peers about the statuses of their peers and only declare readiness if all of them consider at least N-1 of their peers UN, where N is the total number of peers.
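
A rough illustration of that heuristic (a sketch only, assuming curl and jq are available, the Scylla REST API is reachable on port 10000, and it exposes `/gossiper/endpoint/live/`; the real check would live in the sidecar's probe code, not in shell):

```sh
# Ask every peer how many nodes it currently sees as live; report "ready for a
# topology change" only if each of them sees at least N-1 live nodes.
PEERS="10.80.11.2 10.80.14.183 10.80.4.107"   # hypothetical peer addresses
N=$(echo "$PEERS" | wc -w)
ready=true
for peer in $PEERS; do
  live=$(curl -sf "http://${peer}:10000/gossiper/endpoint/live/" | jq 'length')
  if [ "${live:-0}" -lt $((N - 1)) ]; then
    ready=false
  fi
done
$ready && echo "ready for topology change" || echo "not ready"
```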

@rzetelskik @tnozicka can one of you update on this one, and shed some light on what's "project CONSISTENCY_QUORUM" in this context, cause I don't understand what it means...

I think what Tomas's message meant is that currently the readiness probe only declares that a given node is able to serve queries with consistency ONE (which is only true if that node replicates the data in question) and at that point we were discussing implementing a heuristic where a node would check whether a quorum of its peers is UN.

soyacz commented 1 year ago

Issue description

It happens not only when changing the number of cores. The same situation happened during a typical rolling restart on GKE. node-2:

INFO  2023-01-29 02:30:46,965 [shard 0] init - Scylla version 5.3.0~dev-0.20230127.5eadea301ecc initialization completed.
WARN  2023-01-29 02:31:34,956 [shard 0] cdc - Could not update CDC description table with generation (2023/01/29 01:40:35, 970df7c8-82ce-403b-b4a9-4804eaa2cee8): exceptions::unavailable_exception (Cannot achieve consistency level for cl QUORUM. Requires 2, alive 1). Will try again.

just before cdc errors node-1 restarted:

2023-01-29 02:30:57,175 INFO waiting for scylla to stop
INFO  2023-01-29 02:30:57,180 [shard 0] compaction_manager - Asked to stop

Impact

Cluster is not operable for CL=QUORUM when RF=3

How frequently does it reproduce?

During this run we had only one RollingRestart nemesis and it failed. Observed on GKE.

Installation details

Kernel Version: 5.15.0-1023-gke
Scylla version (or git commit hash): 5.3.0~dev-20230127.5eadea301ecc with build-id 057196c711d6c52ca6781558c576c858a3cf049b

Operator Image: scylladb/scylla-operator:latest
Operator Helm Version: v1.9.0-alpha.0-41-g04c20d3
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 4 nodes (n1-standard-8)

Scylla Nodes used in this run: No resources left at the end of the run

OS / Image: N/A (k8s-gke: us-east1)

Test: longevity-scylla-operator-3h-gke
Test id: c3397295-e68c-4ce5-bc39-3bd16e034133
Test name: scylla-operator/operator-master/gke/longevity-scylla-operator-3h-gke
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor c3397295-e68c-4ce5-bc39-3bd16e034133`
- Restore monitor on AWS instance using Jenkins job: https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=c3397295-e68c-4ce5-bc39-3bd16e034133
- Show all stored logs command: `$ hydra investigate show-logs c3397295-e68c-4ce5-bc39-3bd16e034133`

Logs:

- db-cluster-c3397295.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/c3397295-e68c-4ce5-bc39-3bd16e034133/20230129_030101/db-cluster-c3397295.tar.gz
- sct-runner-c3397295.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/c3397295-e68c-4ce5-bc39-3bd16e034133/20230129_030101/sct-runner-c3397295.tar.gz
- monitor-set-c3397295.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/c3397295-e68c-4ce5-bc39-3bd16e034133/20230129_030101/monitor-set-c3397295.tar.gz
- loader-set-c3397295.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/c3397295-e68c-4ce5-bc39-3bd16e034133/20230129_030101/loader-set-c3397295.tar.gz
- kubernetes-c3397295.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/c3397295-e68c-4ce5-bc39-3bd16e034133/20230129_030101/kubernetes-c3397295.tar.gz
- parallel-timelines-report-c3397295.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/c3397295-e68c-4ce5-bc39-3bd16e034133/20230129_030101/parallel-timelines-report-c3397295.tar.gz

Jenkins job URL: https://jenkins.scylladb.com/job/scylla-operator/job/operator-master/job/gke/job/longevity-scylla-operator-3h-gke/14/

fruch commented 1 year ago

@soyacz yes, we are aware this can happen during any rolling restart, and for some reason it happens more often on GKE (we don't know the exact reason, probably related to some network configuration)

anyhow we are waiting for #1108 which is supposed to fix the issue

tnozicka commented 1 year ago

we still need to figure out how to fix it the right way in the operator, but this procedure is likely broken in Scylla Cloud and regular Scylla deployments as well, so I've filed https://github.com/scylladb/scylladb/issues/13156

tnozicka commented 1 year ago

@vponomaryov btw. does QA have a test that tries to write to scylla with CL=LOCAL_QUORUM during a rolling restart as for https://docs.scylladb.com/stable/operating-scylla/procedures/config-change/rolling-restart.html ?

fruch commented 1 year ago

@vponomaryov btw. does QA have a test that tries to write to scylla with CL=LOCAL_QUORUM during a rolling restart as for https://docs.scylladb.com/stable/operating-scylla/procedures/config-change/rolling-restart.html ?

when working with one DC, LOCAL_QUORUM=QUORUM, and all our tests write/read with QUORUM

tnozicka commented 1 year ago

good. I am asking because I heard an opinion that CL=LOCAL_QUORUM may not be an invariant of a rolling restart. (Although I don't share that view.)

soyacz commented 11 months ago

Issue persists in v1.10.0-rc.0. What I noticed: when issuing a decommission right after a node starts, the decommission fails with the error below, but only until the gossip data is updated on the node - once that happens, the decommission continues and passes.

E0817 15:45:52.814551 1 sidecar/controller.go:143] syncing key 'scylla/sct-cluster-us-east1-b-us-east1-2' failed: can't decommision a node: can't decommission the node: agent [HTTP 500] std::runtime_error (decommission[3d0d13b7-d1e2-47de-9a24-20560dd6ec71]: Cannot start: nodes={10.48.24.248, 10.48.24.36} needed for decommission operation are down. It is highly recommended to fix the down nodes and try again.)

More details can be found in: Argus- error during decommission

vponomaryov commented 10 months ago

We still hit it. It gets caught too often.

Details

The rolling restart disruption info:

sct-cluster-us-east1-b-us-east1-1  2023-09-03 01:21:04  - 2023-09-03 01:25:19

Trigger of the rolling restart:

2023-09-03 01:21:05,592 f:base.py         l:142  c:LocalCmdRunner       p:DEBUG > Command "kubectl --cache-dir=/home/ubuntu/sct-results/20230902-232320-224619/.kube/http-cache --namespace=scylla patch scyllacluster sct-cluster --type merge -p '{"spec": {"forceRedeploymentReason": "Triggered at 1693704065.4157088"}}'" finished with status 0

node-2 logs:

INFO  2023-09-03 01:17:47,705 [shard 3] compaction - [Compact keyspace1.standard1 af8d8020-49f7-11ee-a42b-c8cedf9ecb3d] Compacted 2 sstables to [/var/lib/scylla/data/keyspace1/standard1-7816a1d049ed11eebba5c5bd07980710/me-3g92_03lk_3nlnk2hvkncqu8cj0d-big-Data.db:level=0]. 75MB to 72MB (~96% of original) in 3089ms = 24MB/s. ~304384 total partitions merged to 290432.
2023-09-03 01:21:06,047 INFO waiting for scylla to stop
INFO  2023-09-03 01:21:06,047 [shard 0] compaction_manager - Asked to stop
    ...
INFO  2023-09-03 01:21:08,527 [shard 0] init - Scylla version 5.4.0~dev-0.20230901.3bdbe620aa1e shutdown complete.
    ...
I0903 01:21:27.813689       1 operator/sidecar.go:147] sidecar version "v1.11.0-alpha.0-58-g6b310ee"
    ...
I0903 01:21:28.502851       1 operator/sidecar.go:360] "Starting scylla"
    ...
INFO  2023-09-03 01:21:44,450 [shard 0] storage_service - entering NORMAL mode
    ...
WARN  2023-09-03 01:21:44,456 [shard 0] cdc - Could not update CDC description table with generation (2023/09/03 01:12:20, 6d6145f2-4783-464e-934c-c8aeabb49da0): exceptions::unavailable_exception (Cannot achieve consistency level for cl QUORUM. Requires 2, alive 1). Will retry in the background.
    ...
INFO  2023-09-03 01:21:56,466 [shard 0] init - Scylla version 5.4.0~dev-0.20230901.3bdbe620aa1e initialization completed.
WARN  2023-09-03 01:22:44,457 [shard 0] cdc - Could not update CDC description table with generation (2023/09/03 01:12:20, 6d6145f2-4783-464e-934c-c8aeabb49da0): exceptions::unavailable_exception (Cannot achieve consistency level for cl QUORUM. Requires 2, alive 1). Will try again.
INFO  2023-09-03 01:23:21,520 [shard 0] storage_service - Node 10.72.1.33 state jump to normal
INFO  2023-09-03 01:23:21,527 [shard 0] storage_service - Node 10.72.2.65 state jump to normal

Node-1 logs:

INFO  2023-09-03 01:19:54,367 [shard 3] compaction - [Compact keyspace1.standard1 f8aa3c80-49f7-11ee-afd5-cea0326f1042] Compacted 2 sstables to [/var/lib/scylla/data/keyspace1/standard1-7816a1d049ed11eebba5c5bd07980710/me-3g92_03oz_1mxww2o9ggflhfx9aa-big-Data.db:level=0]. 262MB to 177MB (~67% of original) in 7092ms = 37MB/s. ~1041152 total partitions merged to 699954.
INFO  2023-09-03 01:21:06,079 [shard 0] gossip - Got shutdown message from 10.72.7.246, received_generation=1693703423, local_generation=1693703423
INFO  2023-09-03 01:21:06,081 [shard 0] gossip - InetAddress 10.72.7.246 is now DOWN, status = shutdown
INFO  2023-09-03 01:21:08,929 [shard 0] raft_group_registry - marking Raft server 47b25347-bd20-45d8-83b5-5916710c7b2e as dead for raft groups
2023-09-03 01:22:03,669 INFO waiting for scylla to stop
INFO  2023-09-03 01:22:03,669 [shard 0] compaction_manager - Asked to stop
INFO  2023-09-03 01:22:03,669 [shard 0] init - Signal received; shutting down
INFO  2023-09-03 01:22:03,669 [shard 0] init - Shutting down view builder ops
INFO  2023-09-03 01:22:03,669 [shard 0] view - Draining view builder
INFO  2023-09-03 01:22:03,669 [shard 0] compaction_manager - Stopped
    ...
INFO  2023-09-03 01:23:24,650 [shard 0] init - Scylla version 5.4.0~dev-0.20230901.3bdbe620aa1e shutdown complete.
    ...
I0903 01:23:42.450382       1 operator/sidecar.go:147] sidecar version "v1.11.0-alpha.0-58-g6b310ee"
    ...
I0903 01:23:43.105871       1 operator/sidecar.go:360] "Starting scylla"

Loader failure:

2023-09-03 01:22:03,888 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:CRITICAL > java.io.IOException: Operation x10 on key(s) [3939333332504e503030]: Error executing: (WriteFailureException): Cassandra failure during write query at consistency QUORUM (2 responses were required but only 1 replica responded, 1 failed)

So, from the logs above we see that node-2 gets restarted first, and then node-1 gets restarted right after the "init - Scylla version 5.4.0~dev-0.20230901.3bdbe620aa1e initialization completed." message for node-2. But it should have been restarted only after "INFO 2023-09-03 01:23:21,527 [shard 0] storage_service - Node 10.72.2.65 state jump to normal".

Installation details

Kernel Version: 5.15.0-1038-gke
Scylla version (or git commit hash): 5.4.0~dev-20230901.3bdbe620aa1e with build-id f44ecb6140c40cc466146304eb075258461d5dd7

Operator Image: scylladb/scylla-operator:latest
Operator Helm Version: v1.11.0-alpha.0-58-g6b310ee
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 4 nodes (n1-standard-8)

Scylla Nodes used in this run: No resources left at the end of the run

OS / Image: scylladb/scylladb:5.4.0-dev-0.20230901.3bdbe620aa1e (k8s-gke: undefined_region)

Test: longevity-scylla-operator-3h-gke
Test id: 47c37575-94c9-4b01-a99d-c41b87360927
Test name: scylla-operator/operator-master/gke/longevity-scylla-operator-3h-gke
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 47c37575-94c9-4b01-a99d-c41b87360927`
- Restore monitor on AWS instance using Jenkins job: https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=47c37575-94c9-4b01-a99d-c41b87360927
- Show all stored logs command: `$ hydra investigate show-logs 47c37575-94c9-4b01-a99d-c41b87360927`

Logs:

- kubernetes-47c37575.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/47c37575-94c9-4b01-a99d-c41b87360927/20230903_013209/kubernetes-47c37575.tar.gz
- db-cluster-47c37575.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/47c37575-94c9-4b01-a99d-c41b87360927/20230903_013209/db-cluster-47c37575.tar.gz
- sct-runner-events-47c37575.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/47c37575-94c9-4b01-a99d-c41b87360927/20230903_013209/sct-runner-events-47c37575.tar.gz
- sct-47c37575.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/47c37575-94c9-4b01-a99d-c41b87360927/20230903_013209/sct-47c37575.log.tar.gz
- loader-set-47c37575.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/47c37575-94c9-4b01-a99d-c41b87360927/20230903_013209/loader-set-47c37575.tar.gz
- monitor-set-47c37575.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/47c37575-94c9-4b01-a99d-c41b87360927/20230903_013209/monitor-set-47c37575.tar.gz
- parallel-timelines-report-47c37575.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/47c37575-94c9-4b01-a99d-c41b87360927/20230903_013209/parallel-timelines-report-47c37575.tar.gz

Jenkins job URL: https://jenkins.scylladb.com/job/scylla-operator/job/operator-master/job/gke/job/longevity-scylla-operator-3h-gke/46/
Argus: https://argus.scylladb.com/test/b9f7805a-73c8-45a1-9846-5826da89bcb9/runs?additionalRuns[]=47c37575-94c9-4b01-a99d-c41b87360927

fruch commented 9 months ago

during testing of operator 1.11, I've hit it in multiple cases, both on GKE where we see it quite often, and in EKS longevity runs (the 12h run, and multi-tenancy)

tnozicka commented 9 months ago

thanks, we are having some architectural discussions about what we could do for v1.12, so this is not forgotten but we have limited bandwidth with everything else going on

zimnx commented 9 months ago

I ran experiments on GKE with minReadySeconds on StatefulSets - in short, it acts as a sleep before updating the next node - on different setups.

Each setup consisted of 3 unbounded loaders continuously sending read requests using the gocql driver with its default config. After the cluster was fully up and running, the test triggered a rolling restart. Traffic was stopped 5s after the restart completed.

| setup | requests | failures | success ratio |
|---|---|---|---|
| ClusterIP TLS minReadySeconds=0 | 360008 | 278367 | 0.226 |
| ClusterIP TLS minReadySeconds=60 | 161707 | 18085 | 0.888 |
| ClusterIP TLS minReadySeconds=120 | 293230 | 12554 | 0.957 |
| PodIP TLS minReadySeconds=0 | 80427 | 78 | 0.999 |
| PodIP TLS minReadySeconds=60 | 118925 | 53 | 0.999 |
| PodIP TLS minReadySeconds=120 | 244028 | 74 | 0.999 |

The discrepancy between the ClusterIP and PodIP results suggests that kube-proxy, which provides the ClusterIP overlay, might cause most of the failures. To verify whether that's the case, I repeated the tests on GKE with Dataplane V2 (Cilium managed by Google), where kube-proxy is not present.

| setup | requests | failures | success ratio |
|---|---|---|---|
| ClusterIP TLS minReadySeconds=0 | 54367 | 979 | 0.982 |
| ClusterIP TLS minReadySeconds=60 | 113112 | 441 | 0.996 |
| ClusterIP TLS minReadySeconds=120 | 201340 | 789 | 0.996 |
| PodIP TLS minReadySeconds=0 | 61446 | 1701 | 0.972 |
| PodIP TLS minReadySeconds=60 | 114669 | 159 | 0.999 |
| PodIP TLS minReadySeconds=120 | 198945 | 638 | 0.997 |

The results show that the Operator is not able to provide a 100% success rate in any setup, even when minReadySeconds is high. Still, setting it to 60s would help a lot on the default ClusterIP configuration while not influencing bootstrap time that much.

Similar results were observed when the traffic was non-TLS.
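
For reference, a hedged sketch of how minReadySeconds can be applied in such an experiment (the StatefulSet name follows the operator's `<cluster>-<dc>-<rack>` convention and is an assumption here; patching the operator-managed StatefulSet directly only makes sense for experiments, as the operator may reconcile it back):

```sh
# Require an updated pod to stay ready for 60s before it counts as available,
# which also delays the rollout of the next pod by the same amount.
kubectl --namespace=scylla patch statefulset sct-cluster-us-east1-b-us-east1 \
  --type merge -p '{"spec": {"minReadySeconds": 60}}'
```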

rzetelskik commented 9 months ago

@zimnx good job with getting all these results. Were you able to find out the mode that kube-proxy is running in in the default scenario? Given the results I'd assume it's iptables but we need to check.

Also, coming back to what the original issue says:

It happens on GKE and doesn't happen on EKS because in the latter case a restarted pod gets into the cluster much faster.

It was over a year ago, so it might have changed, but @vponomaryov do you modify the networking stack in EKS anyhow, or is it vanilla?

vponomaryov commented 9 months ago

It was over a year ago, so it might have changed, but @vponomaryov do you modify the networking stack in EKS anyhow, or is it vanilla?

zimnx commented 9 months ago

After my last experiment, I aggregated the same types of errors.

Most of the errors were unavailability errors, and I found out that in setups where the total number of errors is low, almost all of the errors show up while a node is tearing down.

I looked up the Scylla documentation about how nodes should be torn down, and we have a mismatch. Changing the PreStop hook to nodetool drain + supervisorctl stop scylla caused Scylla to start printing lots of messages about operations being aborted on other nodes. I found a related issue (https://github.com/scylladb/scylladb/issues/10447); unfortunately the fix is not easy to backport, so it wasn't backported to recent versions. Apparently master was fixed, but when I tried it I wasn't able to restart any node because Scylla got stuck. I tried older versions and found out that 5.0.13 doesn't print these abort-operation messages, and it also got rid of the failures happening during node teardown. On the PodIP setup I was left with only 2-3 EOF failures, which are either a Scylla bug (not gracefully shutting down the connection) or the gocql driver misbehaving. I left these unresolved in order to proceed further; we can tackle them later.
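
For clarity, the adjusted PreStop hook boils down to running the documented teardown sequence inside the Scylla container, roughly like this (a sketch; the exact wiring of the hook in the operator may differ):

```sh
# Documented teardown order: drain the node (flush and stop accepting traffic) first,
# then stop the Scylla service managed by supervisord.
nodetool drain
supervisorctl stop scylla
```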

This unfortunately didn't solve the traffic disruption on ClusterIP setups, where the errors indicated more than 1 node being down. That meant there was a split brain in the gossip state; looking at nodetool status from all nodes confirmed it.

I initially thought that maybe kube-proxy lags and iptables are not updated fast enough, but I ruled that out, as the experiments showed that the Service ClusterIP mappings in iptables are updated right after a Pod is recreated with a new PodIP.

Scylla keeps 4 connections between each shard and node on port 7000, used for inter-node communication. One of them is used for gossip. These connections are lazily initialized, so there might be fewer than 4 connections if one was dropped and nothing needed it since.

If we look at a stable 2-node ClusterIP cluster where each node has 1 shard, we can see the existing connections and their mapping using conntrack. Brief info about the cluster state:

pod-0: 10.85.0.28
pod-1: 10.85.0.31

svc-0: 10.106.100.33
svc-1: 10.99.34.233
tcp      6 431999 ESTABLISHED src=10.85.0.31 dst=10.106.100.33 sport=61507 dport=7000 src=10.85.0.28 dst=10.85.0.31 sport=7000 dport=61507 [ASSURED] mark=0 use=1
tcp      6 431999 ESTABLISHED src=10.85.0.31 dst=10.106.100.33 sport=58256 dport=7000 src=10.85.0.28 dst=10.85.0.31 sport=7000 dport=58256 [ASSURED] mark=0 use=1
tcp      6 431972 ESTABLISHED src=10.85.0.31 dst=10.106.100.33 sport=60959 dport=7000 src=10.85.0.28 dst=10.85.0.31 sport=7000 dport=60959 [ASSURED] mark=0 use=1
tcp      6 431972 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=54112 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=54112 [ASSURED] mark=0 use=1 
tcp      6 431999 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=49551 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=49551 [ASSURED] mark=0 use=1 
tcp      6 431999 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=55568 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=55568 [ASSURED] mark=0 use=1 
tcp      6 431943 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=49506 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=49506 [ASSURED] mark=0 use=1 

There are 3 connections from pod-1 to svc-0 with the correctly NATed address of pod-0, and 4 connections from pod-0 to svc-1 with the correctly NATed address of pod-1.
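
The listing above can be obtained with something along these lines (run on the Kubernetes node, which owns the conntrack table for its pods; a sketch assuming the conntrack tool is installed):

```sh
# List tracked TCP sessions towards the inter-node port, including the NATed reply addresses.
conntrack -L -p tcp --dport 7000
```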

When pod-1 is being deleted, conntrack shows multiple attempts where pod-0 tries to reconnect to svc-1 but fails, which is expected. When pod-1 is recreated with the new IP 10.85.0.32, we can see that there are 3 ongoing attempts to connect to svc-1, but with the old pod-1 IP address, in SYN_SENT state, meaning they are awaiting a SYN-ACK:

tcp      6 75 SYN_SENT src=10.85.0.28 dst=10.99.34.233 sport=53454 dport=7000 [UNREPLIED] src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=53454 mark=0 use=1
tcp      6 74 SYN_SENT src=10.85.0.28 dst=10.99.34.233 sport=57715 dport=7000 [UNREPLIED] src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=57715 mark=0 use=1
tcp      6 75 SYN_SENT src=10.85.0.28 dst=10.99.34.233 sport=64317 dport=7000 [UNREPLIED] src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=64317 mark=0 use=1

Between the deletion of pod-1 and the recreation of the new one, pod-0 tried to reconnect but the traffic was blackholed, meaning the SYN was lost and the session has to time out. At the same time, we can see that the old sessions entered the TIME_WAIT state, which is normal, and that the new pod-1 managed to connect to svc-0:

tcp      6 65 TIME_WAIT src=10.85.0.28 dst=10.99.34.233 sport=54112 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=54112 [ASSURED] mark=0 use=2       
tcp      6 65 TIME_WAIT src=10.85.0.28 dst=10.99.34.233 sport=49551 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=49551 [ASSURED] mark=0 use=1       
tcp      6 65 TIME_WAIT src=10.85.0.28 dst=10.99.34.233 sport=55568 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=55568 [ASSURED] mark=0 use=1       
tcp      6 69 TIME_WAIT src=10.85.0.28 dst=10.99.34.233 sport=58760 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=58760 [ASSURED] mark=0 use=1       
tcp      6 65 TIME_WAIT src=10.85.0.28 dst=10.99.34.233 sport=49506 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=49506 [ASSURED] mark=0 use=1       
tcp      6 69 TIME_WAIT src=10.85.0.31 dst=10.106.100.33 sport=61507 dport=7000 src=10.85.0.28 dst=10.85.0.31 sport=7000 dport=61507 [ASSURED] mark=0 use=1     
tcp      6 69 TIME_WAIT src=10.85.0.31 dst=10.106.100.33 sport=58256 dport=7000 src=10.85.0.28 dst=10.85.0.31 sport=7000 dport=58256 [ASSURED] mark=0 use=1      
tcp      6 69 TIME_WAIT src=10.85.0.28 dst=10.99.34.233 sport=50906 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=50906 [ASSURED] mark=0 use=1      
tcp      6 69 TIME_WAIT src=10.85.0.31 dst=10.106.100.33 sport=60959 dport=7000 src=10.85.0.28 dst=10.85.0.31 sport=7000 dport=60959 [ASSURED] mark=0 use=1      
tcp      6 431999 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=62421 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=62421 [ASSURED] mark=0 use=1
tcp      6 431999 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=55614 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=55614 [ASSURED] mark=0 use=1
tcp      6 431998 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=60774 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=60774 [ASSURED] mark=0 use=1

After SYN_SENT sessions expired, there were no sessions from pod-0 to svc-1, only pod-1 to svc-0:

tcp      6 431999 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=62421 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=62421 [ASSURED] mark=0 use=1
tcp      6 431999 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=55614 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=55614 [ASSURED] mark=0 use=1
tcp      6 431999 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=60774 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=60774 [ASSURED] mark=0 use=1

Eventually pod-0 tried to reconnect and succeeded with correct pod-1 IP:

tcp      6 431999 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=54055 dport=7000 src=10.85.0.32 dst=10.85.0.28 sport=7000 dport=54055 [ASSURED] mark=0 use=1 
tcp      6 431991 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=55146 dport=7000 src=10.85.0.32 dst=10.85.0.28 sport=7000 dport=55146 [ASSURED] mark=0 use=1 
tcp      6 431999 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=52300 dport=7000 src=10.85.0.32 dst=10.85.0.28 sport=7000 dport=52300 [ASSURED] mark=0 use=1 
tcp      6 431999 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=55614 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=55614 [ASSURED] mark=0 use=1
tcp      6 431991 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=60774 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=60774 [ASSURED] mark=0 use=1
tcp      6 431999 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=62421 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=62421 [ASSURED] mark=0 use=1

Looks like the SYN_SENT sessions are the root cause of our gossip split brain, as nodes cannot establish a session to the node being restarted. That node reports UN, as it is able to talk to the others, while the others are stuck on their connection attempts. Immediately after these stuck sessions expire, new sessions are established and the gossip state synchronizes.

On GKE, SYN packets are retransmitted at most 6 times, meaning from the first SYN to failure it can take 127s, and then Scylla needs to reconnect, which can take another couple of seconds, and then exchange gossip.

To solve it we have several options:

rzetelskik commented 9 months ago

@zimnx thanks for diving into it. I wonder how many more times conntrack is going to bite us. :stuck_out_tongue:

On GKE, SYN packets are retransmitted at most 6 times, meaning from the first SYN to failure it can take 127s

Just to understand it, could you explain the calculation here?

  • remove UNREPLIED conntrack entries for ScyllaCluster Pods. When next SYN retransmission happens and there's no conntrack entry, connection attempt fails and Scylla reconnects. This could be a good workaround until timeout is introduced in Scylla, as we already have nodesetup privileged Pods running in host network on every Scylla node.

Remember that such workaround would require running with additional linux capabilities. Looking at the conntrack cleaner daemonset (https://github.com/scylladb/scylla-operator/pull/973/commits/9a63e0115955d3523e3da8faca2f64406db45cb0), which we used in our CI to resolve a similar issue in the past (https://github.com/scylladb/scylla-operator/issues/971), it would require NET_ADMIN.

I also wonder if that wouldn't affect stability of otherwise healthy clusters.
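
For illustration, clearing such stuck entries on the node would look roughly like this (a sketch, and as noted it needs NET_ADMIN and host-network access):

```sh
# Drop half-open (SYN_SENT) entries towards the inter-node port so the next SYN
# re-traverses iptables and picks up the updated ClusterIP-to-PodIP mapping.
conntrack -D -p tcp --dport 7000 --state SYN_SENT
```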

  • Set minReadySeconds to a value high enough to outlast the maximum SYN retransmission period. This has the big downside of increasing the cluster bootstrap time.

I recall you mentioned 180s not being a value high enough to completely rid us of errors, which is quite surprising now given the 127s above. Was that caused by a higher syn_sent conntrack timeout?

Maybe something else, suggestions are welcome.

It's worth explicitly stating that the issue comes from running kube-proxy depending on netfilter's conntrack. From my understanding at this point we're not sure if this is specific to running kube-proxy in iptables mode, or if that also occurs in ipvs mode. @zimnx have you tried it?

Anyway, an approach that gets rid of the issue without us implementing any workarounds is to advise our users to remove the netfilter dependency if they require exposing their nodes on ClusterIPs. Like you said above, you haven't hit the issue while running in GKE with Dataplane V2 (kube-proxy-less Cilium).

zimnx commented 9 months ago

On GKE, SYN packets are retransmitted at most 6 times, meaning from the first SYN to failure it can take 127s

Just to understand it, could you explain the calculation here?

https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt

tcp_syn_retries - INTEGER Number of times initial SYNs for an active TCP connection attempt will be retransmitted. Should not be higher than 127. Default value is 6, which corresponds to 63seconds till the last retransmission with the current initial RTO of 1second. With this the final timeout for an active TCP connection attempt will happen after 127seconds.
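
In other words, with an initial RTO of 1 second and exponential backoff, the 6 retransmissions go out 1 + 2 + 4 + 8 + 16 + 32 = 63 seconds after the first SYN, and the kernel then waits one more doubled interval of 64 seconds before giving up, i.e. 63 + 64 = 127 seconds from the first SYN to the connection attempt failing.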


Remember that such workaround would require running with additional linux capabilities

Our nodesetup pods are already running as root, so no extra permissions are required.

I also wonder if that wouldn't affect stability of otherwise healthy clusters.

Removing only the SYN_SENT entries towards ClusterIPs and Scylla ports, after a configurable and reasonable timeout, should only cause additional reconnection attempts.

I recall you mentioned 180s not being a value high enough to completely rid us of errors, which is quite surprising now given the 127s above. Was that caused by a higher syn_sent conntrack timeout?

That's what I plan to look into next, maybe there's something else causing disruption.

It's worth explicitly stating that the issue comes from running kube-proxy depending on netfilter's conntrack. From my understanding at this point we're not sure if this is specific to running kube-proxy in iptables mode, or if that also occurs in ipvs mode. @zimnx have you tried it?

Nope

Like you said above, you haven't hit the issue while running in GKE with Dataplane V2 (kube-proxy-less Cilium).

I haven't tried it with different node teardown logic. It's something I want to try out later as well.

zimnx commented 9 months ago

remove UNREPLIED conntrack entries for ScyllaCluster Pods. When next SYN retransmission happens and there's no conntrack entry, connection attempt fails and Scylla reconnects. This could be a good workaround until timeout is introduced in Scylla, as we already have nodesetup privileged Pods running in host network on every Scylla node.

Another option would be to lower net.netfilter.nf_conntrack_tcp_timeout_syn_sent from the default 120s to a lower value from our sysctls-buddy container. I haven't tested it, but with this the conntrack entry translating the ClusterIP into the stale Pod IP would vanish earlier, and the next TCP retransmission would fail, triggering a reconnection. It is a global setting, though.

Leaving it here as a workaround we might consider.

Doesn't work - this setting needs to be changed on the node, not in the container network namespace.

zimnx commented 9 months ago

I added setting the conntrack TCP timeout for SYN_SENT entries to our node setup DaemonSet; it solved the availability issues on both ClusterIP and PodIP setups without minReadySeconds.

| setup | requests | EOF/request timeout | availability failures | success ratio |
|---|---|---|---|---|
| ClusterIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=20s | 56621 | 4 | 0 | 0.999 |
| PodIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=20s | 56411 | 4 | 0 | 0.999 |

We no longer see any availability errors, meaning Scylla rolls out without losing quorum. The EOFs might be a Scylla or gocql bug, not related to the rollout. Timeouts may happen, as the setups run with low, non-guaranteed resources.
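
For reference, the node-level knob used here boils down to something like the following, executed in the host network namespace (e.g. by the node setup DaemonSet); the 20s value mirrors the experiments above:

```sh
# Expire half-open (SYN_SENT) conntrack entries after 20s instead of the default 120s,
# so a stale ClusterIP-to-old-PodIP mapping disappears before Scylla gives up reconnecting.
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_syn_sent=20
```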

Since we found a configuration where we no longer observe any availability issues, I verified how different Scylla versions behave.

| setup | requests | EOF/request timeout | availability failures | success ratio |
|---|---|---|---|---|
| ClusterIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=20s | 33926 | 3 | 0 | 0.999 |
| PodIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=20s | 33750 | 4 | 0 | 0.999 |
| PodIP minReadySeconds=0 Scylla=5.1.18 conntrack_timeout_syn_sent=20s | 31458 | 1 | 19 | 0.999 |
| ClusterIP minReadySeconds=0 Scylla=5.1.18 conntrack_timeout_syn_sent=20s | 34919 | 0 | 15 | 0.999 |
| ClusterIP minReadySeconds=0 Scylla=5.2.9 conntrack_timeout_syn_sent=20s | 30176 | 2 | 16 | 0.999 |
| PodIP minReadySeconds=0 Scylla=5.2.9 conntrack_timeout_syn_sent=20s | 30512 | 3 | 13 | 0.999 |
| ClusterIP minReadySeconds=0 Scylla=5.3.0-rc0 conntrack_timeout_syn_sent=20s | 32213 | 1 | 11 | 0.999 |
| PodIP minReadySeconds=0 Scylla=5.3.0-rc0 conntrack_timeout_syn_sent=20s | 31847 | 2 | 20 | 0.999 |
| PodIP minReadySeconds=0 Scylla=5.4.0-rc0 conntrack_timeout_syn_sent=20s | 40221 | 1 | 5 | 0.999 |
| ClusterIP minReadySeconds=0 Scylla=5.4.0-rc0 conntrack_timeout_syn_sent=20s | 40892 | 2 | 7 | 0.999 |

Versions >=5.1 cause request failures while a node is shutting down.

Changing net.netfilter.nf_conntrack_tcp_timeout_syn_sent to 60s breaks ClusterIP scenarios:

| setup | requests | EOF/request timeout | availability failures | success ratio |
|---|---|---|---|---|
| ClusterIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=60s | 35792 | 9 | 0 | 0.999 |
| PodIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=60s | 35688 | 2 | 0 | 0.999 |
| PodIP minReadySeconds=0 Scylla=5.1.18 conntrack_timeout_syn_sent=60s | 36021 | 1 | 21 | 0.999 |
| ClusterIP minReadySeconds=0 Scylla=5.1.18 conntrack_timeout_syn_sent=60s | 37997 | 9 | 2620 | 0.93 |
| ClusterIP minReadySeconds=0 Scylla=5.2.9 conntrack_timeout_syn_sent=60s | 36494 | 10 | 3055 | 0.92 |
| PodIP minReadySeconds=0 Scylla=5.2.9 conntrack_timeout_syn_sent=60s | 33809 | 3 | 10 | 0.999 |
| ClusterIP minReadySeconds=0 Scylla=5.3.0-rc0 conntrack_timeout_syn_sent=60s | 38529 | 12 | 2816 | 0.93 |
| PodIP minReadySeconds=0 Scylla=5.3.0-rc0 conntrack_timeout_syn_sent=60s | 40817 | 2 | 9 | 0.999 |
| PodIP minReadySeconds=0 Scylla=5.4.0-rc0 conntrack_timeout_syn_sent=60s | 41197 | 1 | 3 | 0.999 |
| ClusterIP minReadySeconds=0 Scylla=5.4.0-rc0 conntrack_timeout_syn_sent=60s | 42154 | 12 | 3599 | 0.91 |

Looks like setting net.netfilter.nf_conntrack_tcp_timeout_syn_sent to 20s would fix the ClusterIP setups, as it would enforce a shorter timeout on rpc_client connection attempts. Adding a configuration option to Scylla that allows controlling this timeout would let us get rid of this workaround. We also need to fix Scylla, as the supported versions have a bug causing availability issues while a node is shutting down.

rzetelskik commented 9 months ago

@zimnx could we make it an optional feature in node setup? The users would then have a choice between running on PodIPs without any issues, using this workaround, or running a kube-proxy-less cluster, which sounds like reasonably many choices.

zimnx commented 9 months ago

Issues in Scylla regarding the loss of availability during node shutdown on newer versions and the rpc_client connection timeout: https://github.com/scylladb/scylladb/issues/15899 https://github.com/scylladb/scylladb/issues/15901

tnozicka commented 9 months ago

Great job drilling down into this!

I don't think we want to get into managing conntrack settings for a subset of platforms in the node setup DS. It seems fragile, potentially affecting other workloads, and quite platform dependent. A Scylla timeout seems like a more fitting and stable place, even if we have to wait a bit.

mykaul commented 8 months ago

Are we sure shutdown is not hung, which causes RPC connections not to close properly, from both sides?

mykaul commented 8 months ago

@xemul looked at multiple RPC issues recently, maybe he has some ideas.

zimnx commented 8 months ago

Are we sure shutdown is not hung, which causes RPC connections not to close properly, from both sides?

The rollout proceeds; the process dies some time after we drain and kill it, without the need for an ungraceful shutdown. Kostia suggested it might be related to a too short shutdown-announce-in-ms, but setting it to 10 minutes didn't help. Details and logs from the run are available under https://github.com/scylladb/scylladb/issues/15899

zimnx commented 8 months ago

After a couple of days I have doubts whether a timeout on the rpc_client connection would solve the issue. I think it would only postpone fixing the real issue.

The reason why a timeout would help us is that it would give nodes another attempt to connect to the restarted node. A new attempt usually means a new source port, so the session would have a new ip/port tuple, which requires iptables traversal and effectively uses the updated ClusterIP mapping. And this is what fixes the issue we hit with ClusterIP clusters. The problem is that it's not reliable:

  1. If, for some reason, the new attempt isn't successful or takes longer to connect, a timeout on the other nodes won't prevent the node being restarted from passing its readiness check.
  2. If Scylla is able to boot within seconds, a timeout set to a value higher than the boot time won't prevent the restarted node from passing its readiness check.
  3. A new source port is not guaranteed. If it isn't used, nodes would be stuck with the same stale conntrack entry, preventing them from using the updated ClusterIP mapping.
  4. Setting the timeout to too low a value would cause trouble in environments with slow connections between the nodes.

We can force a connection timeout with the namespaced net.ipv4.tcp_syn_retries, which would mimic the proposed timeout only in the Scylla Pod; it wouldn't help in all of these cases by itself. Additionally, we could set the non-namespaced net.netfilter.nf_conntrack_tcp_timeout_syn_sent, but it would only help with case 3. Moreover, on setups where the ClusterIP is NATed not by netfilter but by something else, a similar situation with a stale entry in the NAT map may happen. We won't be able to solve it once and for all.
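
As an illustration of the namespaced option (an assumption-heavy sketch: net.ipv4.tcp_syn_retries is not on the Kubernetes "safe" sysctl list, so it would have to be allowlisted on the kubelet before it can be set through the Pod's securityContext.sysctls):

```sh
# Inside the Scylla Pod's network namespace: retry the SYN only 3 times, so a stuck
# connection attempt fails after ~15s (1+2+4 backoff plus a final 8s wait) instead of 127s,
# prompting Scylla to reconnect much sooner.
sysctl -w net.ipv4.tcp_syn_retries=3
```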

Looks like even with a configurable timeout we wouldn't be able to survive rollouts without availability loss, because UN information from a single node is not enough. We need to be sure every required connection is up both ways. Gossip is exchanged over one connection, and depending on the timing, some connections may be established and some may be stuck. If one of the established ones happens to be the gossip connection, we could have UN state agreement across all nodes while other connections are still stuck - whether a gossip exchange alone is enough to call a node UN, I don't know; worth asking core.

Seems like one of the first ideas - asking multiple nodes about the status of ourselves - would be a good way to solve these availability issues. Together with a recommendation to set additional sysctls, it should make readiness pass quickly enough. I wonder if we will be able to ask the status of a given node using raft.

zimnx commented 8 months ago

I wonder if we will be able to ask the status of a given node using raft.

We won't. The status of a node consists of two parts: the node state (normal, bootstrapping, joining, leaving) and the failure detector state (up/down). The node state will be consistent across the cluster with Raft, but the failure detector state will still be per node.

scylla-operator-bot[bot] commented 3 weeks ago

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

/lifecycle stale