scylladb / scylla-operator

The Kubernetes Operator for ScyllaDB
https://operator.docs.scylladb.com/
Apache License 2.0

[GKE] Decommission is triggered on a new node when previous decommission was quick #1348

Closed. soyacz closed this issue 1 year ago.

soyacz commented 1 year ago

Issue description

When running the disrupt_terminate_kubernetes_host_then_decommission_and_add_scylla_node nemesis on GKE, after the terminated host is brought back to k8s we trigger a decommission:

< t:2023-08-22 17:04:00,377 f:__init__.py     l:2484 c:sdcm.cluster_k8s     p:DEBUG > Replace `/spec/datacenter/racks/0/members' with `2' in sct-cluster's spec
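
For context, this is done by patching the rack's member count in the ScyllaCluster spec. A roughly equivalent manual command (a sketch only; the namespace, object name, and JSON path come from the logs above, but the exact client call SCT uses is an assumption) would be:

```bash
# Sketch: scale the first rack of sct-cluster down to 2 members.
# SCT itself may apply the change through a different client call.
kubectl --namespace=scylla patch scyllacluster sct-cluster --type=json \
  -p='[{"op": "replace", "path": "/spec/datacenter/racks/0/members", "value": 2}]'
```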

Then we wait for the Scylla pod to vanish, which is quick as the pod still does not exist because of the previous VM termination:

< t:2023-08-22 17:04:00,580 f:local_cmd_runner.py l:62   c:LocalCmdRunner       p:DEBUG > Running command "kubectl --cache-dir=/home/ubuntu/sct-results/20230822-111629-619180/.kube/http-cache --namespace=scylla wait --timeout=21600s --for=delete pod sct-cluster-us-east1-b-us-east1-2"...
< t:2023-08-22 17:04:01,198 f:base.py         l:142  c:LocalCmdRunner       p:DEBUG > Command "kubectl --cache-dir=/home/ubuntu/sct-results/20230822-111629-619180/.kube/http-cache --namespace=scylla wait --timeout=21600s --for=delete pod sct-cluster-us-east1-b-us-east1-2" finished with status 0

Then we add a new node with this:

< t:2023-08-22 17:04:10,664 f:__init__.py     l:2484 c:sdcm.cluster_k8s     p:DEBUG > Replace `/spec/datacenter/racks/0/members' with `3' in sct-cluster's spec

But as soon as the node starts, it gets decommissioned:

< t:2023-08-22 17:06:07,583 f:db_log_reader.py l:114  c:sdcm.db_log_reader   p:DEBUG > INFO  2023-08-22 14:59:21,057 [shard  0] init - Scylla version 2023.1.0-0.20230813.68e9cef1baf7 initialization completed.
< t:2023-08-22 17:06:07,583 f:db_log_reader.py l:114  c:sdcm.db_log_reader   p:DEBUG > INFO  2023-08-22 14:59:23,833 [shard  9] api - decommission
< t:2023-08-22 17:06:07,583 f:db_log_reader.py l:114  c:sdcm.db_log_reader   p:DEBUG > INFO  2023-08-22 14:59:23,839 [shard  0] storage_service - DECOMMISSIONING: starts

Impact

The new node gets decommissioned even though it shouldn't be.

How frequently does it reproduce?

It happened 2 times in a row on a GKE cluster.

Installation details

- Kernel Version: 5.15.0-1038-gke
- Scylla version (or git commit hash): 2023.1.0-20230813.68e9cef1baf7 with build-id c7f9855620b984af24957d7ab0bd8054306d182e
- Operator Image: scylladb/scylla-operator:1.10.0-rc.0
- Operator Helm Version: 1.10.0-rc.0
- Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
- Cluster size: 4 nodes (n1-highmem-16)
- Scylla Nodes used in this run: No resources left at the end of the run
- OS / Image: scylladb/scylladb:2023.1.0 (k8s-gke: undefined_region)
- Test: longevity-scylla-operator-basic-12h-gke
- Test id: a61d77ae-7a9c-4c4f-bd46-8782bf50a6c6
- Test name: scylla-operator/operator-1.10/gke/longevity-scylla-operator-basic-12h-gke
- Test config file(s):

Logs and commands:
- Restore Monitor Stack command: `$ hydra investigate show-monitor a61d77ae-7a9c-4c4f-bd46-8782bf50a6c6`
- Restore monitor on AWS instance using the [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=a61d77ae-7a9c-4c4f-bd46-8782bf50a6c6)
- Show all stored logs command: `$ hydra investigate show-logs a61d77ae-7a9c-4c4f-bd46-8782bf50a6c6`

Logs:
- **kubernetes-a61d77ae.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/a61d77ae-7a9c-4c4f-bd46-8782bf50a6c6/20230823_004446/kubernetes-a61d77ae.tar.gz
- **db-cluster-a61d77ae.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/a61d77ae-7a9c-4c4f-bd46-8782bf50a6c6/20230823_004446/db-cluster-a61d77ae.tar.gz
- **sct-runner-events-a61d77ae.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/a61d77ae-7a9c-4c4f-bd46-8782bf50a6c6/20230823_004446/sct-runner-events-a61d77ae.tar.gz
- **sct-a61d77ae.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/a61d77ae-7a9c-4c4f-bd46-8782bf50a6c6/20230823_004446/sct-a61d77ae.log.tar.gz
- **loader-set-a61d77ae.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/a61d77ae-7a9c-4c4f-bd46-8782bf50a6c6/20230823_004446/loader-set-a61d77ae.tar.gz
- **monitor-set-a61d77ae.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/a61d77ae-7a9c-4c4f-bd46-8782bf50a6c6/20230823_004446/monitor-set-a61d77ae.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-operator/job/operator-1.10/job/gke/job/longevity-scylla-operator-basic-12h-gke/3/)
[Argus](https://argus.scylladb.com/test/051cfffd-f971-4401-ba07-1bf611abc343/runs?additionalRuns[]=a61d77ae-7a9c-4c4f-bd46-8782bf50a6c6)

tnozicka commented 1 year ago

> Then we wait for the Scylla pod to vanish, which is quick as the pod still does not exist because of the previous VM termination:

There are other resources than Pods that need to be removed/adjusted. You should wait for Available=True, Progressing=False, Degraded=False, and status.observedGeneration >= metadata.generation.
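
For illustration, one way to implement that kind of wait with plain kubectl (a sketch based on the condition names above; the commands and timeouts are not the SCT implementation) is:

```bash
# Record the generation of the ScyllaCluster that was just modified.
GEN=$(kubectl --namespace=scylla get scyllacluster sct-cluster -o jsonpath='{.metadata.generation}')

# Wait until the operator reports it has observed that generation.
# (kubectl wait checks for equality; the advice above is observedGeneration >= generation,
# so a retry loop is needed if the spec changes again in the meantime.)
kubectl --namespace=scylla wait scyllacluster/sct-cluster \
  --for=jsonpath='{.status.observedGeneration}'="${GEN}" --timeout=3600s

# Then wait for the rollout conditions to settle.
kubectl --namespace=scylla wait scyllacluster/sct-cluster --for=condition=Available=True --timeout=3600s
kubectl --namespace=scylla wait scyllacluster/sct-cluster --for=condition=Progressing=False --timeout=3600s
kubectl --namespace=scylla wait scyllacluster/sct-cluster --for=condition=Degraded=False --timeout=3600s
```

Only once all of these pass would the test bump members back up to 3.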

vponomaryov commented 1 year ago

@tnozicka Sure, we can wait. But from a logical point of view this is unexpected behavior. We do the following:

1. Reduce the number of replicas in the ScyllaCluster object from 3 to 2. As a user, I expect the 3rd pod to be removed, and if there is nothing to remove, then no decommission to stay scheduled.
2. Increase the replica number from 2 back to 3, and see that the new 3rd pod gets decommissioned.

That makes no sense.

tnozicka commented 1 year ago

My point is that you didn't wait for the scale-down to 2 pods to finish. While you say you desire 3 pods again, the cluster is still in a state where the scale-down to 2 pods has not completed, and it needs to finish that first before it proceeds to your new desired state of 3 pods. I guess in theory it shouldn't create the 3rd pod again too soon, but the important part is whether it eventually reaches the desired state.

From the test it seems your intention is to wait for the scale-down to complete, but you're not doing it correctly. Unless your intention is to trigger the scale-up before the scale-down finishes - is it?

tnozicka commented 1 year ago

To clarify: I don't think the new pod should be created before the decommission completes and the svc is deleted (it's just that you might not have wanted to test this case), so I'll keep this issue open for that case.

@zimnx 2023.1.0-20230813.68e9cef1baf7 should use the new host ID replacement; can you please assess whether this is a regression or an issue that was also present in the previous procedure?

vponomaryov commented 1 year ago

@tnozicka The idea behind our test is to verify that we can decommission a disrupted Scylla pod and then create a replacement for it.

So, yes, the new pod must not be created before the scale-down procedure finishes.

zimnx commented 1 year ago

> 2023.1.0-20230813.68e9cef1baf7 should use the new host ID replacement; can you please assess whether this is a regression or an issue that was also present in the previous procedure?

Decommission does not use the node replacement procedure at all; these are two separate procedures.

tnozicka commented 1 year ago

I mixed it up. Since this wasn't changed, it's unlikely to be a regression. Moving it into the next sprint to investigate why the extra pod appears too early.

> The idea behind our test is to verify that we can decommission a disrupted Scylla pod and then create a replacement for it. So, yes, the new pod must not be created before the scale-down procedure finishes.

Does it eventually get into the correct state?

vponomaryov commented 1 year ago

> Does it eventually get into the correct state?

No.

A snapshot of that pod's state:

INFO  2023-08-22 17:11:46,230 [shard  0] storage_service - DECOMMISSIONING: done
INFO  2023-08-22 17:11:46,999 [shard  0] raft_group_registry - marking Raft server 45b0f678-2d17-402d-8a9f-4909716c7ad6 as dead for raft groups
INFO  2023-08-22 17:11:47,111 [shard  0] raft_group_registry - marking Raft server ae8d49e3-92ed-43f5-9c53-3a57eb564c5b as dead for raft groups
INFO  2023-08-22 17:11:47,146 [shard  0] raft_group_registry - marking Raft server fe3c2f4c-99c9-4906-bbe8-097236702135 as dead for raft groups
I0822 17:11:47.999666       1 sidecar/sync.go:93] "The node is already decommissioned"
E0822 17:11:48.014973       1 sidecar/controller.go:143] syncing key 'scylla/sct-cluster-us-east1-b-us-east1-2' failed: can't sync the HostID annotation: local host ID "ae8d49e3-92ed-43f5-9c53-3a57eb564c5b" not found in IP to hostID mapping: map[10.72.5.81:fe3c2f4c-99c9-4906-bbe8-097236702135 10.72.9.208:45b0f678-2d17-402d-8a9f-4909716c7ad6]
E0822 17:11:48.015801       1 sidecar/controller.go:143] syncing key 'scylla/sct-cluster-us-east1-b-us-east1-2' failed: can't sync the HostID annotation: local host ID "ae8d49e3-92ed-43f5-9c53-3a57eb564c5b" not found in IP to hostID mapping: map[10.72.5.81:fe3c2f4c-99c9-4906-bbe8-097236702135 10.72.9.208:45b0f678-2d17-402d-8a9f-4909716c7ad6]
I0822 17:11:52.756869       1 sidecar/probes.go:122] "readyz probe: node is not ready" Service="scylla/sct-cluster-us-east1-b-us-east1-2"
...
E0823 00:29:16.657373       1 sidecar/controller.go:143] syncing key 'scylla/sct-cluster-us-east1-b-us-east1-2' failed: can't sync the HostID annotation: local host ID "ae8d49e3-92ed-43f5-9c53-3a57eb564c5b" not found in IP to hostID mapping: map[10.72.5.81:fe3c2f4c-99c9-4906-bbe8-097236702135 10.72.9.208:45b0f678-2d17-402d-8a9f-4909716c7ad6]
I0823 00:29:22.755654       1 sidecar/probes.go:122] "readyz probe: node is not ready" Service="scylla/sct-cluster-us-east1-b-us-east1-2"
I0823 00:29:32.755892       1 sidecar/probes.go:122] "readyz probe: node is not ready" Service="scylla/sct-cluster-us-east1-b-us-east1-2"
tnozicka commented 1 year ago

ok, then it sounds like https://github.com/scylladb/scylla-operator/issues/1189

vponomaryov commented 1 year ago

> ok, then it sounds like #1189

Yes, it looks like a duplicate.

tnozicka commented 1 year ago

OK, I initially understood it as misbehaving only in the interim. I'll close this as a duplicate of #1189 and try to see if we can find cycles for it next sprint. Thank you.