scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

decommission_streaming_err - decommissioned node was resurrected in raft, causing new node addition to fail #6140

Open juliayakovlev opened 1 year ago

juliayakovlev commented 1 year ago

Nemesis disrupt_decommission_streaming_err

Decommissioning of the node lwt-longevity-multi-dc-24h-2023-1-db-node-0acdc559-9 was interrupted by an instance reboot. Unbootstrap had completed:

2023-05-16T10:14:47+00:00 lwt-longevity-multi-dc-24h-2023-1-db-node-0acdc559-4     !INFO | scylla[5907]:  [shard  0] storage_service - DECOMMISSIONING: unbootstrap done

Removing tokens was started but did not finish. When the node came back, it had not been removed from raft and gossip. Despite this, nodetool removenode was run on the node; the command completed but failed to remove the node from raft:

< t:2023-05-16 10:19:11,331 f:__init__.py     l:122  c:sdcm.utils.raft      p:ERROR > Removenode with host_id 8a4f89f3-3928-4fd2-a230-d18e1f21c274 failed with nodetool: Scylla API server HTTP POST to URL '/storage_service/remove_node' failed: std::runtime_error (removenode[e84ec753-c043-46f7-84dd-c18e4fc932bd]: Rejected removenode operation (node=10.4.0.129); the node being removed is alive, maybe you should use decommission instead?)

As a result, adding a new node got stuck because raft_group0_upgrade could not resolve the IP address of the removed node (which remained in raft):

May 16 10:22:47 lwt-longevity-multi-dc-24h-2023-1-db-node-0acdc559-12 scylla[5712]:  [shard  0] init - Scylla version 2023.1.0~rc5-0.20230429.a47bcb26e42e with build-id d2644a8364f13d14d25be6b9d3c69f84612192bd starting ...
May 16 10:22:54 lwt-longevity-multi-dc-24h-2023-1-db-node-0acdc559-12 scylla[5712]:  [shard  0] raft_group_registry - marking Raft server 1e864ba7-f476-4e65-867b-47d4395bb112 as alive for raft groups
May 16 10:22:54 lwt-longevity-multi-dc-24h-2023-1-db-node-0acdc559-12 scylla[5712]:  [shard  0] raft_group_registry - marking Raft server b8178169-43f3-4629-8295-3ce85d5311c0 as alive for raft groups
May 16 10:22:54 lwt-longevity-multi-dc-24h-2023-1-db-node-0acdc559-12 scylla[5712]:  [shard  0] raft_group_registry - marking Raft server accaed5a-cef0-4709-a02d-26f6ae62fd95 as alive for raft groups
May 16 10:22:55 lwt-longevity-multi-dc-24h-2023-1-db-node-0acdc559-12 scylla[5712]:  [shard  0] raft_group0_upgrade - : failed to resolve IP addresses of some of the cluster members ({8a4f89f3-3928-4fd2-a230-d18e1f21c274})
May 16 10:22:55 lwt-longevity-multi-dc-24h-2023-1-db-node-0acdc559-12 scylla[5712]:  [shard  0] raft_group0_upgrade - : sleeping for 2s seconds before retrying...
May 16 10:22:56 lwt-longevity-multi-dc-24h-2023-1-db-node-0acdc559-12 scylla-jmx[5824]: Connecting to http://127.0.0.1:10000
May 16 10:22:56 lwt-longevity-multi-dc-24h-2023-1-db-node-0acdc559-12 scylla-jmx[5824]: Starting the JMX server
May 16 10:22:57 lwt-longevity-multi-dc-24h-2023-1-db-node-0acdc559-12 scylla[5712]:  [shard  0] raft_group0_upgrade - : failed to resolve IP addresses of some of the cluster members ({8a4f89f3-3928-4fd2-a230-d18e1f21c274})

Adding the new node never completed.
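The removenode rejection above suggests the cluster still saw the rebooted node as alive when the test ran removenode. As a rough illustration only (not SCT code; the helper name, the `nodetool status` parsing, and the contact-host argument are assumptions), a pre-check like the following could detect that state before choosing between removenode and decommission:

```python
import subprocess


def node_seen_alive(contact_host: str, target_ip: str) -> bool:
    """Ask a live node whether `target_ip` is still reported as Up in gossip.

    Illustrative sketch: parses `nodetool status` output, where each peer is
    listed as e.g. `UN  10.4.0.129 ...` (Up/Normal) or `DN ...` (Down/Normal).
    """
    out = subprocess.run(
        ["nodetool", "-h", contact_host, "status"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[1] == target_ip:
            return parts[0].startswith("U")  # "UN"/"UJ" => still seen as alive
    return False  # not listed at all => already gone from the ring


# Hypothetical usage in the nemesis flow:
# if node_seen_alive("10.4.0.1", "10.4.0.129"):
#     # removenode would be rejected ("the node being removed is alive"),
#     # so decommission, or wait for the node to leave, instead.
#     ...
```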

Installation details

Kernel Version: 5.15.0-1035-aws
Scylla version (or git commit hash): 2023.1.0~rc5-20230429.a47bcb26e42e with build-id d2644a8364f13d14d25be6b9d3c69f84612192bd

Cluster size: 9 nodes (i3.8xlarge)

Scylla Nodes used in this run:

OS / Image: ami-05e7801837cea47d9 ami-0a87efc6a3d4c2f16 ami-077246f86cd2ada48 (aws: eu-west-1)

Test: longevity-lwt-24h-multidc-test
Test id: 0acdc559-18a9-4125-8fba-d7d525d7c686
Test name: enterprise-2023.1/longevity/longevity-lwt-24h-multidc-test
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 0acdc559-18a9-4125-8fba-d7d525d7c686`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=0acdc559-18a9-4125-8fba-d7d525d7c686)
- Show all stored logs command: `$ hydra investigate show-logs 0acdc559-18a9-4125-8fba-d7d525d7c686`

Logs:

- **db-cluster-0acdc559.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/0acdc559-18a9-4125-8fba-d7d525d7c686/20230516_171823/db-cluster-0acdc559.tar.gz
- **sct-runner-events-0acdc559.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/0acdc559-18a9-4125-8fba-d7d525d7c686/20230516_171823/sct-runner-events-0acdc559.tar.gz
- **sct-0acdc559.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/0acdc559-18a9-4125-8fba-d7d525d7c686/20230516_171823/sct-0acdc559.log.tar.gz
- **monitor-set-0acdc559.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/0acdc559-18a9-4125-8fba-d7d525d7c686/20230516_171823/monitor-set-0acdc559.tar.gz
- **loader-set-0acdc559.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/0acdc559-18a9-4125-8fba-d7d525d7c686/20230516_171823/loader-set-0acdc559.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/enterprise-2023.1/job/longevity/job/longevity-lwt-24h-multidc-test/5/)
fruch commented 1 year ago

I'm lost. Did we really have that node up when doing removenode? If it was down, this sounds like a Scylla issue to me.

juliayakovlev commented 1 year ago

I talked with @aleksbykov; he is aware of this problem and will send a fix.

fruch commented 1 year ago

I talked with @aleksbykov; he is aware of this problem and will send a fix.

I still can't understand what SCT was doing wrong here.

juliayakovlev commented 1 year ago

@aleksbykov can explain it better. It's related to how quickly the reboot completes.

aleksbykov commented 1 year ago

@French, SCT terminates the node a bit later than needed, so the node is back in the cluster by the time it is terminated, and the subsequent node addition fails. I need to add one more check before termination, or add a one-minute timeout to allow the node to be removed from gossip after termination.
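A rough sketch of the kind of wait described here (hypothetical helper names, not the actual SCT fix): after termination, poll the remaining live nodes until the terminated instance no longer appears in their view of the ring, with a timeout, before the test adds a new node.

```python
import subprocess
import time


def still_in_ring(contact_host: str, target_ip: str) -> bool:
    """Illustrative check: does `contact_host` still list `target_ip`
    anywhere in its `nodetool status` output (alive or not)?"""
    out = subprocess.run(
        ["nodetool", "-h", contact_host, "status"],
        capture_output=True, text=True, check=True,
    ).stdout
    return any(target_ip in line.split() for line in out.splitlines())


def wait_node_gone(live_node_ips, target_ip, timeout=600, poll=10):
    """Poll until no live node reports `target_ip` any more, or time out."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if not any(still_in_ring(ip, target_ip) for ip in live_node_ips):
            return
        time.sleep(poll)
    raise TimeoutError(
        f"{target_ip} still appears in the ring after {timeout}s; "
        "adding a new node now risks the raft_group0_upgrade hang"
    )
```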

fruch commented 1 year ago

@French, SCT terminates the node a bit later than needed, so the node is back in the cluster by the time it is terminated, and the subsequent node addition fails. I need to add one more check before termination, or add a one-minute timeout to allow the node to be removed from gossip after termination.

I'm only french by proxy.

So we had this all along? But raft now exposes this issue?

roydahan commented 1 year ago

LOL :)

roydahan commented 1 year ago

@aleksbykov it's good that we will test this race as well, but we need to adjust the code to handle it.

aleksbykov commented 1 year ago

I am working on a fix for it.

roydahan commented 1 year ago

@aleksbykov do we have a fix for this one?

aleksbykov commented 1 year ago

@roydahan it will be ready today or tomorrow.