Open juliayakovlev opened 1 year ago
I'm lost. Did we really have that node up when doing removenode? If it was down, it sounds like a Scylla issue to me.
I talked with @aleksbykov; he is aware of this problem and will send a fix.
I still couldn't understand what SCT was doing wrong here
@aleksbykov can explain it better. It's connected to how fast the reboot is completed
@French, SCT terminates the node a bit later than needed, so the node is back in the cluster by the time it is terminated, and adding the next node then fails. I need to add one more check before termination, or a one-minute timeout to allow the node to leave gossip after termination.
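For illustration, a minimal sketch of the kind of pre-check described above: it polls `nodetool status` from a live node until the terminated node's IP disappears from gossip before the nemesis proceeds. The function name, parameters, and the plain-`nodetool` approach are assumptions for this sketch, not SCT's actual helpers.

```python
import subprocess
import time


def wait_node_left_gossip(alive_node_ip: str, gone_node_ip: str,
                          timeout: int = 300, poll_interval: int = 10) -> None:
    """Poll `nodetool status` on a live node until the terminated node's IP
    no longer appears in the output (i.e. it has left gossip), or time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = subprocess.run(
            ["nodetool", "-h", alive_node_ip, "status"],
            capture_output=True, text=True, check=True,
        ).stdout
        if gone_node_ip not in status:
            return
        time.sleep(poll_interval)
    raise TimeoutError(
        f"{gone_node_ip} is still listed in gossip after {timeout}s")
```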
I'm only french by proxy.
So we had this all along? But Raft now exposes this issue?
LOL :)
@aleksbykov it's good that we will test this race as well, but we need to adjust the code to handle it.
I am working on a fix for it.
@aleksbykov do we have a fix for this one?
@roydahan it will be ready today or tomorrow.
Nemesis: `disrupt_decommission_streaming_err`

Decommissioning of the node `lwt-longevity-multi-dc-24h-2023-1-db-node-0acdc559-9` was interrupted by an instance reboot. Unbootstrap was completed; removing tokens was started but not finished. When the node came back, it had not been removed from Raft or gossip. Despite this, `nodetool removenode` was run for the node. It finished successfully but failed to remove the node from Raft. As a result, adding a new node got stuck because `raft_group0_upgrade` could not resolve the IP of the removed node, which remained in Raft. The addition of the new node was never completed.
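As a rough illustration of the failure mode, the sketch below refuses to run `nodetool removenode` while the target is still reported as a live member, which is the state this test ended up in. The helper names and the `nodetool status` parsing are assumptions for this sketch, not SCT code.

```python
import subprocess
from typing import Optional


def node_state(alive_node_ip: str, target_ip: str) -> Optional[str]:
    """Return the status column (e.g. 'UN', 'DN') that `nodetool status`
    reports for target_ip, or None if the node is not listed at all."""
    out = subprocess.run(
        ["nodetool", "-h", alive_node_ip, "status"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[1] == target_ip:
            return parts[0]
    return None


def guarded_removenode(alive_node_ip: str, target_ip: str, host_id: str) -> None:
    """Run `nodetool removenode <host_id>` only when the target is reported as
    down (DN); running it while the node is back up reproduces this issue."""
    state = node_state(alive_node_ip, target_ip)
    if state != "DN":
        raise RuntimeError(
            f"refusing removenode: {target_ip} is in state {state!r}, expected DN")
    subprocess.run(["nodetool", "-h", alive_node_ip, "removenode", host_id],
                   check=True)
```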
Installation details

Kernel Version: `5.15.0-1035-aws`
Scylla version (or git commit hash): `2023.1.0~rc5-20230429.a47bcb26e42e` with build-id `d2644a8364f13d14d25be6b9d3c69f84612192bd`
Cluster size: 9 nodes (i3.8xlarge)
Scylla Nodes used in this run:
OS / Image: `ami-05e7801837cea47d9`, `ami-0a87efc6a3d4c2f16`, `ami-077246f86cd2ada48` (aws: eu-west-1)

Test: `longevity-lwt-24h-multidc-test`
Test id: `0acdc559-18a9-4125-8fba-d7d525d7c686`
Test name: `enterprise-2023.1/longevity/longevity-lwt-24h-multidc-test`
Test config file(s):

Logs and commands
- Restore Monitor Stack command: `$ hydra investigate show-monitor 0acdc559-18a9-4125-8fba-d7d525d7c686`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=0acdc559-18a9-4125-8fba-d7d525d7c686)
- Show all stored logs command: `$ hydra investigate show-logs 0acdc559-18a9-4125-8fba-d7d525d7c686`

## Logs:

- **db-cluster-0acdc559.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/0acdc559-18a9-4125-8fba-d7d525d7c686/20230516_171823/db-cluster-0acdc559.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0acdc559-18a9-4125-8fba-d7d525d7c686/20230516_171823/db-cluster-0acdc559.tar.gz)
- **sct-runner-events-0acdc559.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/0acdc559-18a9-4125-8fba-d7d525d7c686/20230516_171823/sct-runner-events-0acdc559.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0acdc559-18a9-4125-8fba-d7d525d7c686/20230516_171823/sct-runner-events-0acdc559.tar.gz)
- **sct-0acdc559.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/0acdc559-18a9-4125-8fba-d7d525d7c686/20230516_171823/sct-0acdc559.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0acdc559-18a9-4125-8fba-d7d525d7c686/20230516_171823/sct-0acdc559.log.tar.gz)
- **monitor-set-0acdc559.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/0acdc559-18a9-4125-8fba-d7d525d7c686/20230516_171823/monitor-set-0acdc559.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0acdc559-18a9-4125-8fba-d7d525d7c686/20230516_171823/monitor-set-0acdc559.tar.gz)
- **loader-set-0acdc559.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/0acdc559-18a9-4125-8fba-d7d525d7c686/20230516_171823/loader-set-0acdc559.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0acdc559-18a9-4125-8fba-d7d525d7c686/20230516_171823/loader-set-0acdc559.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/enterprise-2023.1/job/longevity/job/longevity-lwt-24h-multidc-test/5/)