yarongilor opened this issue 1 year ago
What's the bug here? What's the user impact?
@yarongilor waiting for the non-voter message is a convenient way to check that decommission has reached a certain state, but the failure is obviously not with raft, but with streaming the data away from the decommissioned node.
The problem, AFAIU, is that on GCE it takes more than 20 minutes for the node to get into this state, while on AWS it takes much less --> https://github.com/scylladb/scylla-cluster-tests/issues/6522#issuecomment-1685271472
@mykaul, I couldn't find evidence that it takes less time on other runs. So I guess, if necessary, I can rerun this same test on AWS.
Let's begin by understanding what is taking the node so long. What is the time being spent on?
@yarongilor this is a new behaviour that @aleksbykov added recently. There are 2 things you should check:
@kbr-scylla - can you please advise? It looks like decommission starts a repair first, and only later does the node become a non-voter. Is this the expected functionality? Would it be better for the test to look for a different log message in this case?
> is it the expected functionality?

Yes.

> would it be better for the test to look for a different log message in this case?

But what is your goal?
Here's a piece of the decommission code:
slogger.info("DECOMMISSIONING: starts");
ctl.req.leaving_nodes = std::list<gms::inet_address>{endpoint};
assert(ss._group0);
bool raft_available = ss._group0->wait_for_raft().get();
try {
// Step 2: Start heartbeat updater
ctl.start_heartbeat_updater(node_ops_cmd::decommission_heartbeat);
// Step 3: Prepare to sync data
ctl.prepare(node_ops_cmd::decommission_prepare).get();
// Step 4: Start to sync data
slogger.info("DECOMMISSIONING: unbootstrap starts");
ss.unbootstrap().get();
on_streaming_finished();
slogger.info("DECOMMISSIONING: unbootstrap done");
// Step 5: Become a group 0 non-voter before leaving the token ring.
//
// Thanks to this, even if we fail after leaving the token ring but before leaving group 0,
// group 0's availability won't be reduced.
if (raft_available) {
slogger.info("decommission[{}]: becoming a group 0 non-voter", uuid);
ss._group0->become_nonvoter().get();
slogger.info("decommission[{}]: became a group 0 non-voter", uuid);
}
// Step 6: Verify that other nodes didn't abort in the meantime.
// See https://github.com/scylladb/scylladb/issues/12989.
ctl.query_pending_op().get();
// Step 7: Leave the token ring
slogger.info("decommission[{}]: leaving token ring", uuid);
ss.leave_ring().get();
left_token_ring = true;
slogger.info("decommission[{}]: left token ring", uuid);
Take one of these messages, depending on what you want to wait for. (`unbootstrap()` is where streaming/repair happens.)
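For reference, waiting for one of these milestones on the test side could look roughly like the sketch below. This is only an illustration, not actual SCT code: the log path, the message list, and the helper name `wait_for_log_message` are assumptions.

```python
# Minimal illustrative sketch (not the actual SCT helper): tail a Scylla log file
# and wait until one of the decommission milestone messages quoted above appears.
# The log path, the message list and the default timeout are assumptions.
import time

DECOMMISSION_MILESTONES = (
    "DECOMMISSIONING: unbootstrap done",
    "became a group 0 non-voter",
    "left token ring",
)

def wait_for_log_message(log_path, patterns=DECOMMISSION_MILESTONES, timeout=3600.0):
    """Poll `log_path` until a line containing one of `patterns` shows up,
    or raise TimeoutError after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    with open(log_path, errors="replace") as log:
        while time.monotonic() < deadline:
            line = log.readline()
            if not line:          # no new log data yet; wait and poll again
                time.sleep(1)
                continue
            for pattern in patterns:
                if pattern in line:
                    return pattern
    raise TimeoutError(f"none of {patterns} appeared within {timeout}s")
```

The only point being made in the comment above is which message to pick; the waiting mechanism itself stays whatever the test already uses.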
Thanks @kbr-scylla. I guess it is OK to cover the point of "became a group 0 non-voter" as well, so the test will try to interrupt by rebooting the node right before/during leaving the token ring. @roydahan, @aleksbykov, if this sounds reasonable, then we can just align the timeout of waiting for this message with the decommission-duration timeout.
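A rough sketch of that proposal is below, assuming a seeded RNG for the step selection (the thread mentions the choice is controlled by `nemesis_seed`) and reusing the decommission-duration timeout for the wait. The names `nemesis_seed`, `decommission_timeout` and the wait helper from the earlier sketch are illustrative, not the actual nemesis code.

```python
# Illustrative only: deterministically pick the decommission step to interrupt on,
# and reuse the decommission-duration timeout when waiting for its log message.
import random

MILESTONE_MESSAGES = [
    "became a group 0 non-voter",
    "leaving token ring",
    "left token ring",
    "Finished token ring movement",
]

def choose_interrupt_point(nemesis_seed: int) -> str:
    """Same seed -> same step, so a failing run can be reproduced."""
    return random.Random(nemesis_seed).choice(MILESTONE_MESSAGES)

# Hypothetical usage, with the wait helper sketched earlier:
# message = choose_interrupt_point(nemesis_seed=1234)
# wait_for_log_message("/var/log/scylla/scylla.log", (message,),
#                      timeout=decommission_timeout)  # aligned with the decommission timeout
```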
@yarongilor please try.
@roydahan , it was tested and ran ok in https://github.com/scylladb/scylla-cluster-tests/pull/6523#issuecomment-1704289301
I moved it back to SCT - not a Scylla issue.
@yarongilor did you make sure that this run you tested was actually waiting for this specific state? I remind you that it's randomly waiting for steps, but controlled by nemesis_seed.
@roydahan, this run indeed tested one of the patterns in the list; I didn't test the other ones.
The full list is: `["became a group 0 non-voter", "leaving token ring", "left token ring", "Finished token ring movement"]`
The test output is like:
```
< t:2023-09-03 13:46:31,082 f:nemesis.py l:3698 c:sdcm.nemesis p:DEBUG > sdcm.nemesis.DecommissionStreamingErrMonkey: Reboot node after log message: 'became a group 0 non-voter'
< t:2023-09-03 13:49:37,632 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 2023-09-03T13:47:12+00:00 decommission-timeout-200k-pks-4d-lo-db-node-2058cc5b-0-3 !INFO | scylla[6882]: [shard 0] raft_group0 - became a non-voter.
< t:2023-09-03 13:49:37,633 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 2023-09-03T13:47:12+00:00 decommission-timeout-200k-pks-4d-lo-db-node-2058cc5b-0-3 !INFO | scylla[6882]: [shard 0] storage_service - decommission[332bb7b0-268b-4e3a-a096-07ba0932242f]: became a group 0 non-voter
< t:2023-09-03 13:58:53,700 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 2023-09-03T13:58:53+00:00 decommission-timeout-200k-pks-4d-lo-db-node-2058cc5b-0-6 !INFO | scylla[7589]: [shard 0] raft_group0 - finish_setup_after_join: became a group 0 voter.
< t:2023-09-03 14:07:58,401 f:nemesis.py l:3698 c:sdcm.nemesis p:DEBUG > sdcm.nemesis.DecommissionStreamingErrMonkey: Reboot node after log message: 'became a group 0 non-voter'
< t:2023-09-03 14:10:41,999 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 2023-09-03T14:08:57+00:00 decommission-timeout-200k-pks-4d-lo-db-node-2058cc5b-0-1 !INFO | scylla[6833]: [shard 0] raft_group0 - became a non-voter.
< t:2023-09-03 14:10:41,999 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 2023-09-03T14:08:57+00:00 decommission-timeout-200k-pks-4d-lo-db-node-2058cc5b-0-1 !INFO | scylla[6833]: [shard 0] storage_service - decommission[d631036f-3ded-4cf7-ba66-cd4492dc89b6]: became a group 0 non-voter
< t:2023-09-03 14:20:13,541 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 2023-09-03T14:20:12+00:00 decommission-timeout-200k-pks-4d-lo-db-node-2058cc5b-0-7 !INFO | scylla[7918]: [shard 0] raft_group0 - finish_setup_after_join: became a group 0 voter.
< t:2023-09-03 14:29:20,811 f:nemesis.py l:3698 c:sdcm.nemesis p:DEBUG > sdcm.nemesis.DecommissionStreamingErrMonkey: Reboot node after log message: 'became a group 0 non-voter'
< t:2023-09-03 14:41:22,384 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 2023-09-03T14:41:22+00:00 decommission-timeout-200k-pks-4d-lo-db-node-2058cc5b-0-8 !INFO | scylla[8014]: [shard 0] raft_group0 - finish_setup_after_join: became a group 0 voter.
< t:2023-09-03 14:49:46,734 f:nemesis.py l:3698 c:sdcm.nemesis p:DEBUG > sdcm.nemesis.DecommissionStreamingErrMonkey: Reboot node after log message: 'became a group 0 non-voter'
< t:2023-09-03 15:02:03,500 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 2023-09-03T15:02:02+00:00 decommission-timeout-200k-pks-4d-lo-db-node-2058cc5b-0-9 !INFO | scylla[7633]: [shard 0] raft_group0 - finish_setup_after_join: became a group 0 voter.
< t:2023-09-03 15:11:05,323 f:nemesis.py l:3698 c:sdcm.nemesis p:DEBUG > sdcm.nemesis.DecommissionStreamingErrMonkey: Reboot node after log message: 'became a group 0 non-voter'
< t:2023-09-03 15:13:06,299 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 2023-09-03T15:12:13+00:00 decommission-timeout-200k-pks-4d-lo-db-node-2058cc5b-0-6 !INFO | scylla[7589]: [shard 0] raft_group0 - became a non-voter.
< t:2023-09-03 15:13:06,299 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 2023-09-03T15:12:13+00:00 decommission-timeout-200k-pks-4d-lo-db-node-2058cc5b-0-6 !INFO | scylla[7589]: [shard 0] storage_service - decommission[4b14b4d6-f84f-4799-9b27-5cde52a15a43]: became a group 0 non-voter
< t:2023-09-03 15:22:50,107 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 2023-09-03T15:22:49+00:00 decommission-timeout-200k-pks-4d-lo-db-node-2058cc5b-0-10 !INFO | scylla[7625]: [shard 0] raft_group0 - finish_setup_after_join: became a group 0 voter.
```
> Addressed in #6522 (https://github.com/scylladb/scylla-cluster-tests/issues/6522)

Not clear.
Issue description
The DecommissionStreamingErr nemesis decommissioned node-3. The test then waits for the node-3 log message 'became a group 0 non-voter'. The test itself failed on a 10-minute wait timeout; it is expected to take less time than that. It actually took node-3 22 minutes to reach the state where this message is printed to the log.
During decommission, node-3 is reported as 'UL':
The log message is received after 22 minutes:
The SCT nemesis code eventually failed with:
And with:
Impact
Not sure if this is the correct decommission flow and whether decommission ran well.
How frequently does it reproduce?
Reproduced.
Installation details
Kernel Version: 5.15.0-1038-gcp
Scylla version (or git commit hash): 5.4.0~dev-20230812.d1d1b6cf6e01 with build-id 6c4f55c26164d6fe2cd25d38f5022795ce696d9c
Cluster size: 5 nodes (n2-highmem-16)
Scylla Nodes used in this run:
OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/scylla-5-4-0-dev-x86-64-2023-08-12t02-57-40 (gce: undefined_region)
Test: longevity-large-partition-200k-pks-4days-gce-test
Test id: 7ca890ce-8673-425e-aed2-e336b4a60d95
Test name: scylla-master/longevity/longevity-large-partition-200k-pks-4days-gce-test
Test config file(s):

Logs and commands
- Restore Monitor Stack command: `$ hydra investigate show-monitor 7ca890ce-8673-425e-aed2-e336b4a60d95`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=7ca890ce-8673-425e-aed2-e336b4a60d95)
- Show all stored logs command: `$ hydra investigate show-logs 7ca890ce-8673-425e-aed2-e336b4a60d95`

## Logs:
- **db-cluster-7ca890ce.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/7ca890ce-8673-425e-aed2-e336b4a60d95/20230812_152740/db-cluster-7ca890ce.tar.gz
- **sct-runner-events-7ca890ce.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/7ca890ce-8673-425e-aed2-e336b4a60d95/20230812_152740/sct-runner-events-7ca890ce.tar.gz
- **sct-7ca890ce.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/7ca890ce-8673-425e-aed2-e336b4a60d95/20230812_152740/sct-7ca890ce.log.tar.gz
- **loader-set-7ca890ce.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/7ca890ce-8673-425e-aed2-e336b4a60d95/20230812_152740/loader-set-7ca890ce.tar.gz
- **monitor-set-7ca890ce.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/7ca890ce-8673-425e-aed2-e336b4a60d95/20230812_152740/monitor-set-7ca890ce.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/longevity/job/longevity-large-partition-200k-pks-4days-gce-test/9/)
[Argus](https://argus.scylladb.com/test/917e825f-11f9-4493-acdb-ec5266a3af78/runs?additionalRuns[]=7ca890ce-8673-425e-aed2-e336b4a60d95)