yarongilor opened this issue 1 year ago
What's the bug here? What's the user impact?
@yarongilor waiting for the non-voter message is a convenient way to check that decommission has reached a certain state, but the failure is obviously not with raft, but with streaming the data away from the decommissioned node.
The problem, AFAIU, is that on GCE it takes more than 20 minutes for the node to get into this state, while on AWS it takes much less --> https://github.com/scylladb/scylla-cluster-tests/issues/6522#issuecomment-1685271472
@mykaul, I couldn't find evidence that it takes less time on other runs. So I guess, if necessary, I can rerun this same test on AWS.
Let's begin by understanding what is taking the node so long. What is the time being spent on?
@yarongilor this is a new behaviour that @aleksbykov added recently. There are 2 things you should check:
@kbr-scylla - can you please advise? It looks like decommission starts a repair first, and only later does the node become a non-voter. Is this the expected functionality? Would it be better for the test to look for a different log message in this case?
> is it the expected functionality?

Yes.

> would it be better for the test to look for a different log message in this case?

But what is your goal?
Here's a piece of the decommission code:
slogger.info("DECOMMISSIONING: starts");
ctl.req.leaving_nodes = std::list<gms::inet_address>{endpoint};
assert(ss._group0);
bool raft_available = ss._group0->wait_for_raft().get();
try {
// Step 2: Start heartbeat updater
ctl.start_heartbeat_updater(node_ops_cmd::decommission_heartbeat);
// Step 3: Prepare to sync data
ctl.prepare(node_ops_cmd::decommission_prepare).get();
// Step 4: Start to sync data
slogger.info("DECOMMISSIONING: unbootstrap starts");
ss.unbootstrap().get();
on_streaming_finished();
slogger.info("DECOMMISSIONING: unbootstrap done");
// Step 5: Become a group 0 non-voter before leaving the token ring.
//
// Thanks to this, even if we fail after leaving the token ring but before leaving group 0,
// group 0's availability won't be reduced.
if (raft_available) {
slogger.info("decommission[{}]: becoming a group 0 non-voter", uuid);
ss._group0->become_nonvoter().get();
slogger.info("decommission[{}]: became a group 0 non-voter", uuid);
}
// Step 6: Verify that other nodes didn't abort in the meantime.
// See https://github.com/scylladb/scylladb/issues/12989.
ctl.query_pending_op().get();
// Step 7: Leave the token ring
slogger.info("decommission[{}]: leaving token ring", uuid);
ss.leave_ring().get();
left_token_ring = true;
slogger.info("decommission[{}]: left token ring", uuid);
Take one of these messages, depending on what you want to wait for. (`unbootstrap()` is where streaming/repair happens.)
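For reference, waiting for one of these milestones on the test side could look roughly like the sketch below. This is only an illustration, not actual SCT code: the log path, the message list, and the helper name `wait_for_log_message` are assumptions.

```python
# Minimal illustrative sketch (not the actual SCT helper): tail a Scylla log file
# and wait until one of the decommission milestone messages quoted above appears.
# The log path, the message list and the default timeout are assumptions.
import time

DECOMMISSION_MILESTONES = (
    "DECOMMISSIONING: unbootstrap done",
    "became a group 0 non-voter",
    "left token ring",
)

def wait_for_log_message(log_path, patterns=DECOMMISSION_MILESTONES, timeout=3600.0):
    """Poll `log_path` until a line containing one of `patterns` shows up,
    or raise TimeoutError after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    with open(log_path, errors="replace") as log:
        while time.monotonic() < deadline:
            line = log.readline()
            if not line:          # no new log data yet; wait and poll again
                time.sleep(1)
                continue
            for pattern in patterns:
                if pattern in line:
                    return pattern
    raise TimeoutError(f"none of {patterns} appeared within {timeout}s")
```

The only point being made in the comment above is which message to pick; the waiting mechanism itself stays whatever the test already uses.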
Thanks @kbr-scylla. I guess it is OK to cover the point of "became a group 0 non-voter" as well, so the test will try to interrupt by rebooting the node right before/during leaving the token ring. @roydahan, @aleksbykov, if this sounds reasonable, then we can just align the timeout of waiting for this message with the decommission-duration timeout.
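A rough sketch of that proposal is below, assuming a seeded RNG for the step selection (the thread mentions the choice is controlled by `nemesis_seed`) and reusing the decommission-duration timeout for the wait. The names `nemesis_seed`, `decommission_timeout` and the wait helper from the earlier sketch are illustrative, not the actual nemesis code.

```python
# Illustrative only: deterministically pick the decommission step to interrupt on,
# and reuse the decommission-duration timeout when waiting for its log message.
import random

MILESTONE_MESSAGES = [
    "became a group 0 non-voter",
    "leaving token ring",
    "left token ring",
    "Finished token ring movement",
]

def choose_interrupt_point(nemesis_seed: int) -> str:
    """Same seed -> same step, so a failing run can be reproduced."""
    return random.Random(nemesis_seed).choice(MILESTONE_MESSAGES)

# Hypothetical usage, with the wait helper sketched earlier:
# message = choose_interrupt_point(nemesis_seed=1234)
# wait_for_log_message("/var/log/scylla/scylla.log", (message,),
#                      timeout=decommission_timeout)  # aligned with the decommission timeout
```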
@yarongilor please try.
@roydahan , it was tested and ran ok in https://github.com/scylladb/scylla-cluster-tests/pull/6523#issuecomment-1704289301
I moved it back to SCT - not a Scylla issue.
@yarongilor did you make sure that this run you tested was actually waiting for this specific state? I remind you that it's randomly waiting for steps, but controlled by nemesis_seed.
@roydahan, this run indeed tested one of the patterns in the list; I didn't test the other ones.
The full list is: `["became a group 0 non-voter", "leaving token ring", "left token ring", "Finished token ring movement"]`
The test output is like:
```
< t:2023-09-03 13:46:31,082 f:nemesis.py l:3698 c:sdcm.nemesis p:DEBUG > sdcm.nemesis.DecommissionStreamingErrMonkey: Reboot node after log message: 'became a group 0 non-voter'
< t:2023-09-03 13:49:37,632 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 2023-09-03T13:47:12+00:00 decommission-timeout-200k-pks-4d-lo-db-node-2058cc5b-0-3 !INFO | scylla[6882]: [shard 0] raft_group0 - became a non-voter.
< t:2023-09-03 13:49:37,633 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 2023-09-03T13:47:12+00:00 decommission-timeout-200k-pks-4d-lo-db-node-2058cc5b-0-3 !INFO | scylla[6882]: [shard 0] storage_service - decommission[332bb7b0-268b-4e3a-a096-07ba0932242f]: became a group 0 non-voter
< t:2023-09-03 13:58:53,700 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 2023-09-03T13:58:53+00:00 decommission-timeout-200k-pks-4d-lo-db-node-2058cc5b-0-6 !INFO | scylla[7589]: [shard 0] raft_group0 - finish_setup_after_join: became a group 0 voter.
< t:2023-09-03 14:07:58,401 f:nemesis.py l:3698 c:sdcm.nemesis p:DEBUG > sdcm.nemesis.DecommissionStreamingErrMonkey: Reboot node after log message: 'became a group 0 non-voter'
< t:2023-09-03 14:10:41,999 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 2023-09-03T14:08:57+00:00 decommission-timeout-200k-pks-4d-lo-db-node-2058cc5b-0-1 !INFO | scylla[6833]: [shard 0] raft_group0 - became a non-voter.
< t:2023-09-03 14:10:41,999 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 2023-09-03T14:08:57+00:00 decommission-timeout-200k-pks-4d-lo-db-node-2058cc5b-0-1 !INFO | scylla[6833]: [shard 0] storage_service - decommission[d631036f-3ded-4cf7-ba66-cd4492dc89b6]: became a group 0 non-voter
< t:2023-09-03 14:20:13,541 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 2023-09-03T14:20:12+00:00 decommission-timeout-200k-pks-4d-lo-db-node-2058cc5b-0-7 !INFO | scylla[7918]: [shard 0] raft_group0 - finish_setup_after_join: became a group 0 voter.
< t:2023-09-03 14:29:20,811 f:nemesis.py l:3698 c:sdcm.nemesis p:DEBUG > sdcm.nemesis.DecommissionStreamingErrMonkey: Reboot node after log message: 'became a group 0 non-voter'
< t:2023-09-03 14:41:22,384 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 2023-09-03T14:41:22+00:00 decommission-timeout-200k-pks-4d-lo-db-node-2058cc5b-0-8 !INFO | scylla[8014]: [shard 0] raft_group0 - finish_setup_after_join: became a group 0 voter.
< t:2023-09-03 14:49:46,734 f:nemesis.py l:3698 c:sdcm.nemesis p:DEBUG > sdcm.nemesis.DecommissionStreamingErrMonkey: Reboot node after log message: 'became a group 0 non-voter'
< t:2023-09-03 15:02:03,500 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 2023-09-03T15:02:02+00:00 decommission-timeout-200k-pks-4d-lo-db-node-2058cc5b-0-9 !INFO | scylla[7633]: [shard 0] raft_group0 - finish_setup_after_join: became a group 0 voter.
< t:2023-09-03 15:11:05,323 f:nemesis.py l:3698 c:sdcm.nemesis p:DEBUG > sdcm.nemesis.DecommissionStreamingErrMonkey: Reboot node after log message: 'became a group 0 non-voter'
< t:2023-09-03 15:13:06,299 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 2023-09-03T15:12:13+00:00 decommission-timeout-200k-pks-4d-lo-db-node-2058cc5b-0-6 !INFO | scylla[7589]: [shard 0] raft_group0 - became a non-voter.
< t:2023-09-03 15:13:06,299 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 2023-09-03T15:12:13+00:00 decommission-timeout-200k-pks-4d-lo-db-node-2058cc5b-0-6 !INFO | scylla[7589]: [shard 0] storage_service - decommission[4b14b4d6-f84f-4799-9b27-5cde52a15a43]: became a group 0 non-voter
< t:2023-09-03 15:22:50,107 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 2023-09-03T15:22:49+00:00 decommission-timeout-200k-pks-4d-lo-db-node-2058cc5b-0-10 !INFO | scylla[7625]: [shard 0] raft_group0 - finish_setup_after_join: became a group 0 voter.
```
> Addressed in #6522 (https://github.com/scylladb/scylla-cluster-tests/issues/6522)

Not clear.
Issue description
The DecommissionStreamingErr nemesis decommissioned node-3. The test then waits for the node-3 log message 'became a group 0 non-voter'. The test itself failed on a 10-minute wait timeout; it is expected to take less time than that. It actually took node-3 22 minutes to reach the state where this message is printed to the log.
During decommission, node-3 is reported as 'UL':
The log message is received after 22 minutes:
The SCT nemesis code eventually failed with:
And with:
Impact
Not sure if this is the correct decommission flow and whether decommission ran well.
How frequently does it reproduce?
Reproduced.
Installation details
Kernel Version: 5.15.0-1038-gcp
Scylla version (or git commit hash): 5.4.0~dev-20230812.d1d1b6cf6e01 with build-id 6c4f55c26164d6fe2cd25d38f5022795ce696d9c
Cluster size: 5 nodes (n2-highmem-16)
Scylla Nodes used in this run:
OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/scylla-5-4-0-dev-x86-64-2023-08-12t02-57-40 (gce: undefined_region)
Test: longevity-large-partition-200k-pks-4days-gce-test
Test id: 7ca890ce-8673-425e-aed2-e336b4a60d95
Test name: scylla-master/longevity/longevity-large-partition-200k-pks-4days-gce-test
Test config file(s):

Logs and commands
- Restore Monitor Stack command: `$ hydra investigate show-monitor 7ca890ce-8673-425e-aed2-e336b4a60d95`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=7ca890ce-8673-425e-aed2-e336b4a60d95)
- Show all stored logs command: `$ hydra investigate show-logs 7ca890ce-8673-425e-aed2-e336b4a60d95`

## Logs:
- **db-cluster-7ca890ce.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/7ca890ce-8673-425e-aed2-e336b4a60d95/20230812_152740/db-cluster-7ca890ce.tar.gz
- **sct-runner-events-7ca890ce.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/7ca890ce-8673-425e-aed2-e336b4a60d95/20230812_152740/sct-runner-events-7ca890ce.tar.gz
- **sct-7ca890ce.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/7ca890ce-8673-425e-aed2-e336b4a60d95/20230812_152740/sct-7ca890ce.log.tar.gz
- **loader-set-7ca890ce.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/7ca890ce-8673-425e-aed2-e336b4a60d95/20230812_152740/loader-set-7ca890ce.tar.gz
- **monitor-set-7ca890ce.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/7ca890ce-8673-425e-aed2-e336b4a60d95/20230812_152740/monitor-set-7ca890ce.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/longevity/job/longevity-large-partition-200k-pks-4days-gce-test/9/)
[Argus](https://argus.scylladb.com/test/917e825f-11f9-4493-acdb-ec5266a3af78/runs?additionalRuns[]=7ca890ce-8673-425e-aed2-e336b4a60d95)