scylladb / scylla-cluster-tests

Tests for Scylla Clusters

`decommission_streaming_err` nemesis times out too early when some `END-around` log message is awaited before rebooting target DB node #8144

Open vponomaryov opened 2 months ago

vponomaryov commented 2 months ago

Issue description

When running the `disrupt_decommission_streaming_err` nemesis, SCT picks one of the DB log messages to await before rebooting the target node.

When one of the `END`-around log messages gets picked, like the following:

2024-07-24 19:52:55,353 f:nemesis.py      l:3873 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.DecommissionStreamingErrMonkey: Reboot node after log message: 'Finished token ring movement'

Then the wait gets assigned a short timeout (80 minutes) and the nemesis fails because of it.
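
Roughly, the failing logic looks like this (a simplified sketch; the helper names are placeholders, not the real `sdcm.nemesis` code - only the 80-minute value and the awaited message come from the run above):

```python
import time

# Simplified sketch of the failure mode; these helpers are placeholders, not
# the real sdcm.nemesis code. Only the 80-minute value and the awaited message
# come from the failing run.
END_AROUND_MESSAGES = ("Finished token ring movement",)
SHORT_TIMEOUT_MIN = 80        # timeout assigned in the failing run
LONG_TIMEOUT_MIN = 6 * 60     # assumed limit for the other messages

def wait_for_log_message(read_log, message, timeout_minutes):
    """Poll a log source until `message` shows up or the timeout expires."""
    deadline = time.monotonic() + timeout_minutes * 60
    while time.monotonic() < deadline:
        if message in read_log():
            return True
        time.sleep(5)
    raise TimeoutError(f"'{message}' not seen within {timeout_minutes} minutes")

def reboot_after_log_message(node, message):
    # An "END"-around message means the whole decommission must finish inside
    # the short window, which a loaded multi-keyspace cluster can exceed even
    # when the operation itself is healthy.
    timeout = SHORT_TIMEOUT_MIN if message in END_AROUND_MESSAGES else LONG_TIMEOUT_MIN
    wait_for_log_message(node.read_system_log, message, timeout)
    node.reboot()
```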

The problem is that the decommission itself was progressing fine; it simply needed more time to reach the awaited step. The target node is the yellow one in the screenshot below:

Screenshot from 2024-07-24 17-36-38

Steps to Reproduce

  1. Run the longevity-multi-keyspaces-with-tablets CI job with the disrupt_decommission_streaming_err nemesis
  2. See error

Expected behavior: the timeout value should be closer to real-life decommission durations.

Actual behavior: the timeout is too small.

Impact

How frequently does it reproduce?

100%

Installation details

SCT Version: master
Scylla version (or git commit hash): master

Logs

soyacz commented 2 months ago

Is this test passing without tablets? To me it looks like a very slow decommission. And another question: is the `Finished token ring movement` log message valid for tablets?

vponomaryov commented 2 months ago

> Is this test passing without tablets?

I didn't run it without tablets.

> To me it looks like a very slow decommission.

The cluster state directly influences the speed of decommission: data size, CPU load, disk load, and so on.

> And another question: is the `Finished token ring movement` log message valid for tablets?

This message is Raft-specific, and it is enabled in this test.

IMHO, 1 hour for a decommission doesn't sound like "too much" for a healthy, busy DB cluster. Do we have defined limits for the decommission operation anywhere?

soyacz commented 2 months ago

> IMHO, 1 hour for a decommission doesn't sound like "too much" for a healthy, busy DB cluster. Do we have defined limits for the decommission operation anywhere?

I just wonder how it passed in the past for so long - it's not that new a nemesis.

Recently, when introducing parallel node operations, I mistakenly set a 2h timeout, and indeed it was too low. I switched to `MAX_TIME_WAIT_FOR_DECOMMISSION`, which is set to 6h.
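
Something along these lines (a simplified sketch, not the exact diff; only `MAX_TIME_WAIT_FOR_DECOMMISSION` and its 6-hour value come from this discussion, the rest is a placeholder):

```python
# Simplified sketch, not the exact change; only MAX_TIME_WAIT_FOR_DECOMMISSION
# and its 6-hour value come from this discussion (units assumed to be seconds).
MAX_TIME_WAIT_FOR_DECOMMISSION = 6 * 60 * 60

def wait_for_decommission_end(node, wait_for_log_message):
    # Before: an ad-hoc 2h timeout, too low for slow-but-healthy decommissions.
    # wait_for_log_message(node, "Finished token ring movement", timeout=2 * 60 * 60)

    # After: reuse the shared decommission limit instead of a hand-picked value.
    wait_for_log_message(node, "Finished token ring movement",
                         timeout=MAX_TIME_WAIT_FOR_DECOMMISSION)
```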

fruch commented 1 month ago

@aleksbykov

please look into this one

fruch commented 1 month ago

> IMHO, 1 hour for a decommission doesn't sound like "too much" for a healthy, busy DB cluster. Do we have defined limits for the decommission operation anywhere?

> I just wonder how it passed in the past for so long - it's not that new a nemesis.

The test case Valeri is running is not used that often, and the refactoring @aleksbykov did on this nemesis wasn't done that long ago. I.e., this nemesis is relatively new, so it's no surprise that 1h isn't enough for all cases.

> Recently, when introducing parallel node operations, I mistakenly set a 2h timeout, and indeed it was too low. I switched to `MAX_TIME_WAIT_FOR_DECOMMISSION`, which is set to 6h.