Open vponomaryov opened 2 months ago
Is this test passing without tablets?
For me it looks like a very slow decommission.
And the question: is the `Finished token ring movement` log message valid for tablets?
> Is this test passing without tablets?

I didn't run it without tablets.

> For me it looks like a very slow decommission.

Cluster state directly influences the speed of decommission. Data size, CPU load, disk load and so on...

> And the question, is `Finished token ring movement` log message valid for tablets?

This message is raft-specific and it is enabled in this test.
IMHO, `1 hour` for decommission doesn't sound like "too much" for a healthy, busy DB cluster.
Do we have limits defined anywhere for the decommission operation?
> IMHO, `1 hour` for decommission doesn't sound as something "too much" for a healthy busy DB cluster. Do we have somewhere the limits for the decommission operation?

I just wonder how it passed for so long in the past; it's not that new a nemesis.
Recently, by mistake, I set a 2h timeout when introducing parallel node operations, and indeed it was too low.
Switched to `MAX_TIME_WAIT_FOR_DECOMMISSION`, which is set to 6h.
@aleksbykov please look into this one
> > IMHO, `1 hour` for decommission doesn't sound as something "too much" for a healthy busy DB cluster. Do we have somewhere the limits for the decommission operation?
>
> I just wonder how it passed in the past for so long - it's not that new nemesis.

The test case Valeri is running is not used that often, and the refactoring @aleksbykov did to this nemesis wasn't done that long ago. I.e., this nemesis is relatively new, so it's no surprise that 1h isn't enough for all cases.

> Recently by mistake I did 2h timeout when introducing parallel nodes operations and indeed it was too low. Switched to `MAX_TIME_WAIT_FOR_DECOMMISSION` which is set to 6h.
Issue description
When running the `disrupt_decommission_streaming_err` nemesis, SCT picks one of the DB log messages to wait for before rebooting the target node.
And when some `END`-around command gets picked up like the following:
Then it gets assigned a short timeout (`80m`) and fails because of it.
The problem is that the decommission was proceeding fine; it just needed more time to reach the required step. The target node is yellow on the screenshot below:
Steps to Reproduce
Run the `longevity-multi-keyspaces-with-tablets` CI job with the `disrupt_decommission_streaming_err` nemesis.
Expected behavior: the timeout value should be closer to real-life decommission times.
Actual behavior: the timeout is too small.
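One possible way to express the expected behavior, as a minimal sketch only: choose the wait timeout based on how late in the decommission the awaited log message is expected to appear. This is not SCT's actual code. The `LogTrigger` class, its `late_stage` flag, the `timeout_for` helper, and the `SHORT_TIMEOUT` name are hypothetical. The 6h value is taken from the comment about `MAX_TIME_WAIT_FOR_DECOMMISSION`, and treating `Finished token ring movement` as a late-stage message is an assumption.

```python
from dataclasses import dataclass

# Assumption: 6h, as quoted in the thread for MAX_TIME_WAIT_FOR_DECOMMISSION.
MAX_TIME_WAIT_FOR_DECOMMISSION = 6 * 60 * 60  # seconds
# The short timeout from the issue description (80m); the name is hypothetical.
SHORT_TIMEOUT = 80 * 60  # seconds


@dataclass(frozen=True)
class LogTrigger:
    """Hypothetical description of a DB log message the nemesis waits for."""
    pattern: str
    late_stage: bool  # True if the message is expected near the end of decommission


def timeout_for(trigger: LogTrigger) -> int:
    """Return the wait timeout in seconds for the given trigger.

    Late-stage messages get the full decommission budget, because the node
    may legitimately need hours to reach that step on a busy cluster.
    """
    if trigger.late_stage:
        return MAX_TIME_WAIT_FOR_DECOMMISSION
    return SHORT_TIMEOUT


# Assumption: this message is treated as late-stage purely for illustration.
end_trigger = LogTrigger("Finished token ring movement", late_stage=True)
early_trigger = LogTrigger("hypothetical early-stage message", late_stage=False)
```

With this split, only messages that can appear early keep the short timeout, while messages near the end of the operation are bounded by the same limit as the decommission itself.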
Impact
How frequently does it reproduce?
100%
Installation details
SCT Version: master
Scylla version (or git commit hash): master
Logs