Open vponomaryov opened 2 years ago
The problem with the hanging is that the wait_for_init method waits, by default, 6 hours for a new node to initialize. So, the solution here is to introduce an additional check that catches problems on the go...
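A minimal sketch of such an on-the-go check, polling the node's log for a fatal startup error instead of blocking for the full timeout. Note that node.read_system_log and node.is_initialized are hypothetical stand-ins here, not the real SCT API:

```python
import re
import time

# Fatal startup marker, taken from the "init - Startup failed" log line below.
STARTUP_FAILED = re.compile(r"init - Startup failed")

def wait_for_init_with_check(node, timeout=6 * 3600, poll=30):
    """Wait for node init, but abort early if the node logged a startup failure."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        log = node.read_system_log()  # hypothetical accessor for the node's log
        if STARTUP_FAILED.search(log):
            # Fail fast instead of silently waiting out the remaining hours.
            raise RuntimeError("node failed to start; aborting wait early")
        if node.is_initialized():     # hypothetical readiness check
            return
        time.sleep(poll)
    raise TimeoutError("node did not initialize within %ss" % timeout)
```

The key design point is that a startup failure is terminal: no amount of further waiting will bring the node up, so the waiter should raise immediately rather than run down the clock.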
According to the logs, I see that there is this error at node start:
ERR | [shard 0] init - Startup failed: exceptions::mutation_write_timeout_exception (Operation timed out for system_distributed_everywhere.cdc_generation_descriptions_v2 - received only 5 responses from 6 CL=ALL.)
But this error is converted to a WARNING by the logic from https://github.com/scylladb/scylla-cluster-tests/pull/3521/files:
06:38:38 < t:2022-02-20 05:38:38,562 f:file_logger.py l:89 c:sdcm.sct_events.file_logger p:INFO > 2022-02-20 05:38:38.558 <2022-02-20 05:38:38.000>: (DatabaseLogEvent Severity.WARNING) period_type=one-time event_id=3b16a216-68f6-4470-8cc2-0addbd2ccb35: type=SYSTEM_PAXOS_TIMEOUT regex=(mutation_write_|Operation timed out for system.paxos|Operation failed for system.paxos) line_number=14762 node=longevity-100gb-4h-master-db-node-490f8141-7
After this error the test just hangs for 8h (MAX_TIME_WAIT_FOR_NEW_NODE_UP: int = HOUR_IN_SEC * 8).
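The downgrade happens because the SYSTEM_PAXOS_TIMEOUT regex shown in the event above is broad enough to match this startup failure. A small check demonstrates it, using the regex and the log line quoted above verbatim:

```python
import re

# The SYSTEM_PAXOS_TIMEOUT downgrade pattern from the DatabaseLogEvent above.
pattern = re.compile(
    r"(mutation_write_|Operation timed out for system.paxos|"
    r"Operation failed for system.paxos)"
)

# The fatal startup error from the node's log.
startup_failure = (
    "ERR | [shard 0] init - Startup failed: "
    "exceptions::mutation_write_timeout_exception "
    "(Operation timed out for system_distributed_everywhere."
    "cdc_generation_descriptions_v2 - received only 5 responses from 6 CL=ALL.)"
)

# The bare "mutation_write_" alternative matches the startup failure too,
# so the fatal error is demoted to a WARNING and never stops the test.
assert pattern.search(startup_failure) is not None
```

It is the first alternative, mutation_write_, that fires here; the error has nothing to do with system.paxos at all.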
@juliayakovlev, could you please remind me why this change was introduced? What problem did it solve? Can you suggest a solution for the current issue?
Even if this was at error level, we wouldn't stop the test because of it, and it would still wait for the whole 8h.
It was downgraded to a warning because a lot of LWT-related code paths ended up with timeouts in system.paxos-related writes, and it was agreed to ignore them. .*mutation_write_* is a bit wide, though, and I think it should be reverted.
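One possible narrowing, purely as an illustration of the point (not a proposed patch): drop the broad mutation_write_ branch and keep only the system.paxos alternatives, so LWT timeouts are still downgraded while startup failures in other tables keep their ERROR severity:

```python
import re

# Narrowed downgrade rule: match only system.paxos timeouts, not every
# occurrence of "mutation_write_".
narrowed = re.compile(
    r"(Operation timed out for system.paxos|Operation failed for system.paxos)"
)

# The fatal startup error from the issue (quoted from the node's log).
startup_failure = (
    "ERR | [shard 0] init - Startup failed: "
    "exceptions::mutation_write_timeout_exception "
    "(Operation timed out for system_distributed_everywhere."
    "cdc_generation_descriptions_v2 - received only 5 responses from 6 CL=ALL.)"
)
# Hypothetical example of an LWT timeout line, for illustration only.
paxos_timeout = "Operation timed out for system.paxos - received only 0 responses"

assert narrowed.search(paxos_timeout) is not None   # still downgraded
assert narrowed.search(startup_failure) is None     # stays an ERROR
```

With this shape of rule, the LWT noise the original PR wanted to silence is still silenced, but the cdc_generation_descriptions_v2 startup failure would surface as a real error instead of a warning.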
Description
If we run the disrupt_nodetool_decommission nemesis and it fails to add a node, then this nemesis never ends. That blocks every nemesis scheduled after it, so we do not get any errors, and the CI job may finish successfully, which is a false positive.
Running the longevity-100gb-4h-ebs-gp3-test job and starting the mentioned nemesis, we see the following picture: node-2 successfully gets decommissioned in 15 minutes, then we start provisioning a new node.
Logs on the new node:
The last kind of message repeats for several minutes, and after that neither the node nor the nemesis makes any progress.
Steps to Reproduce
Run the disrupt_nodetool_decommission nemesis as part of the longevity-100gb-4h-ebs-gp3-test job.
Expected behavior: the nemesis should not hang; it must finish, either successfully or not, but finish.
Actual behavior: the nemesis hangs until the end of the test; only the load keeps running. The hung nemesis is absent from the email report, so the run looks like a false positive even though an issue happened.