CreateIndex nemesis was started (and failed) on a node that had already been terminated by the parallel NodeTerminateAndReplace nemesis

dimakr commented 1 month ago

Packages

Scylla version: 2024.1.8-20240724.fc3e399a25f3 with build-id 646cf933d8926947ade5b2a7cbc5bacb145df4fb

Kernel Version: 5.15.0-1066-aws

Issue description

During the enterprise-2024.1/longevity/longevity-multidc-schema-topology-changes-12h-test#26 run, the disrupt_terminate_and_replace_node and disrupt_create_index nemeses were started in parallel and both targeted the same node, node-7. The NodeTerminateAndReplace nemesis started terminating the node at 02:43:52 and finished at 02:49:38:

2024-07-25 02:43:52,385 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-07-25 02:43:52.383: (InfoEvent Severity.NORMAL) period_type=not-set event_id=b225fcb9-0252-4021-b079-81bdd9c5508c: message=StartEvent - Terminate node and wait 5 minutes
...
2024-07-25 02:49:38,553 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-07-25 02:49:38.552: (InfoEvent Severity.NORMAL) period_type=not-set event_id=7f4d8643-394f-4b82-b161-8b1661ea938f: message=FinishEvent - target_node was terminated

CreateIndex tried to start index creation on node-7 at 02:54:47:

2024-07-25 02:54:47,644 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-07-25 02:54:47.641: (InfoEvent Severity.NORMAL) period_type=not-set event_id=3f17c704-6a15-44d2-8442-7f30371e414f: message=Starting creating index: keyspace1.standard2(c1)

and eventually failed as the node was no longer available, with the error:

Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5094, in wrapper
    result = method(*args[1:], **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4758, in disrupt_create_index
    wait_for_index_to_be_built(self.target_node, ks, index_name, timeout=timeout * 2)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/nemesis_utils/indexes.py", line 73, in wait_for_index_to_be_built
    wait_for_view_to_be_built(node=node, ks=ks, view_name=f'{index_name}_index', timeout=timeout)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/nemesis_utils/indexes.py", line 80, in wait_for_view_to_be_built
    result = node.run_nodetool(f"viewbuildstatus {ks}.{view_name}",
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2605, in run_nodetool
    runner(cmd, timeout=timeout, ignore_status=ignore_status, verbose=verbose, retry=retry)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 614, in run
    result = _run()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 70, in inner
    return func(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 607, in _run
    if self._run_on_retryable_exception(exc, new_session):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_libssh_cmd_runner.py", line 78, in _run_on_retryable_exception
    raise RetryableNetworkException(str(exc), original=exc)
sdcm.remote.base.RetryableNetworkException: Failed to run a command due to exception!
Command: '/usr/bin/nodetool  viewbuildstatus keyspace1.standard2_c1_nemesis_index '
Stdout:
Stderr:
Exception:  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 588, in run
    self.connect()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 524, in connect
    raise ConnectTimeout(ex_msg) from exc
Failed to connect in 60 seconds, last error: (ConnectError)Error connecting to host '10.3.1.199:22' - timed out
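
For reference, the timeline can be worked out from the event timestamps in the log lines quoted above. This is only an illustrative calculation; the variable names are made up for the example and nothing here comes from SCT itself.

```python
from datetime import datetime

# Timestamp format used by the event messages in the log excerpts.
FMT = "%Y-%m-%d %H:%M:%S.%f"

# Values copied from the three InfoEvent lines quoted in this report.
terminate_start = datetime.strptime("2024-07-25 02:43:52.383", FMT)
terminate_finish = datetime.strptime("2024-07-25 02:49:38.552", FMT)
create_index_start = datetime.strptime("2024-07-25 02:54:47.641", FMT)

# How long the termination step took, and how long after it finished
# CreateIndex started working against the same node.
termination_duration = terminate_finish - terminate_start
gap_before_create_index = create_index_start - terminate_finish

print(termination_duration)      # 0:05:46.169000
print(gap_before_create_index)   # 0:05:09.089000
```

So CreateIndex began index creation a little over five minutes after the target node had been reported as terminated.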

Impact

Parallel nemeses interfered with each other: CreateIndex failed because its target node had already been terminated by NodeTerminateAndReplace.

How frequently does it reproduce?

No other occurrences of the issue were noticed.

Installation details

Cluster size: 12 nodes (i3en.2xlarge)

Scylla Nodes used in this run: No resources left at the end of the run

OS / Image: ami-072fc07743bf86cd3 ami-09a43832bb62c9b19 (aws: undefined_region)

Test: longevity-multidc-schema-topology-changes-12h-test

Test id: 97c11d18-65ec-4dfa-9b9d-70ba669c3f11

Test name: enterprise-2024.1/longevity/longevity-multidc-schema-topology-changes-12h-test

Test config file(s):

longevity-multidc-parallel-topology-schema-changes-12h.yaml

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 97c11d18-65ec-4dfa-9b9d-70ba669c3f11`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=97c11d18-65ec-4dfa-9b9d-70ba669c3f11)
- Show all stored logs command: `$ hydra investigate show-logs 97c11d18-65ec-4dfa-9b9d-70ba669c3f11`

## Logs:

- **db-cluster-97c11d18.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/97c11d18-65ec-4dfa-9b9d-70ba669c3f11/20240725_151613/db-cluster-97c11d18.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/97c11d18-65ec-4dfa-9b9d-70ba669c3f11/20240725_151613/db-cluster-97c11d18.tar.gz)
- **sct-runner-events-97c11d18.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/97c11d18-65ec-4dfa-9b9d-70ba669c3f11/20240725_151613/sct-runner-events-97c11d18.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/97c11d18-65ec-4dfa-9b9d-70ba669c3f11/20240725_151613/sct-runner-events-97c11d18.tar.gz)
- **sct-97c11d18.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/97c11d18-65ec-4dfa-9b9d-70ba669c3f11/20240725_151613/sct-97c11d18.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/97c11d18-65ec-4dfa-9b9d-70ba669c3f11/20240725_151613/sct-97c11d18.log.tar.gz)
- **loader-set-97c11d18.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/97c11d18-65ec-4dfa-9b9d-70ba669c3f11/20240725_151613/loader-set-97c11d18.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/97c11d18-65ec-4dfa-9b9d-70ba669c3f11/20240725_151613/loader-set-97c11d18.tar.gz)
- **monitor-set-97c11d18.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/97c11d18-65ec-4dfa-9b9d-70ba669c3f11/20240725_151613/monitor-set-97c11d18.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/97c11d18-65ec-4dfa-9b9d-70ba669c3f11/20240725_151613/monitor-set-97c11d18.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/enterprise-2024.1/job/longevity/job/longevity-multidc-schema-topology-changes-12h-test/26/) [Argus](https://argus.scylladb.com/test/8764d79e-d785-4b99-9c70-99e2f24d0a18/runs?additionalRuns[]=97c11d18-65ec-4dfa-9b9d-70ba669c3f11)

soyacz commented 1 month ago

Hmm, shouldn't a nemesis select a node that is not already the target_node of a parallel nemesis?

fruch commented 4 weeks ago

The problem is that both of them were the first to pick the node:

sdcm.nemesis.SisyphusMonkey: Current Target: Node parallel-topology-schema-changes-mu-db-node-97c11d18-7 [13.40.68.247 | 10.3.1.199] (dc name: eu-west-2scylla_node_west, rack: 2a) with running nemesis: None

`running nemesis: None` is the problem: it means both nemeses selected the node while it wasn't yet decorated with a running nemesis.
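
One possible way to close this window is to make "check that the node is free" and "mark it as taken" a single step under a lock. The sketch below is not SCT code: `Node`, `TargetSelector`, `acquire` and `release` are hypothetical names used only to illustrate the idea, and the real selection logic in sdcm/nemesis.py may be structured quite differently.

```python
import threading
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Node:
    # Hypothetical stand-in for a cluster node; not the SCT node class.
    name: str
    running_nemesis: Optional[str] = None


class TargetSelector:
    """Hands out nodes so that no two nemeses can hold the same one."""

    def __init__(self, nodes: Iterable[Node]):
        self._nodes = list(nodes)
        self._lock = threading.Lock()

    def acquire(self, nemesis_name: str) -> Optional[Node]:
        # Checking and marking happen under the same lock, so a second
        # nemesis can never observe running_nemesis=None on a node that
        # another nemesis is in the middle of selecting.
        with self._lock:
            for node in self._nodes:
                if node.running_nemesis is None:
                    node.running_nemesis = nemesis_name
                    return node
        return None  # every node is already targeted

    def release(self, node: Node) -> None:
        with self._lock:
            node.running_nemesis = None


# Two "nemeses" start selecting at the same moment.
nodes = [Node("node-7"), Node("node-8")]
selector = TargetSelector(nodes)
picked = {}
barrier = threading.Barrier(2)


def run(nemesis_name: str) -> None:
    barrier.wait()  # line both threads up before selecting
    picked[nemesis_name] = selector.acquire(nemesis_name)


threads = [
    threading.Thread(target=run, args=(name,))
    for name in ("NodeTerminateAndReplace", "CreateIndex")
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

With this arrangement the two threads always end up with different nodes, regardless of which one wins the race for the lock.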
