scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

Stopping a nemesis while NEMESIS_LOCK is locked raises a failure #6490

Closed fruch closed 1 year ago

fruch commented 1 year ago

Issue description

It looks like the recent change #6442 causes the following failure, because it stops the nemesis thread while that thread is still waiting for NEMESIS_LOCK:

2023-08-13 02:47:48.904: (ThreadFailedEvent Severity.ERROR) period_type=one-time event_id=0348cf22-7a75-4268-a9ac-d9328194d214: message='140297676756048--disrupt_hard_reboot_node'
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4868, in wrapper
    NEMESIS_LOCK.acquire()  # pylint: disable=consider-using-with
sdcm.exceptions.KillNemesis

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/sct_events/decorators.py", line 26, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 362, in run
    self.disrupt()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 6013, in disrupt
    self.call_next_nemesis()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1792, in call_next_nemesis
    self.execute_disrupt_method(disrupt_method=self.disruptions_list.pop())
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1720, in execute_disrupt_method
    disrupt_method()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4970, in wrapper
    NEMESIS_RUN_INFO.pop(nemesis_run_info_key)
KeyError: '140297676756048--disrupt_hard_reboot_node'
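
The traceback suggests a general pattern: an exception is raised while the thread waits for the lock, and cleanup code then pops a key that was never stored, so a `KeyError` masks the original `KillNemesis`. The following is a minimal sketch of that pattern, not the actual `sdcm/nemesis.py` code. `RUN_INFO`, `InterruptibleLock`, `fragile_wrapper`, `tolerant_wrapper` and `outcome` are hypothetical names introduced for illustration, and the `KillNemesis` class here is only a stand-in for `sdcm.exceptions.KillNemesis`.

```python
import threading


class KillNemesis(Exception):
    """Stand-in for sdcm.exceptions.KillNemesis (assumed to be a plain exception)."""


RUN_INFO = {}  # hypothetical stand-in for NEMESIS_RUN_INFO


class InterruptibleLock:
    """Hypothetical lock that can simulate a thread being stopped while waiting."""

    def __init__(self, interrupt=False):
        self._lock = threading.Lock()
        self._interrupt = interrupt

    def acquire(self):
        if self._interrupt:
            # Simulates the nemesis thread being killed inside acquire().
            raise KillNemesis("stopped while waiting for the lock")
        self._lock.acquire()

    def release(self):
        self._lock.release()


def fragile_wrapper(lock, key):
    """Cleanup assumes the key was stored, which is false if acquire() raised."""
    try:
        lock.acquire()
        try:
            RUN_INFO[key] = "running"
        finally:
            lock.release()
    finally:
        RUN_INFO.pop(key)  # raises KeyError when the key was never stored


def tolerant_wrapper(lock, key):
    """Same flow, but cleanup tolerates a missing key."""
    try:
        lock.acquire()
        try:
            RUN_INFO[key] = "running"
        finally:
            lock.release()
    finally:
        RUN_INFO.pop(key, None)  # no error if the key is absent


def outcome(wrapper, interrupt):
    """Run a wrapper and report the exception type that escapes, or 'ok'."""
    key = "140297676756048--disrupt_hard_reboot_node"
    try:
        wrapper(InterruptibleLock(interrupt), key)
    except Exception as exc:
        return type(exc).__name__
    return "ok"


if __name__ == "__main__":
    print(outcome(fragile_wrapper, interrupt=True))
    print(outcome(tolerant_wrapper, interrupt=True))
    print(outcome(fragile_wrapper, interrupt=False))
```

In this sketch the fragile variant surfaces `KeyError` when the wait is interrupted, while the tolerant variant lets the original `KillNemesis` propagate. Whether a tolerant `pop` or storing the key only after the lock is acquired is the better fix depends on the real wrapper, which is not shown in this issue.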

Installation details

Kernel Version: 5.10.184-175.749.amzn2.x86_64
Scylla version (or git commit hash): 5.4.0~dev-20230812.d1d1b6cf6e01 with build-id 6c4f55c26164d6fe2cd25d38f5022795ce696d9c

Operator Image: scylladb/scylla-operator:latest
Operator Helm Version: v1.10.0-alpha.0-28-gd131f8e
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 4 nodes (i4i.4xlarge)

Scylla Nodes used in this run: No resources left at the end of the run

OS / Image: `` (k8s-eks: undefined_region)

Test: longevity-scylla-operator-3h-multitenant-eks
Test id: 95e46710-c84c-48a2-9ef9-6366e2a664cf
Test name: scylla-operator/operator-master/eks/longevity-scylla-operator-3h-multitenant-eks
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 95e46710-c84c-48a2-9ef9-6366e2a664cf`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=95e46710-c84c-48a2-9ef9-6366e2a664cf)
- Show all stored logs command: `$ hydra investigate show-logs 95e46710-c84c-48a2-9ef9-6366e2a664cf`

## Logs:

- **kubernetes-95e46710.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/95e46710-c84c-48a2-9ef9-6366e2a664cf/20230813_073220/kubernetes-95e46710.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/95e46710-c84c-48a2-9ef9-6366e2a664cf/20230813_073220/kubernetes-95e46710.tar.gz)
- **db-cluster-95e46710.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/95e46710-c84c-48a2-9ef9-6366e2a664cf/20230813_073220/db-cluster-95e46710.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/95e46710-c84c-48a2-9ef9-6366e2a664cf/20230813_073220/db-cluster-95e46710.tar.gz)
- **sct-runner-events-95e46710.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/95e46710-c84c-48a2-9ef9-6366e2a664cf/20230813_073220/sct-runner-events-95e46710.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/95e46710-c84c-48a2-9ef9-6366e2a664cf/20230813_073220/sct-runner-events-95e46710.tar.gz)
- **sct-95e46710.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/95e46710-c84c-48a2-9ef9-6366e2a664cf/20230813_073220/sct-95e46710.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/95e46710-c84c-48a2-9ef9-6366e2a664cf/20230813_073220/sct-95e46710.log.tar.gz)
- **loader-set-95e46710.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/95e46710-c84c-48a2-9ef9-6366e2a664cf/20230813_073220/loader-set-95e46710.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/95e46710-c84c-48a2-9ef9-6366e2a664cf/20230813_073220/loader-set-95e46710.tar.gz)
- **monitor-set-95e46710.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/95e46710-c84c-48a2-9ef9-6366e2a664cf/20230813_073220/monitor-set-95e46710.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/95e46710-c84c-48a2-9ef9-6366e2a664cf/20230813_073220/monitor-set-95e46710.tar.gz)
- **parallel-timelines-report-95e46710.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/95e46710-c84c-48a2-9ef9-6366e2a664cf/20230813_073220/parallel-timelines-report-95e46710.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/95e46710-c84c-48a2-9ef9-6366e2a664cf/20230813_073220/parallel-timelines-report-95e46710.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-operator/job/operator-master/job/eks/job/longevity-scylla-operator-3h-multitenant-eks/72/)

[Argus](https://argus.scylladb.com/test/7774c24a-b749-4528-97a4-22785e7e5b6f/runs?additionalRuns[]=95e46710-c84c-48a2-9ef9-6366e2a664cf)

vponomaryov commented 1 year ago

PR: https://github.com/scylladb/scylla-cluster-tests/pull/6502