scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

Critical CassandraStressEvent received after end of test #5558

Open juliayakovlev opened 1 year ago

juliayakovlev commented 1 year ago

The test reached its timeout and killed the running cassandra-stress thread. In this case a critical CassandraStressEvent is not expected, but we got one in https://jenkins.scylladb.com/job/scylla-master/job/raft/job/longevity-lwt-500G-3d-test/3/

2022-12-11 21:32:30.068: (TestTimeoutEvent Severity.CRITICAL) period_type=not-set event_id=519f9d31-995f-4a67-a210-a6db0fe6cfc1, Test started at 2022-12-08 19:22:29, reached it's timeout (4450 minute)
2022-12-11 22:06:48.303: (CassandraStressEvent Severity.CRITICAL) period_type=end event_id=28722b40-8057-4818-8117-b8dfe414d92a during_nemesis=NodetoolCleanup duration=2d10h37m56s: node=Node longevity-lwt-500G-3d-master-loader-node-062617a6-1 [54.75.126.242 | 10.4.1.219] (seed: False)
stress_cmd=cassandra-stress user profile=/tmp/lwt_builtin_functions.yaml ops'(lwt_update_by_pk=1,lwt_update_by_ck=1)' cl=QUORUM duration=3600m -mode native cql3 -rate threads=10 -pop seq=33333334..66666666
errors:
Stress command execution failed with:

https://argus.scylladb.com/test/3a98b3f1-5534-4e95-a7a9-2f4104f4c851/runs?additionalRuns%5B%5D=062617a6-15f2-46cd-aaed-78b847098fac

AFAIR Dmitry fixed this problem, but now we have it again.

fruch commented 1 year ago

@juliayakovlev what's the expectation regarding those critical events? Should they be suppressed and not shown anywhere?

The only fix I'm aware of is making sure the signal to stop the test is sent only once.

juliayakovlev commented 1 year ago

> @juliayakovlev what's the expectation regarding those critical events? Should they be suppressed and not shown anywhere?
>
> The only fix I'm aware of is making sure the signal to stop the test is sent only once.

Yes, we should not send a critical c-s event. I remember Dmitry fixed it, but I don't remember how. It was something connected to threads.
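
To make the expectation above concrete, here is a minimal sketch of one way an intentional kill could be told apart from a real failure. It is not the SCT implementation: `ToySeverity`, `ToyStressRun`, `kill` and `end_severity` are hypothetical names invented for this illustration, and the only assumption is that the code that kills the stress command can record that it did so before the end event is published.

```python
import enum
import threading


class ToySeverity(enum.Enum):
    # Hypothetical stand-in for an event severity, not the sdcm enum.
    NORMAL = "NORMAL"
    CRITICAL = "CRITICAL"


class ToyStressRun:
    """Hypothetical wrapper around one running stress command."""

    def __init__(self):
        # Set once the test itself asks for the command to be killed.
        self._kill_requested = threading.Event()

    def kill(self):
        # Safe to call more than once; the flag simply stays set.
        self._kill_requested.set()

    def end_severity(self, exit_code):
        # A non-zero exit is only critical if nobody asked for the kill.
        if exit_code == 0 or self._kill_requested.is_set():
            return ToySeverity.NORMAL
        return ToySeverity.CRITICAL
```

With this shape, a stress command that dies on its own still produces a critical end event, while one killed at test timeout does not.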

juliayakovlev commented 1 year ago

Issue description

sdcm.tester.ClusterTester.stop_resources:

    def stop_resources(self):  # pylint: disable=no-self-use
        self.log.debug('Stopping all resources')
        with silence(parent=self, name="Kill Stress Threads"):
            self.kill_stress_thread()

        # Stopping nemesis, using timeout of 30 minutes, since replace/decommission node can take time
        if self.db_cluster:
            self.get_nemesis_report(self.db_cluster)
            self.stop_nemesis(self.db_cluster)
            self.stop_resources_stop_tasks_threads(self.db_cluster)

After TearDown starts, we kill the stress threads and only after that do we stop the nemesis. In this test the stress thread is run from within the nemesis. When we kill that stress thread, we get a CassandraStressEvent with Severity.CRITICAL:

< t:2023-03-30 12:18:38,010 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2023-03-30 12:18:38.008: (CassandraStressEvent Severity.CRITICAL) period_type=end event_id=fd149787-2ddb-450e-9702-7bb5c275c5ae during_nemesis=ReplaceServiceLevelUsingDropDuringLoad duration=24m44s: node=Node longevity-5gb-1h-slareplaceusingdro-loader-node-3a57a28f-eastus-1 [172.174.236.76 | 10.0.0.8] (seed: False)
< t:2023-03-30 12:18:38,010 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > stress_cmd=cassandra-stress read cl=QUORUM duration=45m -mode cql3 native user=role500_6f8f007e password=rolep500 -rate threads=200 -pop seq=1..1514571 -errors retries=50 -col 'n=FIXED(8) size=FIXED(128)'

Maybe we need to change the order: stop the nemesis first, and then stop any stress threads that are still running?
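
The difference between the two orderings can be sketched with a toy model. This is an illustration only, not SCT code: `ToyStressThread`, `ToyTest`, `teardown_current` and `teardown_proposed` are hypothetical names, and the model assumes (unverified) that stopping the nemesis lets it shut down the stress threads it started without reporting a failure.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStressThread:
    owner: str            # hypothetical label: "test" or "nemesis"
    running: bool = True

    def kill(self, events):
        # Killing a thread that is still running is reported as a failure.
        if self.running:
            self.running = False
            events.append(("CRITICAL", self.owner))

    def finish(self):
        # Orderly shutdown by the owner: no failure event is recorded.
        self.running = False


@dataclass
class ToyTest:
    threads: list
    events: list = field(default_factory=list)

    def kill_stress_thread(self):
        for thread in self.threads:
            thread.kill(self.events)

    def stop_nemesis(self):
        # Assumption: the nemesis cleans up the stress threads it started.
        for thread in self.threads:
            if thread.owner == "nemesis":
                thread.finish()


def teardown_current(test):
    # Order used in stop_resources today: kill stress, then stop nemesis.
    test.kill_stress_thread()
    test.stop_nemesis()


def teardown_proposed(test):
    # Proposed order: stop nemesis, then kill whatever stress is left.
    test.stop_nemesis()
    test.kill_stress_thread()
```

In this model, `teardown_current` on a test whose only stress thread belongs to the nemesis records one critical event, while `teardown_proposed` records none. Whether the real nemesis stops its stress threads this cleanly would need to be checked.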

Installation details

Kernel Version: 5.15.0-1034-azure

Scylla version (or git commit hash): 2023.1.0~rc3-20230321.80de75947b7a with build-id 6e1d6cb6cac9242e7ed7bfd8b07c1fc5998281dc

Cluster size: 3 nodes (Standard_L8s_v3)

Scylla Nodes used in this run:

OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/SCYLLA-IMAGES/providers/Microsoft.Compute/images/scylla-enterprise-2023.1.0-rc3-x86_64-2023-03-23T13-10-32 (azure: eastus)

Test: longevity-5gb-1h-SlaReplaceUsingDropDuringLoad-azure-test

Test id: 3a57a28f-a370-4cd6-a913-b2724b2b3890

Test name: enterprise-2023.1/nemesis/longevity-5gb-1h-SlaReplaceUsingDropDuringLoad-azure-test

Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 3a57a28f-a370-4cd6-a913-b2724b2b3890`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=3a57a28f-a370-4cd6-a913-b2724b2b3890)
- Show all stored logs command: `$ hydra investigate show-logs 3a57a28f-a370-4cd6-a913-b2724b2b3890`

## Logs:

- **db-cluster-3a57a28f.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/3a57a28f-a370-4cd6-a913-b2724b2b3890/20230330_123522/db-cluster-3a57a28f.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/3a57a28f-a370-4cd6-a913-b2724b2b3890/20230330_123522/db-cluster-3a57a28f.tar.gz)
- **sct-runner-3a57a28f.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/3a57a28f-a370-4cd6-a913-b2724b2b3890/20230330_123522/sct-runner-3a57a28f.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/3a57a28f-a370-4cd6-a913-b2724b2b3890/20230330_123522/sct-runner-3a57a28f.tar.gz)
- **monitor-set-3a57a28f.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/3a57a28f-a370-4cd6-a913-b2724b2b3890/20230330_123522/monitor-set-3a57a28f.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/3a57a28f-a370-4cd6-a913-b2724b2b3890/20230330_123522/monitor-set-3a57a28f.tar.gz)
- **loader-set-3a57a28f.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/3a57a28f-a370-4cd6-a913-b2724b2b3890/20230330_123522/loader-set-3a57a28f.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/3a57a28f-a370-4cd6-a913-b2724b2b3890/20230330_123522/loader-set-3a57a28f.tar.gz)
- **parallel-timelines-report-3a57a28f.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/3a57a28f-a370-4cd6-a913-b2724b2b3890/20230330_123522/parallel-timelines-report-3a57a28f.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/3a57a28f-a370-4cd6-a913-b2724b2b3890/20230330_123522/parallel-timelines-report-3a57a28f.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/enterprise-2023.1/job/nemesis/job/longevity-5gb-1h-SlaReplaceUsingDropDuringLoad-azure-test/3/)