scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

Gemini verification started during Tear-down and failed for quorum unavailable nodes #8379

Open yarongilor opened 2 months ago

yarongilor commented 2 months ago

Packages

Scylla version: 6.0.3-20240808.a56f7ce21ad4 with build-id 00ad3169bb53c452cf2ab93d97785dc56117ac3e

Kernel Version: 5.15.0-1067-aws

Issue description


  1. Health check failed (due to an issue with the remove-node nemesis).
  2. Tear-down started.
  3. Gemini verification started and failed to get a quorum:
    < t:2024-08-11 15:47:02,573 f:tester.py       l:2887 c:GeminiTest           p:INFO  > TearDown is starting...
    < t:2024-08-11 15:47:35,018 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >     self.verify_results()
    2024-08-11 15:47:35.015: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=b4a77eac-1b45-4690-9c0a-7e80372055af, source=GeminiTest.test_load_random_with_nemesis (gemini_test.GeminiTest)() message=Traceback (most recent call last):
    File "/home/ubuntu/scylla-cluster-tests/gemini_test.py", line 68, in test_load_random_with_nemesis
    self.verify_results()
    File "/home/ubuntu/scylla-cluster-tests/gemini_test.py", line 127, in verify_results
    self.fail(self.gemini_results['results'])
    AssertionError: [{'errors': [{'timestamp': '2024-08-11T15:46:28.472514048Z', 'message': 'Validation failed: unable to load check data from the test store: Cannot achieve consistency level for cl QUORUM. Requires 2, alive 1', 'query': 'SELECT * FROM ks1.table1_mv_0 WHERE col8= AND pk0=674687930108493689 AND pk1=8424091603174626176 ', 'stmt-type': 'SelectStatement'}

Perhaps it would be best if teardown first notified the other threads, or stopped them, in order to avoid such collisions; one possible approach is sketched below.
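A minimal sketch of that idea, assuming a plain `threading.Event` shared between teardown and the stress/verification threads (the names here are illustrative, not the actual SCT API):

```python
# Hypothetical sketch, not the actual SCT API: teardown first signals any
# still-running stress/verification threads and waits for them to exit,
# so their results are not evaluated while the cluster is being torn down.
import threading
import time

stop_requested = threading.Event()    # shared flag; the name is illustrative

def stress_worker():
    # Stand-in for a gemini/stress loop: runs until teardown asks it to stop.
    while not stop_requested.is_set():
        time.sleep(1)                 # placeholder for one stress iteration

def tear_down(stress_threads, join_timeout=300):
    stop_requested.set()              # tell the workers teardown has begun
    for t in stress_threads:
        t.join(timeout=join_timeout)  # collect them before verifying results

worker = threading.Thread(target=stress_worker, daemon=True)
worker.start()
tear_down([worker])
```

The point of the sketch is only the ordering: signal first, join second, and only then verify results, so a stress thread never reports a verification failure against a cluster that is already being dismantled.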

Impact


How frequently does it reproduce?


Installation details

Cluster size: 3 nodes (i4i.2xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0c6a6957b89f8504f (aws: undefined_region)

Test: gemini-3h-with-nemesis-test
Test id: 5d11f833-59fd-4573-ba63-afec8d1b175b
Test name: scylla-6.0/gemini/gemini-3h-with-nemesis-test
Test method: gemini_test.GeminiTest.test_load_random_with_nemesis
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 5d11f833-59fd-4573-ba63-afec8d1b175b`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=5d11f833-59fd-4573-ba63-afec8d1b175b)
- Show all stored logs command: `$ hydra investigate show-logs 5d11f833-59fd-4573-ba63-afec8d1b175b`

Logs:

- **db-cluster-5d11f833.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/5d11f833-59fd-4573-ba63-afec8d1b175b/20240811_154920/db-cluster-5d11f833.tar.gz
- **sct-runner-events-5d11f833.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/5d11f833-59fd-4573-ba63-afec8d1b175b/20240811_154920/sct-runner-events-5d11f833.tar.gz
- **sct-5d11f833.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/5d11f833-59fd-4573-ba63-afec8d1b175b/20240811_154920/sct-5d11f833.log.tar.gz
- **loader-set-5d11f833.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/5d11f833-59fd-4573-ba63-afec8d1b175b/20240811_154920/loader-set-5d11f833.tar.gz
- **monitor-set-5d11f833.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/5d11f833-59fd-4573-ba63-afec8d1b175b/20240811_154920/monitor-set-5d11f833.tar.gz
- **parallel-timelines-report-5d11f833.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/5d11f833-59fd-4573-ba63-afec8d1b175b/20240811_154920/parallel-timelines-report-5d11f833.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-6.0/job/gemini/job/gemini-3h-with-nemesis-test/16/)
[Argus](https://argus.scylladb.com/test/2873b203-404e-492c-957e-8cf49830f0f5/runs?additionalRuns[]=5d11f833-59fd-4573-ba63-afec8d1b175b)
yarongilor commented 2 months ago

@fruch do you have any idea whether SCT has already bumped into similar issues? Or is there already any suggested improvement?

fruch commented 2 months ago

> @fruch do you have any idea whether SCT has already bumped into similar issues? Or is there already any suggested improvement?

Are you sure about the order of things?

Test isn't supposed to end before stress commands are finished.

If it stopped because of the test timeout, something isn't working as expected: either the stress took longer than it was asked to run, or the test timeout is too small.

And even if stress is running during teardown, that's not a reason for nodes to be gone.

fruch commented 2 months ago

You are completely barking up the wrong tree; that's an abort during the test, in the middle of a nemesis that changes topology.

You clearly lost quorum, and SCT has nothing to do with it.

fruch commented 2 months ago

This is not an issue with SCT.

Gemini reports its failure once it finishes; whether that happens during teardown or not makes no difference.

DB nodes are not stopped on teardown.

fruch commented 2 months ago

Looking at it again: one node was lost in disrupt_remove_node_then_add_node and wasn't replaced, because of a failure in removenode.

Then one more node was stopped during the enospc nemesis.

This run has only 3 nodes and two of them are gone, so guess what, Gemini would fail...
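For reference, the arithmetic behind the "Requires 2, alive 1" message, assuming the keyspace uses replication factor 3 (an assumption; the RF is not shown in the log excerpt):

```python
# Quick arithmetic behind "Requires 2, alive 1": a QUORUM read/write needs
# floor(RF / 2) + 1 replicas to respond. RF=3 is an assumption here.
def quorum(replication_factor: int) -> int:
    return replication_factor // 2 + 1

rf, alive = 3, 1                    # two of the three nodes are gone
required = quorum(rf)               # -> 2
print(f"QUORUM requires {required}, alive {alive}: "
      f"{'OK' if alive >= required else 'Unavailable'}")
```

With only one replica alive, any QUORUM statement Gemini issues is bound to fail, regardless of whether it runs during the test body or during teardown.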