scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0
52 stars 85 forks source link

Oracle cluster has more data than the test one after the reshape of 63GB which took 502 seconds #4250

Closed roydahan closed 6 months ago

roydahan commented 2 years ago

@vponomaryov commented on Thu Dec 16 2021

Installation details Kernel version: 5.11.0-1022-aws Scylla version (or git commit hash): 4.7.dev-0.20211215.3ac622bdd with build-id c19c7740f10f82f554c1b67da69f82fc48f9386f Cluster size: 3 nodes (i3.large) OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-045cd00d68f3af143 (eu-north-1)

Scylla running with shards number (live nodes): gemini-with-nemesis-3h-normal-maste-db-node-a9ab70d5-1 (13.48.5.235 | 10.0.3.26): 2 shards gemini-with-nemesis-3h-normal-maste-db-node-a9ab70d5-2 (16.170.237.202 | 10.0.2.183): 2 shards gemini-with-nemesis-3h-normal-maste-db-node-a9ab70d5-4 (13.51.156.207 | 10.0.2.151): 2 shards

Dead Scylla nodes: gemini-with-nemesis-3h-normal-maste-db-node-a9ab70d5-3 (13.48.190.179 | 10.0.0.129): 2 shards, terminated at 2021-12-15 11:47:05.662660

Oracle AMI: ami-0118e6d3cacc225b4 Oracle instance type: i3.8xlarge Oracle nodes: gemini-with-nemesis-3h-normal-maste-oracle-db-node-a9ab70d5-1 (13.51.233.71 | 10.0.1.23)

Test: gemini-3h-with-nemesis-test Test name: longevity_test.LongevityTest.test_custom_time Test config file(s):

Issue description StopWaitStartScyllaServer nemesis stops scylla, sleeps for 5 minutes and then starts it. It failed with the following error:

2021-12-15 12:35:27.852: (DisruptionEvent Severity.ERROR) 
    period_type=end 
    event_id=4604f0d1-17c0-4561-b02b-99b0ce2cc112 
    duration=15m53s: nemesis_name=StopWaitStartScyllaServer 
    target_node=Node 
        gemini-with-nemesis-3h-normal-maste-db-node-a9ab70d5-4 
        [13.51.156.207 | 10.0.2.151] 
        (seed: False) 
    errors=Command did not complete within 500 seconds!
Command: 'sudo systemctl start scylla-server.service'

If we look at the node 4, where nemesis took place, we see following:

2021-12-15T12:19:34+00:00 gemini-with-nemesis-3h-normal-maste-db-node-a9ab70d5-4 !  NOTICE | scylla-jmx.service: Main process exited, code=exited, status=143/n/a
...
2021-12-15T12:19:34+00:00 gemini-with-nemesis-3h-normal-maste-db-node-a9ab70d5-4 !    INFO | Stopping Scylla Server...
...
2021-12-15T12:25:40+00:00 gemini-with-nemesis-3h-normal-maste-db-node-a9ab70d5-4 !    INFO | Starting Scylla Server...
...
2021-12-15T12:25:43+00:00 gemini-with-nemesis-3h-normal-maste-db-node-a9ab70d5-4 !    INFO |  [shard 1] compaction - [Reshape ks1.table1 1f8e0910-5da2-11ec-8fc2-a85329e66d69] Reshaping ...
...
2021-12-15T12:34:05+00:00 gemini-with-nemesis-3h-normal-maste-db-node-a9ab70d5-4 !    INFO |  [shard 0] compaction - [Reshape ks1.table1 1f8dbaf0-5da2-11ec-a8c6-a85229e66d69] Reshaped 32 sstables to [/var/lib/scylla/data/ks1/table1-6441e4805d8b11ec9a8c0e8adb8bd841/md-86-big-Data.db:level=0]. 29GB to 17GB (~57% of original) in 501964ms = 34MB/s. ~2311168 total partitions merged to 1337228.
2021-12-15T12:34:05+00:00 gemini-with-nemesis-3h-normal-maste-db-node-a9ab70d5-4 !    INFO |  [shard 0] database - Reshaped 63GB in 501.99 seconds, 126MB/s
...
2021-12-15T12:34:05+00:00 gemini-with-nemesis-3h-normal-maste-db-node-a9ab70d5-4 !    INFO | Started Scylla Server.

So, the reshape of 63Gb took more than the timeout - 502s > 500s. And then, as a result, we get discrepancy between test and oracle clusters:

2021-12-15 12:40:37.939: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=6c07a9fe-0013-4a78-86ed-94d296f538a7, source=GeminiTest.test_load_random_with_nemesis (gemini_test.GeminiTest)() message=Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/gemini_test.py", line 68, in test_load_random_with_nemesis
self.verify_results()
File "/home/ubuntu/scylla-cluster-tests/gemini_test.py", line 127, in verify_results
self.fail(self.gemini_results['results'])
AssertionError: [{
    'write_ops': 2900244, 
    'write_errors': 0, 
    'read_ops': 466093, 
    'read_errors': 1, 
    'errors': [{
        'timestamp': '2021-12-15T12:40:06.648224093Z', 
        'message': 'Validation failed: row count differ (
            test has 0 rows, 
            oracle has 1 rows, 
            test is missing rows: [pk0=-40, \tpk1=235422220, \tck0=166.98.24.83, \tck1=2493207025138893.410], 
            oracle is missing rows: [])', 
        'query': 'SELECT * FROM ks1.table1 WHERE pk0=-40 AND pk1=235422220 '
    }]
}]

Restore Monitor Stack command: $ hydra investigate show-monitor a9ab70d5-f7f1-4c1b-8f26-cd5700843d1c Restore monitor on AWS instance using Jenkins job Show all stored logs command: $ hydra investigate show-logs a9ab70d5-f7f1-4c1b-8f26-cd5700843d1c

Test id: a9ab70d5-f7f1-4c1b-8f26-cd5700843d1c

Logs: db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/a9ab70d5-f7f1-4c1b-8f26-cd5700843d1c/20211215_124608/db-cluster-a9ab70d5.tar.gz loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/a9ab70d5-f7f1-4c1b-8f26-cd5700843d1c/20211215_124608/loader-set-a9ab70d5.tar.gz monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/a9ab70d5-f7f1-4c1b-8f26-cd5700843d1c/20211215_124608/monitor-set-a9ab70d5.tar.gz sct-runner - https://cloudius-jenkins-test.s3.amazonaws.com/a9ab70d5-f7f1-4c1b-8f26-cd5700843d1c/20211215_124608/sct-runner-a9ab70d5.tar.gz

Jenkins job URL


@slivne commented on Mon Dec 20 2021

@roydahan not sure this is not a test issue

github-actions[bot] commented 6 months ago

This issue is stale because it has been open 2 years with no activity. Remove stale label or comment or this will be closed in 2 days.

github-actions[bot] commented 6 months ago

This issue was closed because it has been stalled for 2 days with no activity.