Oracle cluster has more data than the test one after the reshape of 63GB which took 502 seconds

roydahan commented 2 years ago

@vponomaryov commented on Thu Dec 16 2021

Installation details Kernel version: 5.11.0-1022-aws Scylla version (or git commit hash): 4.7.dev-0.20211215.3ac622bdd with build-id c19c7740f10f82f554c1b67da69f82fc48f9386f Cluster size: 3 nodes (i3.large) OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-045cd00d68f3af143 (eu-north-1)

Scylla running with shards number (live nodes): gemini-with-nemesis-3h-normal-maste-db-node-a9ab70d5-1 (13.48.5.235 | 10.0.3.26): 2 shards gemini-with-nemesis-3h-normal-maste-db-node-a9ab70d5-2 (16.170.237.202 | 10.0.2.183): 2 shards gemini-with-nemesis-3h-normal-maste-db-node-a9ab70d5-4 (13.51.156.207 | 10.0.2.151): 2 shards

Dead Scylla nodes: gemini-with-nemesis-3h-normal-maste-db-node-a9ab70d5-3 (13.48.190.179 | 10.0.0.129): 2 shards, terminated at 2021-12-15 11:47:05.662660

Oracle AMI: ami-0118e6d3cacc225b4 Oracle instance type: i3.8xlarge Oracle nodes: gemini-with-nemesis-3h-normal-maste-oracle-db-node-a9ab70d5-1 (13.51.233.71 | 10.0.1.23)

Test: gemini-3h-with-nemesis-test Test name: longevity_test.LongevityTest.test_custom_time Test config file(s):

gemini-3h-with-nemesis.yaml

Issue description StopWaitStartScyllaServer nemesis stops scylla, sleeps for 5 minutes and then starts it. It failed with the following error:

2021-12-15 12:35:27.852: (DisruptionEvent Severity.ERROR) 
    period_type=end 
    event_id=4604f0d1-17c0-4561-b02b-99b0ce2cc112 
    duration=15m53s: nemesis_name=StopWaitStartScyllaServer 
    target_node=Node 
        gemini-with-nemesis-3h-normal-maste-db-node-a9ab70d5-4 
        [13.51.156.207 | 10.0.2.151] 
        (seed: False) 
    errors=Command did not complete within 500 seconds!
Command: 'sudo systemctl start scylla-server.service'

If we look at the node 4, where nemesis took place, we see following:

2021-12-15T12:19:34+00:00 gemini-with-nemesis-3h-normal-maste-db-node-a9ab70d5-4 !  NOTICE | scylla-jmx.service: Main process exited, code=exited, status=143/n/a
...
2021-12-15T12:19:34+00:00 gemini-with-nemesis-3h-normal-maste-db-node-a9ab70d5-4 !    INFO | Stopping Scylla Server...
...
2021-12-15T12:25:40+00:00 gemini-with-nemesis-3h-normal-maste-db-node-a9ab70d5-4 !    INFO | Starting Scylla Server...
...
2021-12-15T12:25:43+00:00 gemini-with-nemesis-3h-normal-maste-db-node-a9ab70d5-4 !    INFO |  [shard 1] compaction - [Reshape ks1.table1 1f8e0910-5da2-11ec-8fc2-a85329e66d69] Reshaping ...
...
2021-12-15T12:34:05+00:00 gemini-with-nemesis-3h-normal-maste-db-node-a9ab70d5-4 !    INFO |  [shard 0] compaction - [Reshape ks1.table1 1f8dbaf0-5da2-11ec-a8c6-a85229e66d69] Reshaped 32 sstables to [/var/lib/scylla/data/ks1/table1-6441e4805d8b11ec9a8c0e8adb8bd841/md-86-big-Data.db:level=0]. 29GB to 17GB (~57% of original) in 501964ms = 34MB/s. ~2311168 total partitions merged to 1337228.
2021-12-15T12:34:05+00:00 gemini-with-nemesis-3h-normal-maste-db-node-a9ab70d5-4 !    INFO |  [shard 0] database - Reshaped 63GB in 501.99 seconds, 126MB/s
...
2021-12-15T12:34:05+00:00 gemini-with-nemesis-3h-normal-maste-db-node-a9ab70d5-4 !    INFO | Started Scylla Server.

So, the reshape of 63Gb took more than the timeout - 502s > 500s. And then, as a result, we get discrepancy between test and oracle clusters:

2021-12-15 12:40:37.939: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=6c07a9fe-0013-4a78-86ed-94d296f538a7, source=GeminiTest.test_load_random_with_nemesis (gemini_test.GeminiTest)() message=Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/gemini_test.py", line 68, in test_load_random_with_nemesis
self.verify_results()
File "/home/ubuntu/scylla-cluster-tests/gemini_test.py", line 127, in verify_results
self.fail(self.gemini_results['results'])
AssertionError: [{
    'write_ops': 2900244, 
    'write_errors': 0, 
    'read_ops': 466093, 
    'read_errors': 1, 
    'errors': [{
        'timestamp': '2021-12-15T12:40:06.648224093Z', 
        'message': 'Validation failed: row count differ (
            test has 0 rows, 
            oracle has 1 rows, 
            test is missing rows: [pk0=-40, \tpk1=235422220, \tck0=166.98.24.83, \tck1=2493207025138893.410], 
            oracle is missing rows: [])', 
        'query': 'SELECT * FROM ks1.table1 WHERE pk0=-40 AND pk1=235422220 '
    }]
}]

Restore Monitor Stack command: $ hydra investigate show-monitor a9ab70d5-f7f1-4c1b-8f26-cd5700843d1c Restore monitor on AWS instance using Jenkins job Show all stored logs command: $ hydra investigate show-logs a9ab70d5-f7f1-4c1b-8f26-cd5700843d1c

Test id: a9ab70d5-f7f1-4c1b-8f26-cd5700843d1c

Logs: db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/a9ab70d5-f7f1-4c1b-8f26-cd5700843d1c/20211215_124608/db-cluster-a9ab70d5.tar.gz loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/a9ab70d5-f7f1-4c1b-8f26-cd5700843d1c/20211215_124608/loader-set-a9ab70d5.tar.gz monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/a9ab70d5-f7f1-4c1b-8f26-cd5700843d1c/20211215_124608/monitor-set-a9ab70d5.tar.gz sct-runner - https://cloudius-jenkins-test.s3.amazonaws.com/a9ab70d5-f7f1-4c1b-8f26-cd5700843d1c/20211215_124608/sct-runner-a9ab70d5.tar.gz

Jenkins job URL

@slivne commented on Mon Dec 20 2021

@roydahan not sure this is not a test issue

github-actions[bot] commented 6 months ago

This issue is stale because it has been open 2 years with no activity. Remove stale label or comment or this will be closed in 2 days.

github-actions[bot] commented 6 months ago

This issue was closed because it has been stalled for 2 days with no activity.

scylladb / scylla-cluster-tests

Oracle cluster has more data than the test one after the reshape of 63GB which took 502 seconds #4250