scylladb / gemini

Test data integrity by comparing against an Oracle running in parallel
Apache License 2.0

Gemini exits following error messages in log #377

Closed: fgelcer closed this issue 12 months ago

fgelcer commented 1 year ago

Issue description

Looking at the Argus run, we can see that Gemini has exited before the end of the test:

2023-07-02 04:28:07.442: (GeminiStressEvent Severity.ERROR) period_type=end event_id=d1f2ede3-2301-4627-9342-109a98cadbd0 duration=47m50s: node=Node gemini-with-nemesis-3h-normal-maste-loader-node-3211fb00-1 [52.16.152.190 | 10.4.1.135] (seed: False) gemini_cmd=./gemini -d --duration 3h --warmup 30m -c 50 -m mixed -f --non-interactive --cql-features normal --max-mutation-retries 5 --max-mutation-retries-backoff 500ms --async-objects-stabilization-attempts 5 --async-objects-stabilization-backoff 500ms --replication-strategy "{'class': 'NetworkTopologyStrategy', 'replication_factor': '3'}" --oracle-replication-strategy "{'class': 'NetworkTopologyStrategy', 'replication_factor': '1'}" --test-cluster=10.4.3.228,10.4.1.25,10.4.1.134 --outfile /gemini_result_edf4a8cd-ff67-4b3e-a2c2-635bc0f8b2cc.log --seed 80 --request-timeout 180s --connect-timeout 120s --oracle-cluster=10.4.1.231
result=Exit code: 1
Command output: ['{"L":"INFO","T":"2023-07-02T04:28:06.341Z","M":"test finished"}', 'gemini encountered errors, exiting with non zero status']
errors=['Command error: Error: gemini encountered errors, exiting with non zero status\n\n']

In this time frame, I can see two error messages in the logs that are probably the reason Gemini exited, although Scylla's logs continued:

Jul 02 04:28:00 gemini-with-nemesis-3h-normal-maste-oracle-db-node-3211fb00-1 scylla[10299]:  [shard 1] compaction - [Compact ks1.table1_mv_0 d4371140-1890-11ee-8c9d-000000000001] Compacting [/var/lib/scylla/data/ks1/table1_mv_0-2f326790188a11ee8c9d000000000001/md-471-big-Data.db:level=0,/var/lib/scylla/data/ks1/table1_mv_0-2f326790188a11ee8c9d000000000001/md-473-big-Data.db:level=0]
Jul 02 04:28:06 gemini-with-nemesis-3h-normal-maste-oracle-db-node-3211fb00-1 scylla[10299]:  [shard 0] cql_server - exception while processing connection: seastar::nested_exception: std::system_error (error system:32, sendmsg: Broken pipe) (while cleaning up after std::system_error (error system:32, sendmsg: Broken pipe))
Jul 02 04:28:06 gemini-with-nemesis-3h-normal-maste-oracle-db-node-3211fb00-1 scylla[10299]:  [shard 1] cql_server - exception while processing connection: seastar::nested_exception: std::system_error (error system:32, sendmsg: Broken pipe) (while cleaning up after std::system_error (error system:32, sendmsg: Broken pipe))
Jul 02 04:28:06 gemini-with-nemesis-3h-normal-maste-oracle-db-node-3211fb00-1 scylla[10299]:  [shard 0] compaction - [Compact ks1.table1_col6_idx_index d34fe590-1890-11ee-bad0-000000000000] Compacted 2 sstables to [/var/lib/scylla/data/ks1/table1_col6_idx_index-2c31f8d0188a11eebad0000000000000/md-460-big-Data.db:level=0]. 71MB to 69MB (~98% of original) in 7104ms = 9MB/s. ~256 total partitions merged to 118.

Impact

Gemini exited; later, verify_results() failed, and the test ended with an error.

How frequently does it reproduce?

I'm re-running the same job with the same Gemini seed to confirm what happens, and will update.

Installation details

Kernel Version: 5.15.0-1039-aws
Scylla version (or git commit hash): 5.4.0~dev-20230629.f6f974cdeb11 with build-id 7afc85749bdc68e7ee32eead35d51badd480c79f

Cluster size: 3 nodes (i3.large)

Scylla Nodes used in this run:

OS / Image: `` (aws: undefined_region)

Test: gemini-3h-with-nemesis-test
Test id: 3211fb00-4a70-4d4f-9232-1df186f67bf2
Test name: scylla-master/gemini-/gemini-3h-with-nemesis-test
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 3211fb00-4a70-4d4f-9232-1df186f67bf2`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=3211fb00-4a70-4d4f-9232-1df186f67bf2)
- Show all stored logs command: `$ hydra investigate show-logs 3211fb00-4a70-4d4f-9232-1df186f67bf2`

Logs:

- **db-cluster-3211fb00.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/3211fb00-4a70-4d4f-9232-1df186f67bf2/20230702_045901/db-cluster-3211fb00.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/3211fb00-4a70-4d4f-9232-1df186f67bf2/20230702_045901/db-cluster-3211fb00.tar.gz)
- **sct-runner-events-3211fb00.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/3211fb00-4a70-4d4f-9232-1df186f67bf2/20230702_045901/sct-runner-events-3211fb00.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/3211fb00-4a70-4d4f-9232-1df186f67bf2/20230702_045901/sct-runner-events-3211fb00.tar.gz)
- **sct-3211fb00.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/3211fb00-4a70-4d4f-9232-1df186f67bf2/20230702_045901/sct-3211fb00.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/3211fb00-4a70-4d4f-9232-1df186f67bf2/20230702_045901/sct-3211fb00.log.tar.gz)
- **monitor-set-3211fb00.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/3211fb00-4a70-4d4f-9232-1df186f67bf2/20230702_045901/monitor-set-3211fb00.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/3211fb00-4a70-4d4f-9232-1df186f67bf2/20230702_045901/monitor-set-3211fb00.tar.gz)
- **loader-set-3211fb00.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/3211fb00-4a70-4d4f-9232-1df186f67bf2/20230702_045901/loader-set-3211fb00.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/3211fb00-4a70-4d4f-9232-1df186f67bf2/20230702_045901/loader-set-3211fb00.tar.gz)
- **parallel-timelines-report-3211fb00.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/3211fb00-4a70-4d4f-9232-1df186f67bf2/20230702_045901/parallel-timelines-report-3211fb00.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/3211fb00-4a70-4d4f-9232-1df186f67bf2/20230702_045901/parallel-timelines-report-3211fb00.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/gemini-/job/gemini-3h-with-nemesis-test/385/)
[Argus](https://argus.scylladb.com/test/c05bf635-e29a-490c-a72f-9d4e42e6c7e7/runs?additionalRuns[]=3211fb00-4a70-4d4f-9232-1df186f67bf2)
fgelcer commented 1 year ago

Reported https://github.com/scylladb/scylladb/issues/14474 to try to understand why we encountered this error in the oracle's logs.

dkropachev commented 12 months ago

It failed for the following reason:

{"L":"INFO","T":"2023-07-02T04:28:06.218Z","N":"work cycle.validation_job","M":"Retring failed validation stoped by reach of max attempts 5. Error: unable to load check data from the oracle store: system failed: gocql: no response received from cassandra within timeout period"}
Error detected: &joberror.JobError{Timestamp:time.Date(2023, time.July, 2, 4, 28, 6, 218489658, time.Local), Message:"Validation failed: unable to load check data from the oracle store: system failed: gocql: no response received from cassandra within timeout period", Query:"SELECT * FROM ks1.table1 WHERE col4=893580704621996895 AND col6=36 ALLOW FILTERING "}{"L":"INFO","T":"2023-07-02T04:28:06.218Z","N":"work cycle.validation_job","M":"ending validation loop"}

So basically the request failed 5 times and Gemini exited, which is the expected behavior.
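The mechanism described above can be sketched as a simple retry loop: retry the validation up to the configured maximum (`--max-mutation-retries`-style limit), sleeping a backoff between attempts, and give up with a wrapped error once the limit is reached. This is an illustrative sketch, not gemini's actual code; `validate`, `retryValidation`, and their signatures are assumptions made for the example, with `validate` hard-wired to fail the way the oracle timeouts did here.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// validate stands in for gemini's check against the oracle store.
// Here it always fails, mimicking the repeated gocql timeouts in the log.
func validate() error {
	return errors.New("gocql: no response received from cassandra within timeout period")
}

// retryValidation retries a failing validation up to maxAttempts times
// with a fixed backoff between attempts, then gives up and reports the
// last error (illustrative; gemini's real retry logic differs in detail).
func retryValidation(maxAttempts int, backoff time.Duration) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = validate(); err == nil {
			return nil
		}
		time.Sleep(backoff)
	}
	return fmt.Errorf("validation stopped after reaching max attempts %d: %w", maxAttempts, err)
}

func main() {
	// With every attempt failing, the loop exhausts its 5 attempts and
	// the process reports a non-zero outcome, as seen in the test run.
	if err := retryValidation(5, time.Millisecond); err != nil {
		fmt.Println(err)
	}
}
```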

fgelcer commented 12 months ago

@dkropachev, could you improve the log message in this case by:

  1. printing the "max retries reached" notice in a separate message
  2. raising the log level of this message to ERROR?