scylladb / gemini

Test data integrity by comparing against an Oracle running in parallel
Apache License 2.0

[Bug] Validation Error in GeminiTest.test_load_random_with_nemesis: Rows Differ Due to Repeated Entries in Column 'col5' #436

Open timtimb0t opened 1 month ago

timtimb0t commented 1 month ago

Packages

Scylla version: 6.2.0~rc1-20240919.a71d4bc49cc8 with build-id 2a79c005ca22208ec14a7708a4f423e96b5d861f

Kernel Version: 6.8.0-1016-aws

Issue description

Happens during this job: https://jenkins.scylladb.com/job/scylla-6.2/job/gemini/job/gemini-3h-with-nemesis-test/4/

2024-09-22 15:35:20.939: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=f294216e-d5eb-4a5d-b9d0-947174db18ba, source=GeminiTest.test_load_random_with_nemesis (gemini_test.GeminiTest)() message=Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/gemini_test.py", line 68, in test_load_random_with_nemesis
self.verify_results()
File "/home/ubuntu/scylla-cluster-tests/gemini_test.py", line 127, in verify_results
self.fail(self.gemini_results['results'])
AssertionError: [{'errors': [{'timestamp': '2024-09-22T15:34:41.040271558Z', 'message': 'Validation failed: rows differ (-map[ck0:0e5bd852e93365ceba7b983b1e7ae82f0e ck1:196.209.17.42 col0:1114235h14m33.527540656s col1:map[428598df-64b6-1ac8-898a-0269c58502ef:4428111412553544617 5c32fb93-feff-158e-898c-0269c58502ef:4779579801187610993 5f827495-0ce5-1082-898b-0269c58502ef:7356596969051568171] col10:d9e375ad-94d0-1fe4-898f-0269c58502ef col11:false col12:0.57534295 col2:8f1bbdd2-d836-1a8b-898d-0269c58502ef col3:1714736h23m51.156602912s col4:map[udt_1507393928_0:d668425f-84e6-1d9f-898e-0269c58502ef udt_1507393928_1:1767-09-21 01:16:18.871345152 +0000 UTC udt_1507393928_2:48254040-10-31 14:06:06.583 +0000 UTC udt_1507393928_3:131653931-08-13 22:35:40.318 +0000 UTC udt_1507393928_4:-6456 udt_1507393928_5:0.12871255 udt_1507393928_6:6936] col5:[7988919290375467742 4376061058815166327 379500719263269858 5009093180938296012 7174863172911282077 6522356728298859312 599653341271145931 3490621354264608579 7939102450213922708] col6:236.236.209.136 col7:3642 col8:0.8230271 col9:505143h58m44.369718516s pk0:8369222297109090103 pk1:5469874807351661720 pk2:-12846] 
+map[ck0:0e5bd852e93365ceba7b983b1e7ae82f0e ck1:196.209.17.42 col0:1114235h14m33.527540656s col1:map[428598df-64b6-1ac8-898a-0269c58502ef:4428111412553544617 5c32fb93-feff-158e-898c-0269c58502ef:4779579801187610993 5f827495-0ce5-1082-898b-0269c58502ef:7356596969051568171] col10:d9e375ad-94d0-1fe4-898f-0269c58502ef col11:false col12:0.57534295 col2:8f1bbdd2-d836-1a8b-898d-0269c58502ef col3:1714736h23m51.156602912s col4:map[udt_1507393928_0:d668425f-84e6-1d9f-898e-0269c58502ef udt_1507393928_1:1767-09-21 01:16:18.871345152 +0000 UTC udt_1507393928_2:48254040-10-31 14:06:06.583 +0000 UTC udt_1507393928_3:131653931-08-13 22:35:40.318 +0000 UTC udt_1507393928_4:-6456 udt_1507393928_5:0.12871255 udt_1507393928_6:6936] col5:[7988919290375467742 7988919290375467742 4376061058815166327 4376061058815166327 379500719263269858 379500719263269858 5009093180938296012 5009093180938296012 7174863172911282077 7174863172911282077 6522356728298859312 6522356728298859312 599653341271145931 599653341271145931 3490621354264608579 3490621354264608579 7939102450213922708 7939102450213922708] col6:236.236.209.136 col7:3642 col8:0.8230271 col9:505143h58m44.369718516s pk0:8369222297109090103 pk1:5469874807351661720 pk2:-12846]):   map[string]any{\n  \t... // 8 identical entries\n  \t"col3": s"1714736h23m51.156602912s",\n  \t"col4": map[string]any{"udt_1507393928_0": gocql.UUID{0xd6, 0x68, 0x42, 0x5f, ...}, "udt_1507393928_1": s"1767-09-21 01:16:18.871345152 +0000 UTC", "udt_1507393928_2": s"48254040-10-31 14:06:06.583 +0000 UTC", "udt_1507393928_3": s"131653931-08-13 22:35:40.318 +0000 UTC", ...},\n  \t"col5": []int64{\n  \t\t7988919290375467742,\n+ \t\t7988919290375467742,\n  \t\t4376061058815166327,\n+ \t\t4376061058815166327,\n  \t\t379500719263269858,\n+ \t\t379500719263269858,\n  \t\t5009093180938296012,\n+ \t\t5009093180938296012,\n  \t\t7174863172911282077,\n+ \t\t7174863172911282077,\n+ \t\t6522356728298859312,\n  \t\t6522356728298859312,\n+ \t\t599653341271145931,\n  \t\t599653341271145931,\n+ \t\t3490621354264608579,\n  \t\t3490621354264608579,\n+ \t\t7939102450213922708,\n  \t\t7939102450213922708,\n  \t},\n  \t"col6": string("236.236.209.136"),\n  \t"col7": int16(3642),\n  \t... // 5 identical entries\n  }\n', 'query': 'SELECT * FROM ks1.table1 WHERE pk0=8369222297109090103 AND pk1=5469874807351661720 AND pk2=-12846 ', 'stmt-type': 'SelectStatement'}], 'write_ops': 29336134, 'write_errors': 0, 'read_ops': 6545513, 'read_errors': 1}]

Seems like the results are duplicated. Not sure, but it may be a driver-related issue.

How frequently does it reproduce?

Just once

Installation details

Cluster size: 3 nodes (i4i.2xlarge)

Scylla Nodes used in this run:

OS / Image: ami-059b505168db98ed8 (aws: undefined_region)

Test: gemini-3h-with-nemesis-test
Test id: 5dffaac7-04b4-473d-a574-c53d59bfd567
Test name: scylla-6.2/gemini/gemini-3h-with-nemesis-test
Test method: gemini_test.GeminiTest.test_load_random_with_nemesis
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 5dffaac7-04b4-473d-a574-c53d59bfd567`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=5dffaac7-04b4-473d-a574-c53d59bfd567)
- Show all stored logs command: `$ hydra investigate show-logs 5dffaac7-04b4-473d-a574-c53d59bfd567`

Logs:

- **db-cluster-5dffaac7.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/5dffaac7-04b4-473d-a574-c53d59bfd567/20240922_153805/db-cluster-5dffaac7.tar.gz
- **sct-runner-events-5dffaac7.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/5dffaac7-04b4-473d-a574-c53d59bfd567/20240922_153805/sct-runner-events-5dffaac7.tar.gz
- **sct-5dffaac7.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/5dffaac7-04b4-473d-a574-c53d59bfd567/20240922_153805/sct-5dffaac7.log.tar.gz
- **loader-set-5dffaac7.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/5dffaac7-04b4-473d-a574-c53d59bfd567/20240922_153805/loader-set-5dffaac7.tar.gz
- **monitor-set-5dffaac7.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/5dffaac7-04b4-473d-a574-c53d59bfd567/20240922_153805/monitor-set-5dffaac7.tar.gz
- **parallel-timelines-report-5dffaac7.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/5dffaac7-04b4-473d-a574-c53d59bfd567/20240922_153805/parallel-timelines-report-5dffaac7.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-6.2/job/gemini/job/gemini-3h-with-nemesis-test/4/)
[Argus](https://argus.scylladb.com/test/eecf23d9-ad04-47ab-9a3a-b41319625e95/runs?additionalRuns[]=5dffaac7-04b4-473d-a574-c53d59bfd567)
timtimb0t commented 1 month ago

Here is a similar issue: https://github.com/scylladb/scylladb/issues/7937

temichus commented 1 month ago

Reproduced in the latest 6.2.0.

Packages

Scylla version: 6.2.0-20241013.b8a9fd4e49e8 with build-id a61f658b0408ba10663812f7a3b4d6aea7714fac

Kernel Version: 6.8.0-1016-aws


Installation details

Cluster size: 3 nodes (i4i.2xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0cfa45340a2dfa18d (aws: undefined_region)

Test: gemini-3h-with-nemesis-test
Test id: 2ce47ce4-9395-47a3-9b59-97305baf93ca
Test name: scylla-6.2/gemini/gemini-3h-with-nemesis-test
Test method: gemini_test.GeminiTest.test_load_random_with_nemesis
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 2ce47ce4-9395-47a3-9b59-97305baf93ca`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=2ce47ce4-9395-47a3-9b59-97305baf93ca)
- Show all stored logs command: `$ hydra investigate show-logs 2ce47ce4-9395-47a3-9b59-97305baf93ca`

Logs:

- **db-cluster-2ce47ce4.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/2ce47ce4-9395-47a3-9b59-97305baf93ca/20241013_232644/db-cluster-2ce47ce4.tar.gz
- **sct-runner-events-2ce47ce4.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/2ce47ce4-9395-47a3-9b59-97305baf93ca/20241013_232644/sct-runner-events-2ce47ce4.tar.gz
- **sct-2ce47ce4.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/2ce47ce4-9395-47a3-9b59-97305baf93ca/20241013_232644/sct-2ce47ce4.log.tar.gz
- **loader-set-2ce47ce4.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/2ce47ce4-9395-47a3-9b59-97305baf93ca/20241013_232644/loader-set-2ce47ce4.tar.gz
- **monitor-set-2ce47ce4.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/2ce47ce4-9395-47a3-9b59-97305baf93ca/20241013_232644/monitor-set-2ce47ce4.tar.gz
- **parallel-timelines-report-2ce47ce4.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/2ce47ce4-9395-47a3-9b59-97305baf93ca/20241013_232644/parallel-timelines-report-2ce47ce4.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-6.2/job/gemini/job/gemini-3h-with-nemesis-test/7/)
[Argus](https://argus.scylladb.com/test/eecf23d9-ad04-47ab-9a3a-b41319625e95/runs?additionalRuns[]=2ce47ce4-9395-47a3-9b59-97305baf93ca)
aleksbykov commented 1 month ago

@bhalevy @nyh, we have reproduced https://github.com/scylladb/scylla-enterprise/issues/3663 several times on 6.2. As I understand it, this could be expected behavior. I saw that it was fixed in Gemini: https://github.com/scylladb/scylladb/issues/7937. We need your opinion: can we ignore this issue and reopen the bug in the gemini repo?

nyh commented 1 month ago

This was supposedly fixed in Gemini by https://github.com/scylladb/gemini/commit/fd281582bcae7d2d2f052e66bf207115836350d2, so how did it come back? And why is this specific to 6.2?

https://github.com/scylladb/scylladb/issues/3559#issuecomment-2188983326 explains why two appends to a list using the same user-side timestamp would be duplicated if they happen to reach two different coordinators - because the "timeuuid" used as the list key uses not just the timestamp but also the node id.
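
To make that mechanism concrete, here is a minimal sketch, not Gemini's code; it assumes a hypothetical simplified table `ks1.t (pk bigint PRIMARY KEY, v list<bigint>)` and a local test cluster, and uses gocql (the driver Gemini uses):

```go
// Minimal sketch, not Gemini's code. Assumes a hypothetical simplified table:
//   CREATE TABLE ks1.t (pk bigint PRIMARY KEY, v list<bigint>);
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("127.0.0.1") // assumed local test cluster
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	ts := time.Now().UnixMicro() // one user-side timestamp shared by both attempts

	// The same logical append executed twice, as an original attempt plus a
	// replay would do. Each attempt becomes a new list cell whose key is a
	// timeuuid derived from the timestamp *and* the coordinator's node id, so
	// if the attempts reach different coordinators the keys differ and both
	// cells survive: the value ends up in the list twice.
	for attempt := 0; attempt < 2; attempt++ {
		if err := session.Query(
			`UPDATE ks1.t SET v = v + ? WHERE pk = ?`,
			[]int64{7988919290375467742}, int64(8369222297109090103),
		).WithTimestamp(ts).Exec(); err != nil {
			log.Fatal(err)
		}
	}
}
```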

nyh commented 1 month ago

@aleksbykov somebody will need to investigate the Gemini failure and explain in plain English (not opaque Gemini output) what exactly the observed problem is. Is it even a list (as in the issues mentioned above) or a map, as the error message suggests? What does Gemini do with this map, and why does it not expect the output it gets?

aleksbykov commented 1 month ago

@nyh, the map is just Gemini output; it shows the difference between rows in the test and oracle clusters. The main difference, which caused the Gemini error, is the duplicated values in the list column:

col5:[7988919290375467742 4376061058815166327 379500719263269858 5009093180938296012 7174863172911282077 6522356728298859312 599653341271145931 3490621354264608579 7939102450213922708]

col5:[7988919290375467742 7988919290375467742 4376061058815166327 4376061058815166327 379500719263269858 379500719263269858 5009093180938296012 5009093180938296012 7174863172911282077 7174863172911282077 6522356728298859312 6522356728298859312 599653341271145931 599653341271145931 3490621354264608579 3490621354264608579 7939102450213922708 7939102450213922708]
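
Note that the second list is exactly the first with every element repeated once in place, i.e. every single append was applied twice. A quick statement of that pattern (hypothetical Go helper, not Gemini code):

```go
package dupcheck

// isEachElementDoubled reports whether got equals exp with every element
// repeated exactly twice in a row - the pattern in the col5 diff above, e.g.
// isEachElementDoubled([]int64{1, 2}, []int64{1, 1, 2, 2}) == true.
func isEachElementDoubled(exp, got []int64) bool {
	if len(got) != 2*len(exp) {
		return false
	}
	for i, v := range exp {
		if got[2*i] != v || got[2*i+1] != v {
			return false
		}
	}
	return true
}
```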
nyh commented 1 month ago

Somebody will need to debug Gemini to understand why this happens. As I noted here and in various issues mentioned above, we know that there are cases when the list item may be added more than once. https://github.com/scylladb/scylladb/issues/3559#issuecomment-2188983326 is one such known case: When the same list append gets sent to different coordinator shards (if using server-side timestamps, even the same coordinator will cause this problem).

Although this is arguably a Scylla bug - a bug that should be recognized but will probably never be fixed - I don't know whether Gemini is supposed to "reproduce" this bug every time. It seems to me that https://github.com/scylladb/gemini/commit/fd281582bcae7d2d2f052e66bf207115836350d2 tried to make this "reproduction" less likely, but I'm not sure it completely solves the problem - is it possible that Gemini or the driver it uses retries an append operation and that is what causes the double append? https://github.com/scylladb/scylladb/issues/3559 suggests maybe the application (Gemini) or its driver is doing this retry, and that list appends are not idempotent and must not be retried. But does Gemini or its driver know that?
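
For what it's worth, gocql does carry an idempotence flag on queries (it uses it to gate features such as speculative execution); whether Gemini sets it, and whether its retry path respects it for list appends, is exactly the open question. A sketch of what marking the append non-idempotent looks like, reusing the hypothetical table from the earlier sketch:

```go
// Sketch only. Query.Idempotent and SimpleRetryPolicy are real gocql APIs,
// but whether Gemini configures them this way is the open question above.
package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("127.0.0.1") // assumed local test cluster
	cluster.RetryPolicy = &gocql.SimpleRetryPolicy{NumRetries: 3}
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// A list append is not idempotent: replaying it appends again. Flagging it
	// non-idempotent tells the driver the statement is unsafe to replay
	// automatically; on a timeout the application must read back and verify
	// instead of blindly retrying.
	if err := session.Query(
		`UPDATE ks1.t SET v = v + ? WHERE pk = ?`, // hypothetical table as above
		[]int64{42}, int64(1),
	).Idempotent(false).Exec(); err != nil {
		log.Fatal(err)
	}
}
```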

Again, somebody who is familiar with Gemini and the driver it uses should look into this, I am not sure how to help here.

timtimb0t commented 4 days ago

Since the issue was spotted on several runs I changed the label to tier1, but the issue itself is not critical.

mykaul commented 4 days ago

> Since the issue was spotted on several runs I changed the label to tier1, but the issue itself is not critical.

@timtimb0t - are we sure it's not a Gemini issue?

timtimb0t commented 4 days ago

Reproduced in the following run as well:

Packages

Base Scylla version: 6.0.4-20240919.7d421abec43d with build-id 7cc42fc19bcc383fe3723163554d80edaae8b8a5
Target Scylla version (or git commit hash): 2024.2.0-20241118.614d56348f46 with build-id 2f88720b7edc44997fa233aef003e675dd1a7cee

Kernel Version: 5.15.0-1070-aws


Installation details

Cluster size: 4 nodes (im4gn.2xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0ac9f799d0f225bd4 (aws: undefined_region)

Test: rolling-upgrade-ami-arm-test
Test id: fb117102-22bf-436c-aa13-3cec8d3fb1db
Test name: enterprise-2024.2/rolling-upgrade/rolling-upgrade-ami-arm-test
Test method: upgrade_test.UpgradeTest.test_rolling_upgrade
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor fb117102-22bf-436c-aa13-3cec8d3fb1db`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=fb117102-22bf-436c-aa13-3cec8d3fb1db)
- Show all stored logs command: `$ hydra investigate show-logs fb117102-22bf-436c-aa13-3cec8d3fb1db`

Logs:

- **db-cluster-fb117102.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/fb117102-22bf-436c-aa13-3cec8d3fb1db/20241118_131304/db-cluster-fb117102.tar.gz
- **sct-runner-events-fb117102.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/fb117102-22bf-436c-aa13-3cec8d3fb1db/20241118_131304/sct-runner-events-fb117102.tar.gz
- **sct-fb117102.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/fb117102-22bf-436c-aa13-3cec8d3fb1db/20241118_131304/sct-fb117102.log.tar.gz
- **loader-set-fb117102.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/fb117102-22bf-436c-aa13-3cec8d3fb1db/20241118_131304/loader-set-fb117102.tar.gz
- **monitor-set-fb117102.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/fb117102-22bf-436c-aa13-3cec8d3fb1db/20241118_131304/monitor-set-fb117102.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/enterprise-2024.2/job/rolling-upgrade/job/rolling-upgrade-ami-arm-test/12/)
[Argus](https://argus.scylladb.com/test/4a669c4a-ed36-46fa-b01a-665f055462b4/runs?additionalRuns[]=fb117102-22bf-436c-aa13-3cec8d3fb1db)
timtimb0t commented 4 days ago

> @timtimb0t - are we sure it's not a Gemini issue?

We are sure that it's a Gemini issue.

timtimb0t commented 4 days ago

https://argus.scylladb.com/tests/scylla-cluster-tests/a6b19c4e-631b-4c65-b349-4a52fdca2334

mykaul commented 4 days ago

> We are sure that it's a Gemini issue.

Then why is it in this repo?

timtimb0t commented 3 days ago

> Then why is it in this repo?

I mistakenly created it there because I didn't know it was a different component at that time. I can recreate it in the proper repo, should I?