Closed fruch closed 12 months ago
It happened more times:
Kernel Version: 5.13.0-1021-aws
Scylla version (or git commit hash): 5.0~rc3-20220406.f92622e0d
with build-id 2b79c4744216b294fdbd2f277940044c899156ea
Cluster size: 6 nodes (i3.large)
OS / Image: ami-0e4ae5e4a139c50f3
(aws: eu-north-1)
Test: longevity-harry-2h-test
Test id: a4bdcc57-c0d5-4e25-8ab7-56a32f1d6a40
$ hydra investigate show-monitor a4bdcc57-c0d5-4e25-8ab7-56a32f1d6a40
$ hydra investigate show-logs a4bdcc57-c0d5-4e25-8ab7-56a32f1d6a40
Can we try and run this on 4.6, to see if this is a regression?
@fruch ^^
@eliransin try to find someone that can help look at this
@slivne @eliransin
It also happens on 4.6.3, so it's not a regression:
Kernel Version: 5.11.0-1022-aws
Scylla version (or git commit hash): 4.6.3-20220414.8bf149fdd
with build-id 8d16d8972498cc769071ff25309b009eb77bf77a
Cluster size: 6 nodes (i3.large)
OS / Image: ami-010abbe835b052d37
(aws: eu-north-1)
Test: longevity-harry-2h-test
Test id: fc555304-c2e8-4e9d-8c0c-9d0e8f37260d
Test name: scylla-staging/fruch/longevity-harry-2h-test
Test config file(s):
Restore Monitor Stack command: $ hydra investigate show-monitor fc555304-c2e8-4e9d-8c0c-9d0e8f37260d
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs fc555304-c2e8-4e9d-8c0c-9d0e8f37260d
The same happened during the destroy_data_then_repair disruption. More test details:
Kernel Version: 5.13.0-1025-aws
Scylla version (or git commit hash): 5.0~rc6-20220523.338edcc02
with build-id 60217f35371db2b1283e0c5bc67a7f5604768d41
Cluster size: 6 nodes (i3.large)
Scylla Nodes used in this run:
OS / Image: ami-07b5745a1e6de34ce
(aws: eu-west-1)
Test: longevity-harry-2h-test
Test id: 73b768fb-6b82-4e6e-ba97-408f305b5b1b
Test name: scylla-5.0/longevity/longevity-harry-2h-test
Test config file(s):
Restore Monitor Stack command: $ hydra investigate show-monitor 73b768fb-6b82-4e6e-ba97-408f305b5b1b
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs 73b768fb-6b82-4e6e-ba97-408f305b5b1b
@eliransin will someone look at it? I'm setting the high label since otherwise no one will ever fix it. And if no one is going to fix it, we can just stop running the test.
@roydahan does it happen all the time? Can I have a reproducer job so I can debug?
you can use https://jenkins.scylladb.com/view/master/job/scylla-master/job/reproducers/job/longevity-harry-2h-test/, it's a 2.5h run.
Did we or can we run it against Cassandra to eliminate a bug it cassandra-harry?
we don't have SCT setup to run Cassandra with our nemesis
If the same failure occurs in Cassandra, this would mean one of two things:
- Bug in cassandra-harry
- Bug in Cassandra's high-level design which we replicated (at least to some extent) - I think this is unlikely.
Again, just to add: in steady state, without restarting nodes, cassandra-harry verification works correctly. So we can only look at the querying the tool does: it performs inserts, then scans the last 100 items it inserted and compares them to what it has in memory (very similar to Gemini, but without an oracle).
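For illustration only, here is a rough Python sketch of that validation scheme (this is not harry's code; the table, columns, and the use of the Python driver are made up to show the idea of comparing an in-memory model against QUORUM reads):

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("harry")
model = {}  # (pk, ck) -> value: the in-memory "oracle"

def write(pk, ck, value):
    stmt = SimpleStatement(
        "INSERT INTO table0 (pk, ck, v) VALUES (%s, %s, %s)",
        consistency_level=ConsistencyLevel.QUORUM)
    session.execute(stmt, (pk, ck, value))  # record only acknowledged writes
    model[(pk, ck)] = value

def validate_partition(pk):
    stmt = SimpleStatement(
        "SELECT ck, v FROM table0 WHERE pk = %s",
        consistency_level=ConsistencyLevel.QUORUM)
    actual = {(pk, row.ck): row.v for row in session.execute(stmt, (pk,))}
    expected = {k: v for k, v in model.items() if k[0] == pk}
    if actual != expected:  # this is the kind of mismatch harry reports
        raise AssertionError(f"model/DB mismatch for pk={pk}")

The point is that every row in the model was acknowledged by the cluster at QUORUM, so a QUORUM read that misses it indicates either a tool bug or lost data.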
Where is the cassandra-harry log (on which tar file?)
in the loader-set
@eliransin are we planning to fix this for 5.1?
I am not sure we will have the bandwidth for it for 5.1
Pushing to 5.2 as we don't have the bandwidth to deal with it now. Per @fruch it is not a regression
/Cc @roydahan @slivne
This has been pushed as a waterfall from version to version. Dropping it to the large backlog (5.x) until we have a commitment for it.
/Cc @eliransin @mykaul
The problem is important enough that we'll analyze and solve it. Just pushing it to more versions is useless. First, QA needs to analyze this; it's not enough to report a bug, there is a chance the test isn't good enough - let's see that the test accounts for write acknowledgements and that there are no timeouts in the write/read. This should be done by code analysis (I think that c-s ignores errors/timeouts and doesn't retry) and by looking at the logs.
This isn't cassandra-stress; I didn't see any code path that should be ignoring errors or timeouts, and it has a crude retry of 10 times. One could have a look at the monitor data to see if there were write/read errors (I don't recall we've seen such).
I don't know what more analysis we can do in this case; that's exactly the reason why we open such an issue, to continue the discussion.
If the answer is it's not important enough and no one has time to help, 🤷♂️
If all possible errors are logged and there is nothing in the log, you need to pass it to R&D. If it's not the case, keep on digging and modify the test.
This is important enough to invest in
Got it again in 2023.1.0-rc2
Kernel Version: 5.15.0-1031-aws
Scylla version (or git commit hash): 2023.1.0~rc2-20230302.726f8a090337
with build-id efcc950cda22875aa488c5295c6cc68f35d9cc6f
Cluster size: 6 nodes (i3.large)
Scylla Nodes used in this run:
OS / Image: ami-0336a5a646b08dc25
(aws: eu-west-1)
Test: longevity-harry-2h-test
Test id: 10a155f9-f3c2-4e45-b516-51ee8d98edcf
Test name: enterprise-2023.1/longevity/longevity-harry-2h-test
Test config file(s):
@DoronArazii
@ptrsmrn did we have any fixes around this in the last 6 months?
@cvybhu is looking into it
Reading the cassandra harry logs, specifically cassandra-harry-l0-df244f51-e953-4039-8b6f-d08eabfc4243.log
from the loader set, I see that just before the validation error there was an error saying that an INSERT statement has failed:
ERROR [pool-3-thread-2] instance_id_IS_UNDEFINED 2022-05-17 12:36:52,344 MutatingVisitor.java:164 - Caught message while trying to execute CompiledStatement{cql='BEGIN UNLOGGED BATCH
INSERT INTO harry.table0 (pk0000,pk0001,pk0002,ck0000,ck0001,regular0000,regular0001) VALUES (?, ?, ?, ?, ?, ?, ?) USING TIMESTAMP 1652791011659406; UPDATE harry.table0 USING TIMESTAMP 1652791011659406 SET regular0001 = ?, regular0002 = ? WHERE pk0000 = ? AND pk0001 = ? AND pk0002 = ? AND ck0000 = ? AND ck0001 = ?;
APPLY BATCH;', bindings=2450526259L,"ZHHyABdiABdiABdiABdiABdibMpLLWYc246338536017219215786200249516516921755301071520113020085215215133154139102441651501722212916516878952652107670141217518237232138187448255471668243130146116331434722023121319016822719423910121153122207461854648122552532184423923024113411364233331986924958207911001111961088354138251159261119688181249669118342164192352792541321511311811321615620924251722271118814155922622710920022886118231891034244135126130207161239521312221819318015162235141108572412619882371622101887411013448248182572525220388","ZHHyABdiABdiABdiABdiABdijpiHlFPu18810071292471623018914611020917537255129295411201312172422501021331684211625292220223167178108254902252285854386",865384412096L,(float)2.0980271E-38,-1417954105,(short)-14460,(short)19283,965534980,2450526259L,"ZHHyABdiABdiABdiABdiABdibMpLLWYc246338536017219215786200249516516921755301071520113020085215215133154139102441651501722212916516878952652107670141217518237232138187448255471668243130146116331434722023121319016822719423910121153122207461854648122552532184423923024113411364233331986924958207911001111961088354138251159261119688181249669118342164192352792541321511311811321615620924251722271118814155922622710920022886118231891034244135126130207161239521312221819318015162235141108572412619882371622101887411013448248182572525220388","ZHHyABdiABdiABdiABdiABdijpiHlFPu18810071292471623018914611020917537255129295411201312172422501021331684211625292220223167178108254902252285854386",289220137400L,(float)1.5767209E-38}
com.datastax.driver.core.exceptions.TransportException: [/10.0.3.136:9042] Connection has been closed
at com.datastax.driver.core.Connection$ConnectionCloseFuture.force(Connection.java:1397)
at com.datastax.driver.core.Connection$ConnectionCloseFuture.force(Connection.java:1378)
at com.datastax.driver.core.Connection.defunct(Connection.java:570)
at com.datastax.driver.core.Connection$ChannelCloseListener.operationComplete(Connection.java:1328)
at com.datastax.driver.core.Connection$ChannelCloseListener.operationComplete(Connection.java:1318)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:512)
at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:505)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:484)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:425)
at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:84)
at io.netty.channel.AbstractChannel$CloseFuture.setClosed(AbstractChannel.java:1093)
at io.netty.channel.AbstractChannel$AbstractUnsafe.doClose0(AbstractChannel.java:710)
at io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:686)
at io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:557)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.closeOnRead(AbstractNioByteChannel.java:71)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:162)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:646)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:546)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:500)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:460)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:834)
A moment later there's the validation error:
ERROR [pool-4-thread-1] instance_id_IS_UNDEFINED 2022-05-17 12:37:59,796 Runner.java:497 - Dumping results into the file:failure.dump
ERROR 12:37:59 Execution failed!
java.lang.RuntimeException: java.util.concurrent.ExecutionException: harry.model.Model$ValidationException: Found a row in the model that is not present in the resultset:
Expected: RowState{cd=-2680195869748707583, vds=[-9223372036854775808, -9223372036854775808, 3295175451, 16481971], lts=[-9223372036854775808, -9223372036854775808, 2450810, 2450810], clustering=[390003691143, 3.741939E-39], values=[null, null, -999791845, 2.309616E-38]}
Actual: resultSetRow(-4848934879607217747L, -2681267964583869968L, values(64895744886884297L,191L,40452L,236L,223L), lts(2450990L,2450980L,2450990L,2450990L,2450990L), values(4217329462L,-9223372036854775808L,-9223372036854775808L,-9223372036854775808L), lts(2450960L,-9223372036854775808L,-9223372036854775808L,-9223372036854775808L))
Query: CompiledStatement{cql='SELECT pk0000, pk0001, pk0002, ck0000, ck0001, static0000, static0001, static0002, static0003, static0004, regular0000, regular0001, regular0002, regular0003, writetime(static0000), writetime(static0001), writetime(static0002), writetime(static0003), writetime(static0004), writetime(regular0000), writetime(regular0001), writetime(regular0002), writetime(regular0003) FROM harry.table0 WHERE pk0000 = ? AND pk0001 = ? AND pk0002 = ? AND ck0000 <= ? ORDER BY ck0000 DESC, ck0001 DESC;', bindings=1018503018L,"ZHHyABdiABdiABdiABdiABdixSFhALUT131226229842317120712319025208204148451361407410623224313022030891841822481141252399720661253252313515918524797108172171301914840","ZHHyABdiABdiABdiABdiABdifnZdiaVk103382413760411412106713510217112919880114149917513323115357119998916316360104240131139205892235117330841991652825411564452497751881811785110459131256514141592280160230129126188131123112851591494119236226018715199538947122289328114149320312725519612945541115716118075143105108181881081138118216281584021210614121127117118024521681294892159139451221318946382572382212031622032491059829171116187561294422671177713924912017784243197411258518793226145151232191892264711311025525011121071011961011212251104248178751402482241711697416221117684180292991046121445150153150151208161143122244208169387223817445251147131371372614253261137980243204116902031165089202125121475821421933247242160146945070221761613015441282122026712916814120193171258831184227254822522242321311312462245120452411579841081291671821121499201661821971872375111387205467943541931971126745891711777467345418214338128771661238916158194782001920663921392174158146121866824110211325220524421164103176157231104226145890146199992722199818621019913492129107162554132213758120193",818566072796L}
Partition state:
Static row: RowSta
At first glance it looks like a bug in Cassandra Harry.
A node goes down due to the rolling restart, and because of that inserting a row fails with a ConnectionClosed
exception. The row doesn't get inserted, and because of this the validation fails.
The driver should retry all idempotent queries that failed due to IO errors, at least that's what the rust driver does.
Okay, it looks like the MutatingVisitor retries each query 10 times:
https://github.com/apache/cassandra-harry/blob/5570c254df4fd6495c864f4021970ae005a62ce5/harry-core/src/harry/visitors/MutatingVisitor.java
After ten tries it would write "Can not execute statement %s after %d retries", but there is no such message in the logs, which means that all the queries succeeded.
So the initial idea about a Harry bug was most likely wrong.
So the row is successfully written, but it isn't visible during the read.
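For reference, the retry behaviour described above boils down to something like this sketch (illustrative Python, not the actual MutatingVisitor code; the back-off is an assumption):

import logging
import time

log = logging.getLogger("mutating-visitor-sketch")
MAX_RETRIES = 10

def execute_with_retry(execute, statement):
    for attempt in range(MAX_RETRIES):
        try:
            return execute(statement)
        except Exception as exc:  # e.g. connection closed during a restart
            log.warning("attempt %d failed: %s", attempt + 1, exc)
            time.sleep(1)  # back off a little (illustrative)
    # harry logs a message of this shape only when all retries are exhausted,
    # so its absence in the logs implies every statement eventually succeeded
    log.error("Can not execute statement %s after %d retries", statement, MAX_RETRIES)
    raise RuntimeError("statement failed after retries")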
One more idea - is Scylla configured with durable_writes = true?
If the database responds before the data is actually written to persistent storage, we could end up losing data. Imagine a scenario like this:
1) A is down, B and C are up
2) Harry writes a row with CL=QUORUM, it reaches B and C
3) A comes back up, doesn't know about the row
4) B goes down before the change is persisted, it forgets about the row
5) B comes back up
6) Harry reads with CL=QUORUM, it reaches A and B, neither of which knows about the row
7) The read doesn't return the row
I heard that there are multiple modes for the commitlog, maybe the default mode doesn't persist the changes?
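A minimal sketch of the quorum argument in that scenario (RF=3, QUORUM on both writes and reads; the sets below just mirror the numbered steps):

write_acked_by = {"B", "C"}                # steps 1-2: A was down, the QUORUM write was acked by B and C
lost_on_restart = {"B"}                    # step 4: B restarts before persisting the write
still_has_row = write_acked_by - lost_on_restart   # only C still has the row
read_quorum = {"A", "B"}                   # step 6: the QUORUM read happens to hit A and B
assert not (read_quorum & still_has_row)   # no overlap -> the read misses the row (step 7)

So the write is only safe against this sequence if each node persists (or replays) acknowledged writes across its own restart.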
Well the disconnect backtraces are normal, and we see them in c-s when nodes go down.
Looking at the scylla.yaml, there's commitlog_sync: periodic and commitlog_sync_period_in_ms: 10000. AFAIU this means that we save the commitlog to disk every 10 seconds.
But what happens if the node is restarted before the 10 seconds pass? Does it flush the commitlog one last time? If not, then we could lose some data.
api_address: 127.0.0.1
api_doc_dir: /opt/scylladb/api/api-doc/
api_ui_dir: /opt/scylladb/swagger-ui/dist/
batch_size_fail_threshold_in_kb: 1024
batch_size_warn_threshold_in_kb: 128
broadcast_rpc_address: 10.0.3.210
cluster_name: longevity-harry-2h-test-har-db-cluster-5ad966f5
commitlog_segment_size_in_mb: 32
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
endpoint_snitch: org.apache.cassandra.locator.Ec2Snitch
experimental: true
listen_address: 10.0.3.210
num_tokens: 256
partitioner: org.apache.cassandra.dht.Murmur3Partitioner
prometheus_address: 0.0.0.0
rpc_address: 10.0.3.210
seed_provider:
- class_name: org.apache.cassandra.locator.SimpleSeedProvider
parameters:
- seeds: 10.0.3.210
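To put a rough number on the periodic setting above: assuming the commitlog really is not flushed on an unclean stop (the open question here), everything acknowledged within the last sync window on that node is at risk. The write rate below is purely hypothetical, for illustration:

commitlog_sync_period_ms = 10_000   # from the scylla.yaml above
write_rate_per_node = 1_000         # hypothetical writes/s handled by the node
at_risk = write_rate_per_node * commitlog_sync_period_ms // 1_000
print(at_risk)                      # up to ~10,000 acknowledged writes per node could be lost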
Losing data from one node is quite normal. We are not reading the data here with CL=ONE.
The problem is that we might lose data on two nodes. One node is down so it doesn't know about the change, and the other could forget the change during the restart.
Hmm I ran this test 5 times (3, 4, 5, 6, 7) and it didn't fail.
Maybe the issue has been fixed in the meantime?
It's more that the specific nemesis in question wasn't run; SCT "randomly" (based on a seed) picks the order of the nemeses being used.
So we can change the test to run only the rolling restart nemesis.
Or we can supply a specific nemesis seed that would have that nemesis for the current SCT master. (Unfortunately it's still a moving target)
@cvybhu
I'm running it with a seed that would run the rolling restart nemesis:
https://jenkins.scylladb.com/job/scylla-master/job/reproducers/job/longevity-harry-2h-test/8/
Let's see if it's still happening. Also, the last report of it was on a 2023.1 RC, so if it clears on master and doesn't on 2023.1, we'd still need to figure it out.
Damn, it didn't exactly run the sequence I wanted to recreate, but it did fail similarly, on disrupt_disable_binary_gossip_execute_major_compaction.
This nemesis is fairly new, but it does something very simple on one node only: disablebinary and disablegossip, and then a compaction.
introduced in: https://github.com/scylladb/scylla-cluster-tests/pull/6478
Anyhow, again, disabling one node shouldn't break a test tool that works at CL=QUORUM.
Kernel Version: 5.15.0-1044-aws
Scylla version (or git commit hash): 5.4.0~dev-20230910.0656810c2877
with build-id 2156440484bcf26ee80adb62c33ea5d7f689a57c
Cluster size: 6 nodes (i4i.large)
Scylla Nodes used in this run:
OS / Image: ami-023effc126ea4f201
(aws: undefined_region)
Test: longevity-harry-2h-test
Test id: 0f07e6be-f5ad-412f-9788-39cc5857868b
Test name: scylla-master/reproducers/longevity-harry-2h-test
Test config file(s):
Thank you, so it does reproduce on the newest master.
I think it would be good to do another run with a slightly modified scylla.yaml.
I would like to replace:
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
With:
commitlog_sync: batch
commitlog_sync_batch_window_in_ms: 2
This should make all of the writes durable, so there would be no excuse for losing data when a node stops. If this doesn't fail, then that would confirm the theory that commitlog isn't flushed when the node stops.
@fruch could you start another run with a modified scylla.yaml? I can try to do it, but I have no idea how SCT works; there's no scylla.yaml argument in the Jenkins job. I would need some pointers.
@cvybhu see how to apply it in SCT configuration of the test: https://github.com/fruch/scylla-cluster-tests/commit/8eb2a892d631498e79ec8c3a938efa2fbfc2efdc
I've run it again in https://jenkins.scylladb.com/job/scylla-master/job/reproducers/job/longevity-harry-2h-test/10/
Ah that's a nice option, thanks!
Damn, it still failed, even with commitlog_sync: batch.
Here are the logs: https://cloudius-jenkins-test.s3.amazonaws.com/2c6ef546-2c2a-42d5-ad35-e6430deb0bb3/20230913_145818/loader-set-2c6ef546.tar.gz
This means that either my theory was wrong and it's not a commitlog problem at all, or the problem also occurs with the batch setting. It'll require further investigation.
Another idea: maybe the rolling restart needs to perform nodetool drain?
It looks like the docs say that you should run it before restarting, but the nemesis doesn't do it. There's even an existing issue about it: https://github.com/scylladb/scylla-cluster-tests/issues/5167
We could try, but there are quite a lot of other cases, i.e. nemeses where we won't be gracefully shutting down Scylla, like stopping/starting a single node, or killing a node with a -9 signal.
I wonder what makes this tool's validation more problematic in this case than c-s, s-b or Gemini.
Gemini works with another ScyllaDB cluster as an oracle, and this one has an in-memory oracle; I wonder why its expectations are different.
@cvybhu running it with drain (even that I doubt it's related) https://jenkins.scylladb.com/job/scylla-master/job/reproducers/job/longevity-harry-2h-test/11/
@cvybhu yep, still failing; adding drain didn't change the situation:
Kernel Version: 5.15.0-1045-aws
Scylla version (or git commit hash): 5.4.0~dev-20230921.a56a4b6226e6
with build-id 616f734e7c7fb5e3ee8898792b3c415d2574a132
Cluster size: 6 nodes (i4i.large)
Scylla Nodes used in this run:
OS / Image: ami-00f051bf1c684c01a
(aws: undefined_region)
Test: longevity-harry-2h-test
Test id: 4fe8bf36-0412-4080-b0a9-e4f88ef6b3af
Test name: scylla-master/reproducers/longevity-harry-2h-test
Test config file(s):
I tried to make a reproducer using dtest, but it didn't find any missing rows.
The code is here: https://gist.github.com/cvybhu/2b5e04f6f964a1ee2c2cbc6b1097ebef
It does a lot of rolling restarts while inserting rows and verifying that they exist in the table.
For some reason the test dies after about ~45mins every time I run it. No idea what's wrong, some timeout or something (?) -> log.txt. That doesn't really matter, what matters is that it doesn't find any missing rows in the first 45 mins.
I wasn't able to reproduce the issue using a simple reproducer, so I will dive deeper into Cassandra Harry and try to reproduce it locally using Harry and ccm.
Trying it locally is a good idea
I would try using the same docker image
2700 secs is the default dtest timeout, it can be enlarged with a decorator on the test.
Anyhow I would recommend opening a PR and we'll try to help with it.
I got cassandra-harry to run locally and I can confirm that it does reproduce locally. I made a script which starts a 6-node cluster and constantly runs rolling restarts while cassandra-harry does its thing. The error happens almost immediately after starting. It doesn't reproduce when I disable the rolling restarts.
That's great news, I have something local that I can work with now :)
Here's the script:
import time

from ccmlib.scylla_cluster import ScyllaCluster

# Start a local 6-node Scylla cluster with ccm and keep rolling-restarting it
# while cassandra-harry runs against it.
cluster = ScyllaCluster(
    "/home/jancio/.ccm",
    "roller",
    cassandra_version="release:2023.1.1",
    force_wait_for_cluster_start=True,
    manager=None,
    skip_manager_server=None)

print("Starting the cluster...")
cluster.populate(6).start()
print("Cluster started!")

print("Sleeping 60 seconds to let the cluster settle...")
time.sleep(60)
print("Sleep done")

def rolling_restart():
    print("Starting a rolling restart!")
    for node in cluster.nodelist():
        print(f"Restarting node {node.address()}")
        node.stop(wait_other_notice=True)
        node.start(wait_other_notice=True, wait_for_binary_proto=True)
        # Sleep a bit, otherwise it doesn't work
        time.sleep(30)
    print("Rolling restart done")

while True:
    rolling_restart()
    time.sleep(30)
2700 secs is the default dtest timeout, it can be enlarged with a decorator on the test. Anyhow I would recommend opening a PR and we'll try to help with it.
I only made this test to try and reproduce this bug, I didn't really mean for it to become an official test. Anyhow I have another reproducer now, so I'm gonna focus on that.
If it's proven to be a Scylla core issue, having a formal test would be recommended, but a local reproducer is a good start (I'll help to adapt it when and if needed).
I also ran the test using Cassandra, and it didn't fail.
I was hoping that this is something with cassandra-harry, but this data points towards it really being Scylla's fault. I'll keep investigating.
Okay maybe it's something with Harry after all.
I reduced the test to make it smaller by changing external.yaml to:
schema_provider:
  fixed:
    keyspace: harry
    table: test_table
    partition_keys:
      pk1: smallint
    clustering_keys:
      ck1: smallint
    regular_columns:
      r1: smallint
    static_keys:
...
runner:
  sequential:
    run_time: 2
    run_time_unit: "HOURS"
    visitors:
      - logging:
          row_visitor:
            mutating: {}
      - parallel_validate_recent_partitions:
          partition_count: 1
          queries_per_partition: 1
          concurrency: 1
          model:
            quiescent_checker: {}
And now it fails immediately after starting up, on both Scylla and Cassandra, even without a rolling restart.
In operations.log it's clear that it inserts these rows:
LTS: 0. Pd 28953. Cd 34363. M 0. OpId: 0 Statement CompiledStatement{cql='UPDATE harry.test_table USING TIMESTAMP 1696363940366000 SET r1 = ? WHERE pk1 = ? AND ck1 = ?;', bindings=(short)-23979,(short)-3815,(short)1595}
LTS: 0. Pd 28953. Cd 16514. M 0. OpId: 1 Statement CompiledStatement{cql='UPDATE harry.test_table USING TIMESTAMP 1696363940366000 SET r1 = ? WHERE pk1 = ? AND ck1 = ?;', bindings=(short)-8744,(short)-3815,(short)-16254}
LTS: 0. Pd 28953. Cd 54945. M 1. OpId: 2 Statement CompiledStatement{cql='INSERT INTO harry.test_table (pk1,ck1) VALUES (?, ?) USING TIMESTAMP 1696363940366000;', bindings=(short)-3815,(short)22177}
LTS: 0. Pd 28953. Cd 51946. M 1. OpId: 3 Statement CompiledStatement{cql='UPDATE harry.test_table USING TIMESTAMP 1696363940366000 SET r1 = ? WHERE pk1 = ? AND ck1 = ?;', bindings=(short)-23292,(short)-3815,(short)19178}
LTS: 0. Pd 28953. Finished
LTS: 1. Pd 43920. Cd 39188. M 0. OpId: 0 Statement CompiledStatement{cql='INSERT INTO harry.test_table (pk1,ck1,r1) VALUES (?, ?, ?) USING TIMESTAMP 1696363940366001;', bindings=(short)11152,(short)6420,(short)23445}
LTS: 1. Pd 43920. Cd 65320. M 0. OpId: 1 Statement CompiledStatement{cql='INSERT INTO harry.test_table (pk1,ck1,r1) VALUES (?, ?, ?) USING TIMESTAMP 1696363940366001;', bindings=(short)11152,(short)32552,(short)7293}
LTS: 1. Pd 43920. Cd 60286. M 1. OpId: 2 Statement CompiledStatement{cql='INSERT INTO harry.test_table (pk1,ck1) VALUES (?, ?) USING TIMESTAMP 1696363940366001;', bindings=(short)11152,(short)27518}
LTS: 1. Pd 43920. Cd 41517. M 1. OpId: 3 Statement CompiledStatement{cql='INSERT INTO harry.test_table (pk1,ck1) VALUES (?, ?) USING TIMESTAMP 1696363940366001;', bindings=(short)11152,(short)8749}
LTS: 1. Pd 43920. Finished
But Harry complains that the partition with pk1 = 11152 should be empty, which is obviously wrong.
ERROR [main] instance_id_IS_UNDEFINED 2023-10-03 22:12:20,475 HarryRunner.java:54 - Failed due to exception: java.util.concurrent.ExecutionException: harry.model.Model$ValidationException: Expected results to have the same number of results, but actual result iterator has more results.
Expected: []
Actual: [resultSetRow(43920L, 65320, values(7293L), lts(1L)), resultSetRow(43920L, 60286, values(-9223372036854775808L), lts(-9223372036854775808L)), resultSetRow(43920L, 41517, values(-9223372036854775808L), lts(-9223372036854775808L)), resultSetRow(43920L, 39188, values(23445L), lts(1L))]
Query: CompiledStatement{cql='SELECT pk1, ck1, r1, writetime(r1) FROM harry.test_table WHERE pk1 = ? ORDER BY ck1 DESC;', bindings=(short)11152}
Partition state:
Observed state:
resultSetRow(43920L, 65320, values(7293L), lts(1L))
resultSetRow(43920L, 60286, values(-9223372036854775808L), lts(-9223372036854775808L))
resultSetRow(43920L, 41517, values(-9223372036854775808L), lts(-9223372036854775808L))
resultSetRow(43920L, 39188, values(23445L), lts(1L))
Here's the full log (enriched with some custom debug prints): small-fail-log.txt
It looks like some kind of race condition in Harry - it doesn't expect the rows that it has just inserted a moment ago. I'm digging into the code, trying to figure out what's going on there.
It turns out that Harry doesn't correctly handle types smaller than 8 bytes.
Harry uses a clever random number generator which is able to work both ways. Given an index i, it generates a random 64-bit value v at that index. But it can also do the opposite: given a value v, it's able to determine the i from which this value was generated.
Sadly there's a bug: when the partition key column is smaller than 64 bits, Harry "adjusts entropy" by discarding all bytes above the column's size. So, if a column is of type smallint, it only keeps the lower 16 bits of the value and the rest gets zeroed out. Later, when Harry tries to figure out which i corresponds to this value, it doesn't work because the value lost 48 bits. As a result it calculates the wrong things and the test fails.
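A toy sketch of why the truncation breaks the scheme (this is not Harry's actual generator, just some invertible 64-bit mapping to show the idea):

MASK64 = (1 << 64) - 1
MULT = 0x9E3779B97F4A7C15            # odd constant -> multiplying is a bijection on 64-bit ints
MULT_INV = pow(MULT, -1, 1 << 64)    # its modular inverse mod 2**64

def descend(i):   # index -> 64-bit value
    return (i * MULT) & MASK64

def ascend(v):    # 64-bit value -> index, exact inverse of descend
    return (v * MULT_INV) & MASK64

i = 123456789
v = descend(i)
assert ascend(v) == i     # with all 64 bits the round trip recovers the index

v16 = v & 0xFFFF          # "adjusted entropy" for a smallint column: 48 bits are
                          # discarded, so 2**48 values collide on v16 and the
                          # original index can no longer be recovered from it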
It's good to know. I'll probably go and open an issue in Cassandra's JIRA, but this doesn't answer the question of why the original test failed. Conveniently, all column types in the default configuration are at least 64 bits wide, so they aren't affected by this bug ;)
It also doesn't answer why Cassandra was working OK, as you reported, while Scylla wasn't.
Installation details
Kernel Version: 5.13.0-1021-aws
Scylla version (or git commit hash): 5.0~rc3-20220406.f92622e0d
with build-id 2b79c4744216b294fdbd2f277940044c899156ea
Cluster size: 6 nodes (i3.large)
Scylla Nodes used in this run:
OS / Image: ami-0e4ae5e4a139c50f3
(aws: eu-north-1)
Test: longevity-harry-2h-test
Test id: 5ad966f5-8ded-437d-9224-7601b95777f1
Test name: scylla-staging/fruch/longevity-harry-2h-test
Test config file(s):
Issue description
During RollingRestartCluster, cassandra-harry fails in its verification, finding only 19 keys out of the 134 expected ones. It does the query six times and gets the same results in all of them.
This is the query, a prepared statement that runs with ConsistencyLevel.QUORUM:
In the cassandra-harry log, Partition state is the expected result and Observed state is the result from the query (notice that the expected is in reverse order compared to the observed).
With nemeses running, this happens 3 out of 4 times the test is run.
In steady state, without a nemesis, cassandra-harry finishes successfully on 5.0~rc3.
On master we are facing a coredump without nemesis, so we can't compare (#10553)
$ hydra investigate show-monitor 5ad966f5-8ded-437d-9224-7601b95777f1
$ hydra investigate show-logs 5ad966f5-8ded-437d-9224-7601b95777f1
Logs:
Jenkins job URL