emdroid opened this issue 2 weeks ago
There are two slightly different failure modes:

1. The c-s output stops shortly after the node restart and the wait for the stress thread to finish times out:
17:45:52,271 tools.stress DEBUG stress.py :160 | c-s: Sleeping 2s...
17:45:54,275 tools.stress DEBUG stress.py :160 | c-s: Running READ with 2 threads 2 minutes
17:45:54,344 tools.stress DEBUG stress.py :160 | c-s: ===== Using optimized driver!!! =====
17:45:54,717 tools.stress DEBUG stress.py :160 | c-s: Connected to cluster: test, max pending requests per connection null, max connections per host 8
17:45:54,718 tools.stress DEBUG stress.py :160 | c-s: Datatacenter: datacenter1; Host: /127.0.93.1; Rack: rack1
17:45:54,718 tools.stress DEBUG stress.py :160 | c-s: Datatacenter: datacenter1; Host: /127.0.93.2; Rack: rack1
17:45:54,718 tools.stress DEBUG stress.py :160 | c-s: Datatacenter: datacenter1; Host: /127.0.93.3; Rack: rack1
17:45:54,855 tools.stress DEBUG stress.py :160 | c-s: type total ops, op/s, pk/s, row/s, mean, med, .95, .99, .999, max, time, stderr, errors, gc: #, max ms, sum ms, sdv ms, mb
17:45:55,027 tools.stress DEBUG stress.py :160 | c-s: total, 279, 279, 279, 279, 0.8, 0.8, 1.1, 1.7, 4.2, 4.2, 1.0, 0.00000, 0, 0, 0, 0, 0, 0
17:45:56,014 tools.stress DEBUG stress.py :160 | c-s: total, 4540, 4261, 4261, 4261, 0.5, 0.4, 0.8, 1.3, 3.0, 4.0, 2.0, 0.62018, 0, 0, 0, 0, 0, 0
17:45:57,008 tools.stress DEBUG stress.py :160 | c-s: total, 10293, 5753, 5753, 5753, 0.3, 0.3, 0.6, 0.9, 1.8, 4.2, 3.0, 0.38883, 0, 0, 0, 0, 0, 0
17:45:58,006 tools.stress DEBUG stress.py :160 | c-s: total, 18942, 8649, 8649, 8649, 0.2, 0.2, 0.3, 0.5, 1.3, 2.1, 4.0, 0.31837, 0, 0, 0, 0, 0, 0
17:45:59,006 tools.stress DEBUG stress.py :160 | c-s: total, 28904, 9962, 9962, 9962, 0.2, 0.2, 0.3, 0.4, 0.8, 1.1, 5.0, 0.26379, 0, 0, 0, 0, 0, 0
17:46:00,005 tools.stress DEBUG stress.py :160 | c-s: total, 39487, 10583, 10583, 10583, 0.2, 0.1, 0.3, 0.4, 1.3, 27.1, 6.0, 0.22436, 0, 0, 0, 0, 0, 0
17:46:01,004 tools.stress DEBUG stress.py :160 | c-s: total, 51734, 12247, 12247, 12247, 0.2, 0.1, 0.3, 0.3, 0.5, 4.6, 7.0, 0.19871, 0, 0, 0, 0, 0, 0
17:46:01,792 stress_restart_test DEBUG stress_restart_test.py:54 | Restarting second node...
17:46:01,796 tools.stress DEBUG stress.py :160 | c-s: com.datastax.driver.core.exceptions.DriverInternalError: An unexpected error happened while sending requests
17:46:01,796 tools.stress DEBUG stress.py :160 | c-s: com.datastax.driver.core.exceptions.TransportException: [127.0.93.2/127.0.93.2:9042] Connection has been closed
17:46:02,125 tools.stress DEBUG stress.py :160 | c-s: total, 62065, 10331, 10331, 10331, 0.1, 0.1, 0.2, 0.3, 0.4, 3.2, 8.0, 0.17141, 0, 0, 0, 0, 0, 0
17:46:02,805 tools.stress DEBUG stress.py :160 | c-s: WARN 17:46:02,804 Error creating netty channel to 127.0.93.2/127.0.93.2:9042
17:46:02,805 tools.stress DEBUG stress.py :160 | c-s: com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 127.0.93.2/127.0.93.2:9042
17:46:02,806 tools.stress DEBUG stress.py :160 | c-s: Caused by: java.net.ConnectException: Connection refused
...
17:46:29,788 ccm INFO cluster.py :760 | node2: Started scylla: pid: 1212336
17:46:29,788 ccm DEBUG cluster.py :757 | node2: Starting scylla-jmx: args=['/home/emilm/.dtest/dtest-a68r2l6w/test/node2/bin/symlinks/scylla-jmx', '-Dapiaddress=127.0.93.2', '-Djavax.management.builder.initial=com.scylladb.jmx.utils.APIBuilder', '-Djava.rmi.server.hostname=127.0.93.2', '-Dcom.sun.management.jmxremote', '-Dcom.sun.management.jmxremote.host=127.0.93.2', '-Dcom.sun.management.jmxremote.port=7199', '-Dcom.sun.management.jmxremote.rmi.port=7199', '-Dcom.sun.management.jmxremote.local.only=false', '-Xmx256m', '-XX:+UseSerialGC', '-Dcom.sun.management.jmxremote.authenticate=false', '-Dcom.sun.management.jmxremote.ssl=false', '-jar', '/home/emilm/.dtest/dtest-a68r2l6w/test/node2/bin/scylla-jmx-1.0.jar']
17:46:30,242 stress_restart_test DEBUG stress_restart_test.py:79 | Waiting until read stress thread will finish running
In the above, note that the c-s output ends soon after the "Restarting second node..." log message; there is no further output and the wait times out.
2. The c-s completes after the 2m scheduled time, but gets stuck afterwards (still causing the test to time out):
18:53:09,245 tools.stress DEBUG stress.py :160 | c-s: Sleeping 2s...
18:53:11,248 tools.stress DEBUG stress.py :160 | c-s: Running READ with 2 threads 2 minutes
18:53:11,307 tools.stress DEBUG stress.py :160 | c-s: ===== Using optimized driver!!! =====
18:53:11,602 tools.stress DEBUG stress.py :160 | c-s: Connected to cluster: test, max pending requests per connection null, max connections per host 8
18:53:11,602 tools.stress DEBUG stress.py :160 | c-s: Datatacenter: datacenter1; Host: /127.0.15.1; Rack: rack1
18:53:11,602 tools.stress DEBUG stress.py :160 | c-s: Datatacenter: datacenter1; Host: /127.0.15.2; Rack: rack1
18:53:11,602 tools.stress DEBUG stress.py :160 | c-s: Datatacenter: datacenter1; Host: /127.0.15.3; Rack: rack1
18:53:11,695 tools.stress DEBUG stress.py :160 | c-s: type total ops, op/s, pk/s, row/s, mean, med, .95, .99, .999, max, time, stderr, errors, gc: #, max ms, sum ms, sdv ms, mb
18:53:12,014 tools.stress DEBUG stress.py :160 | c-s: total, 904, 904, 904, 904, 0.6, 0.6, 1.0, 1.2, 2.9, 2.9, 1.0, 0.00000, 0, 0, 0, 0, 0, 0
18:53:13,007 tools.stress DEBUG stress.py :160 | c-s: total, 6001, 5097, 5097, 5097, 0.4, 0.4, 0.6, 0.8, 1.6, 1.8, 2.0, 0.49388, 0, 0, 0, 0, 0, 0
18:53:14,005 tools.stress DEBUG stress.py :160 | c-s: total, 11348, 5347, 5347, 5347, 0.4, 0.3, 0.6, 0.8, 2.5, 8.2, 3.0, 0.31138, 0, 0, 0, 0, 0, 0
18:53:15,005 tools.stress DEBUG stress.py :160 | c-s: total, 17614, 6266, 6266, 6266, 0.3, 0.3, 0.5, 0.6, 0.9, 1.3, 4.0, 0.23457, 0, 0, 0, 0, 0, 0
18:53:16,006 tools.stress DEBUG stress.py :160 | c-s: total, 26322, 8708, 8708, 8708, 0.2, 0.2, 0.4, 0.5, 0.8, 1.8, 5.0, 0.21432, 0, 0, 0, 0, 0, 0
18:53:17,004 tools.stress DEBUG stress.py :160 | c-s: total, 33304, 6982, 6982, 6982, 0.3, 0.2, 0.5, 0.6, 0.9, 1.2, 6.0, 0.17578, 0, 0, 0, 0, 0, 0
18:53:18,005 tools.stress DEBUG stress.py :160 | c-s: total, 42037, 8733, 8733, 8733, 0.2, 0.2, 0.4, 0.5, 0.6, 1.8, 7.0, 0.15586, 0, 0, 0, 0, 0, 0
18:53:18,731 stress_restart_test DEBUG stress_restart_test.py:54 | Restarting second node...
18:53:18,735 tools.stress DEBUG stress.py :160 | c-s: com.datastax.driver.core.exceptions.DriverInternalError: An unexpected error happened while sending requests
18:53:19,062 tools.stress DEBUG stress.py :160 | c-s: total, 48701, 6664, 6664, 6664, 0.2, 0.2, 0.4, 0.5, 0.7, 6.8, 8.0, 0.13510, 0, 0, 0, 0, 0, 0
18:53:19,734 tools.stress DEBUG stress.py :160 | c-s: WARN 18:53:19,734 Error creating netty channel to /127.0.15.2:9042
...
18:53:20,004 tools.stress DEBUG stress.py :160 | c-s: total, 55181, 6480, 6480, 6480, 0.1, 0.1, 0.2, 0.3, 2.5, 12.1, 9.0, 0.11935, 0, 0, 0, 0, 0, 0
18:53:21,003 tools.stress DEBUG stress.py :160 | c-s: total, 60081, 4900, 4900, 4900, 0.2, 0.2, 0.3, 0.4, 0.5, 0.7, 10.0, 0.11141, 0, 0, 0, 0, 0, 0
18:53:21,740 tools.stress DEBUG stress.py :160 | c-s: WARN 18:53:21,735 Error creating netty channel to /127.0.15.2:9042
18:53:21,746 tools.stress DEBUG stress.py :160 | c-s: com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /127.0.15.2:9042
18:53:21,751 tools.stress DEBUG stress.py :160 | c-s: Caused by: java.net.ConnectException: Connection refused
18:53:21,751 tools.stress DEBUG stress.py :160 | c-s: WARN 18:53:21,746 Error creating netty channel to 127.0.15.2/127.0.15.2:9042
...
18:53:44,002 tools.stress DEBUG stress.py :160 | c-s: total, 185153, 5940, 5940, 5940, 0.2, 0.2, 0.2, 0.3, 0.6, 0.9, 33.0, 0.04030, 0, 0, 0, 0, 0, 0
18:53:45,003 tools.stress DEBUG stress.py :160 | c-s: total, 190028, 4875, 4875, 4875, 0.2, 0.2, 0.2, 0.3, 1.2, 3.8, 34.0, 0.03945, 0, 0, 0, 0, 0, 0
18:53:46,003 tools.stress DEBUG stress.py :160 | c-s: total, 195488, 5460, 5460, 5460, 0.2, 0.2, 0.2, 0.3, 0.3, 1.0, 35.0, 0.03836, 0, 0, 0, 0, 0, 0
18:53:46,807 ccm INFO cluster.py :760 | node2: Started scylla: pid: 1234981
18:53:46,808 ccm DEBUG cluster.py :757 | node2: Starting scylla-jmx: args=['/home/emilm/.dtest/dtest-nsuqpind/test/node2/bin/symlinks/scylla-jmx', '-Dapiaddress=127.0.15.2', '-Djavax.management.builder.initial=com.scylladb.jmx.utils.APIBuilder', '-Djava.rmi.server.hostname=127.0.15.2', '-Dcom.sun.management.jmxremote', '-Dcom.sun.management.jmxremote.host=127.0.15.2', '-Dcom.sun.management.jmxremote.port=7199', '-Dcom.sun.management.jmxremote.rmi.port=7199', '-Dcom.sun.management.jmxremote.local.only=false', '-Xmx256m', '-XX:+UseSerialGC', '-Dcom.sun.management.jmxremote.authenticate=false', '-Dcom.sun.management.jmxremote.ssl=false', '-jar', '/home/emilm/.dtest/dtest-nsuqpind/test/node2/bin/scylla-jmx-1.0.jar']
18:53:47,002 tools.stress DEBUG stress.py :160 | c-s: total, 201321, 5833, 5833, 5833, 0.2, 0.2, 0.2, 0.2, 0.6, 2.7, 36.0, 0.03727, 0, 0, 0, 0, 0, 0
18:53:47,110 stress_restart_test DEBUG stress_restart_test.py:79 | Waiting until read stress thread will finish running
18:53:48,013 tools.stress DEBUG stress.py :160 | c-s: total, 207096, 5775, 5775, 5775, 0.2, 0.1, 0.2, 0.3, 0.4, 0.7, 37.0, 0.03624, 0, 0, 0, 0, 0, 0
18:53:49,004 tools.stress DEBUG stress.py :160 | c-s: total, 211181, 4085, 4085, 4085, 0.2, 0.2, 0.4, 0.6, 0.8, 1.2, 38.0, 0.03625, 0, 0, 0, 0, 0, 0
18:53:50,003 tools.stress DEBUG stress.py :160 | c-s: total, 215121, 3940, 3940, 3940, 0.2, 0.2, 0.4, 0.6, 1.0, 1.5, 39.0, 0.03636, 0, 0, 0, 0, 0, 0
...
18:55:10,003 tools.stress DEBUG stress.py :160 | c-s: total, 556028, 5482, 5482, 5482, 0.2, 0.2, 0.3, 0.4, 0.6, 2.4, 119.0, 0.02023, 0, 0, 0, 0, 0, 0
18:55:11,003 tools.stress DEBUG stress.py :160 | c-s: total, 560133, 4105, 4105, 4105, 0.2, 0.2, 0.4, 0.4, 0.5, 1.5, 120.0, 0.02010, 0, 0, 0, 0, 0, 0
18:55:12,003 tools.stress DEBUG stress.py :160 | c-s: total, 563287, 3154, 3154, 3154, 0.2, 0.2, 0.4, 0.4, 0.5, 1.3, 121.0, 0.02017, 0, 0, 0, 0, 0, 0
---------------------------- Captured log teardown -----------------------------
18:58:47,191 tools.stress DEBUG stress.py :160 | c-s: ERROR 18:58:47,190 [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds
18:58:47,193 tools.stress DEBUG stress.py :160 | c-s: ERROR 18:58:47,192 [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds
18:58:48,191 tools.stress DEBUG stress.py :160 | c-s: ERROR 18:58:48,191 [Control connection] Cannot connect to any host, scheduling retry in 2000 milliseconds
18:58:48,192 tools.stress DEBUG stress.py :160 | c-s: WARN 18:58:48,191 Error creating netty channel to 127.0.15.1/127.0.15.1:9042
In this case, c-s seems to actually complete (the last output comes roughly 2 minutes after the stress run started), and it even keeps printing output after the wait for it to finish has begun. But then it still gets stuck and the wait times out. Also, c-s doesn't print the final stats summary, which it normally does.
cc @fruch
cc @dkropachev
@emdroid, thanks for the reproducer, we will take a look at it.
Note that it might also have something to do with this: https://github.com/scylladb/scylladb/issues/19645#issuecomment-2456656785
I prepared a simple reproducer in scylladb/scylla-dtest#5261 - it fails 1-3 out of 10 times on the older branches (2024.1 enterprise, scylla 5.4). With the current master the failure rate is much lower (sometimes just 1 of 25), probably because of some fixes that have already been made, but the error still shows up sometimes.
The branches mentioned here, are they scylla or dtest?
The branches are scylla; for dtest I used "next" with scylla master and "branch-2024.1" with scylla enterprise 2024.1.
@emdroid Can you try running it again with Cassandra-Stress 3.15? Support for it got merged 2 days ago and I'm not sure if you ran it with the latest.
I tried to run the dtest after I built the c-s in docker and tagged the image as 3.15 locally, but I'm not sure whether it is actually being used (via docker). If I run the dtest locally (I'm using the "dist-unified-dev" build of scylla and "--scylla-version=local_tarball" when running the dtest), I can see the c-s processes directly on the machine, i.e. it doesn't seem to be run there via docker.
For example:
$ ps aux | grep "cassandra-stress"
emilm 895297 0.0 0.0 228692 3324 pts/31 S+ 21:40 0:00 /bin/sh /home/emilm/.ccm/scylla-repository/local_tarball/scylla-core-package/scylla-tools/tools/bin/cassandra-stress read duration=2m no-warmup -rate threads=2 -pop seq=1...10000 -node 127.0.13.1 -port jmx=7199
I know the above is just the runner script, but I also see the Java process running on the machine, which seems to be the actual c-s process:
$ ps aux | grep Stress
emilm 895335 19.7 0.4 40184360 535204 pts/31 Sl+ 21:40 0:54 /usr/lib/jvm/java-21-openjdk-21.0.4.0.7-2.fc40.x86_64//bin/java -server -ea -cp <some_very_large_classpath_and_params> -Dcassandra.storagedir=/home/emilm/.ccm/scylla-repository/local_tarball/scylla-core-package/scylla-tools/tools/bin/../../data -Dlogback.configurationFile=logback-tools.xml org.apache.cassandra.stress.Stress read duration=2m no-warmup -rate threads=2 -pop seq=1...10000 -node 127.0.13.1 -port jmx=7199
So I'm not really sure how to run the dtest so that it picks up the docker image (and I also don't know where the unified dist tarball eventually gets the c-s from).
What branch of dtest is it?
Commit that introduced docker-based c-s: https://github.com/scylladb/scylla-dtest/commit/3e2457d856919c3fed724b8a5d01b5903727a76c
Here is the trace from when c-s got stuck in dtest: c-s-stuck-trage.log
As you can see, the main loop is stuck at app//org.apache.cassandra.stress.StressAction.run(StressAction.java:295), which is:
done.await();
Which means one of the worker threads is stuck. Now here is the stack trace of the stuck worker:
"Thread-2" prio=5 Id=59 WAITING on com.datastax.driver.core.DefaultResultSetFuture@2ad5ebfd
at java.base@11.0.25/jdk.internal.misc.Unsafe.park(Native Method)
- waiting on com.datastax.driver.core.DefaultResultSetFuture@2ad5ebfd
at java.base@11.0.25/java.util.concurrent.locks.LockSupport.park(Unknown Source)
at app//com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:561)
at app//com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:247)
at app//com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:308)
at app//com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:58)
at app//org.apache.cassandra.stress.util.JavaDriverClient.executePrepared(JavaDriverClient.java:247)
at app//org.apache.cassandra.stress.operations.predefined.CqlOperation$JavaDriverWrapper.execute(CqlOperation.java:308)
Which means that thread is stuck waiting for the DefaultResultSetFuture result from the driver.
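To make the failure mode concrete, here is a minimal Java sketch of that pattern. This is not the actual cassandra-stress code; the class name and the CompletableFuture stand-in for DefaultResultSetFuture are illustrative only. The main thread awaits a latch that each worker counts down when it finishes, so a single worker parked forever on an untimed wait for its result keeps done.await() from ever returning.

// Minimal sketch of the hang pattern (illustrative only, not the actual cassandra-stress code).
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;

public class StuckLatchSketch {
    public static void main(String[] args) throws InterruptedException {
        int threads = 2;
        CountDownLatch done = new CountDownLatch(threads);

        for (int i = 0; i < threads; i++) {
            new Thread(() -> {
                // Stand-in for DefaultResultSetFuture.getUninterruptibly(): a future
                // whose completion was lost when the connection closed blocks here
                // forever (no timeout), so countDown() below is never reached.
                new CompletableFuture<Void>().join();
                done.countDown();
            }, "worker-" + i).start();
        }

        // Equivalent of done.await() in StressAction.run(): hangs for as long as
        // any worker is still parked on its future.
        done.await();
        System.out.println("all workers finished");
    }
}

Run as-is, the sketch simply hangs, which is the point: once a single worker's response is lost, nothing in this pattern can unblock the main thread, matching the thread dump above.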
Thanks @dkropachev for following up on this.
Yes, I was eventually able to debug this as well and found the same place. In particular, one of the read threads gets stuck after this message:
10:36:27,972 tools.stress_thread_docker DEBUG stress_thread_docker.py:167 | com.datastax.driver.core.exceptions.DriverInternalError: An unexpected error happened while sending requests
10:36:27,972 tools.stress_thread_docker DEBUG stress_thread_docker.py:167 | com.datastax.driver.core.exceptions.TransportException: [127.0.7.2/127.0.7.2:9042] Connection has been closed
The current op completes, but the next one gets stuck.
Curiously, this only happens when there are 2 Consumer read threads; if I change the thread count to just 1, the problem cannot be reproduced.
This is the behavior observed in https://github.com/scylladb/scylladb/issues/16002 and also the likely cause of https://github.com/scylladb/scylladb/issues/16219 (although that one hasn't been reproduced recently).
In particular, when the node is restarted while c-s is running, everything works if the node restart is initiated immediately (i.e. before the actual stress run begins; there is a 2s delay at the beginning of c-s).
But if the node restart is initiated only after c-s is already running (after the 2s delay), it often causes c-s to get stuck (not every time, but fairly often).