scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0
52 stars 85 forks source link

Loaders exit with `AssertionError` after stress ends in the `longevity-10gb-3h-azure-test` CI job #6266

Open vponomaryov opened 1 year ago

vponomaryov commented 1 year ago

Prerequisites

Versions

Logs

Description

The scylla-5.2/longevity/longevity-10gb-3h-azure-test CI job works ok until the stress end after 3h of running. When the stress time end we get following error:

total,     789854673,   72119,   72119,   72119,    13.8,     7.6,    46.6,    82.3,   148.1,   297.8,10800.0,  0.00649,      0,      0,       0,       0,       0,       0
total,     789881585,   59494,   59494,   59494,    16.2,    10.4,    53.5,    75.9,    96.7,   112.0,10800.5,  0.00649,      0,      0,       0,       0,       0,       0

Results:
Op rate                   :   73,134 op/s  [WRITE: 73,134 op/s]
java.lang.AssertionError
Partition rate            :   73,134 pk/s  [WRITE: 73,134 pk/s]
    at com.datastax.driver.core.ConvictionPolicy$DefaultConvictionPolicy.signalConnectionClosed(ConvictionPolicy.java:90)
Row rate                  :   73,134 row/s [WRITE: 73,134 row/s]
    at com.datastax.driver.core.Connection.closeAsync(Connection.java:1095)
Latency mean              :   13.7 ms [WRITE: 13.7 ms]
    at com.datastax.driver.core.HostConnectionPool.discardAvailableConnections(HostConnectionPool.java:1011)
Latency median            :    8.3 ms [WRITE: 8.3 ms]
    at com.datastax.driver.core.HostConnectionPool.closeAsync(HostConnectionPool.java:972)
Latency 95th percentile   :   43.7 ms [WRITE: 43.7 ms]
    at com.datastax.driver.core.SessionManager.closeAsync(SessionManager.java:196)
Latency 99th percentile   :   73.6 ms [WRITE: 73.6 ms]
    at com.datastax.driver.core.Cluster$Manager.close(Cluster.java:1989)
Latency 99.9th percentile :  118.0 ms [WRITE: 118.0 ms]
    at com.datastax.driver.core.Cluster$Manager.access$200(Cluster.java:1560)
Latency max               : 42312.1 ms [WRITE: 42,312.1 ms]
    at com.datastax.driver.core.Cluster.closeAsync(Cluster.java:625)
Total partitions          : 789,881,585 [WRITE: 789,881,585]
    at com.datastax.driver.core.Cluster.close(Cluster.java:636)
Total errors              :          0 [WRITE: 0]
    at org.apache.cassandra.stress.util.JavaDriverClient.disconnect(JavaDriverClient.java:283)
Total GC count            : 0
    at org.apache.cassandra.stress.settings.StressSettings.disconnect(StressSettings.java:412)
Total GC memory           : 0.000 KiB
    at org.apache.cassandra.stress.StressAction.run(StressAction.java:98)
Total GC time             :    0.0 seconds
    at org.apache.cassandra.stress.Stress.run(Stress.java:143)
Avg GC time               :    NaN ms
    at org.apache.cassandra.stress.Stress.main(Stress.java:62)
StdDev GC time            :    0.0 ms
Total operation time      : 03:00:00

END

Second loader has the same java.lang.AssertionError.

It causes following critical event:

14:06:06  < t:2023-06-19 11:05:56,173 f:base.py         l:146  c:RemoteLibSSH2CmdRunner p:ERROR > Error executing command: "sudo  docker exec f485c3b0a257404ba97e3ad65f697b1244e0986802eef2ba46883a34965e5663 /bin/sh -c 'echo TAG: loader_idx:0-cpu_idx:0-keyspace_idx:1; STRESS_TEST_MARKER=IGU7I96F3584D8XA7J9A; cassandra-stress write no-warmup cl=QUORUM duration=180m -schema keyspace=keyspace1 '"'"'replication(factor=3) compaction(strategy=SizeTieredCompactionStrategy)'"'"' -mode cql3 native -rate threads=1000 -pop seq=1..10000000 -log interval=5 -node 10.0.0.5,10.0.0.6,10.0.0.7,10.0.0.8,10.0.0.9,10.0.0.10 -errors skip-unsupported-columns'"; Exit status: 1
14:06:06  < t:2023-06-19 11:05:56,303 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2023-06-19 11:05:56.274: (CassandraStressEvent Severity.CRITICAL) period_type=end event_id=d13c0e18-ee03-4ff0-b808-59c665985c15 during_nemesis=NodetoolDecommission duration=3h0m14s: node=Node longevity-10gb-3h-5-2-loader-node-9c9bf09e-eastus-1 [20.232.27.235 | 10.0.0.11] (seed: False)
14:06:06  < t:2023-06-19 11:05:56,303 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > stress_cmd=cassandra-stress write cl=QUORUM duration=180m -schema 'replication(factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -mode cql3 native -rate threads=1000 -pop seq=1..10000000 -log interval=5
14:06:06  < t:2023-06-19 11:05:56,303 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > errors:
14:06:06  < t:2023-06-19 11:05:56,303 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
14:06:06  < t:2023-06-19 11:05:56,303 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > Stress command completed with bad status 1: Failed to connect over JMX; not collecting these stats
14:06:06  < t:2023-06-19 11:05:56,303 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > com.datastax.driver.core.exceptions.Transport
14:06:06  F< t:2023-06-19 11:05:56,339 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2023-06-19 11:05:56.337: (InfoEvent Severity.NORMAL) period_type=not-set event_id=b425ee7a-896c-4337-a123-a1d13e871a79: message=TEST_END
14:06:06  < t:2023-06-19 11:05:56,355 f:tester.py       l:2778 c:LongevityTest        p:INFO  > TearDown is starting...

Steps to Reproduce

  1. Run the scylla-5.2/longevity/longevity-10gb-3h-azure-test CI job
  2. See error

Expected behavior: Loaders must exit correctly without any java.lang.AssertionError.

Actual behavior: Loaders fail.

fruch commented 1 year ago

I've raised it on java-driver: https://github.com/scylladb/java-driver/issues/213

but no one looked at it yet...