scylladb / java-driver

ScyllaDB Java Driver for ScyllaDB and Apache Cassandra, based on the DataStax Java Driver

[c-s][3.x] driver shutdown stuck #258

Open fruch opened 11 months ago

fruch commented 11 months ago

Issue description

Toward the end of the stress command one node went down (10.0.1.17), and the driver shutdown appears to be stuck:

total,       4425734,   19327,   19327,   19327,     4.1,     3.9,     6.9,     9.9,    24.0,    58.8,  325.0,  0.03100,      0,      0,       0,       0,       0,       0
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
total,       4508016,   16456,   16456,   16456,     4.8,     4.0,     9.6,    19.7,    91.6,   218.0,  330.0,  0.03065,      0,      0,       0,       0,       0,       0
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Channel has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Channel has been closed
total,       4583006,   14998,   14998,   14998,     5.3,     4.5,    11.2,    18.1,    31.7,    62.0,  335.0,  0.03018,      0,      0,       0,       0,       0,       0
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Channel has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
total,       4677544,   18908,   18908,   18908,     4.2,     4.0,     6.4,     9.0,    14.4,    20.2,  340.0,  0.03007,      0,      0,       0,       0,       0,       0
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
total,       4780430,   20577,   20577,   20577,     3.9,     3.7,     6.6,     8.5,    13.4,    16.5,  345.0,  0.03022,      0,      0,       0,       0,       0,       0
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
total,       4882713,   20457,   20457,   20457,     3.9,     3.7,     6.6,     9.1,    14.4,    26.8,  350.0,  0.03032,      0,      0,       0,       0,       0,       0
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
total,       4985711,   20600,   20600,   20600,     3.9,     3.7,     6.6,     9.0,    13.9,    23.9,  355.0,  0.03040,      0,      0,       0,       0,       0,       0
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
total,       5048570,   20374,   20374,   20374,     3.9,     3.7,     6.7,     9.2,    19.2,    24.3,  358.1,  0.03041,      0,      0,       0,       0,       0,       0
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing

com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing

com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Error writing
Results:
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
Op rate                   :   14,099 op/s  [WRITE: 14,099 op/s]
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
Partition rate            :   14,099 pk/s  [WRITE: 14,099 pk/s]
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
Row rate                  :   14,099 row/s [WRITE: 14,099 row/s]
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
Latency mean              :    5.6 ms [WRITE: 5.6 ms]
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
Latency median            :    4.8 ms [WRITE: 4.8 ms]
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
Latency 95th percentile   :   11.6 ms [WRITE: 11.6 ms]
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
Latency 99th percentile   :   20.7 ms [WRITE: 20.7 ms]
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
Latency 99.9th percentile :   41.6 ms [WRITE: 41.6 ms]
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
Latency max               :  365.7 ms [WRITE: 365.7 ms]
com.datastax.driver.core.exceptions.TransportException: [ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042] Connection has been closed
Total partitions          :  5,048,570 [WRITE: 5,048,570]
WARN  00:44:50,156 Error creating netty channel to ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042
Total errors              :          0 [WRITE: 0]
com.datastax.shaded.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: ip-10-0-1-17.eu-north-1.compute.internal/10.0.1.17:9042
Total GC count            : 0
Caused by: java.net.ConnectException: Connection refused
Total GC memory           : 0.000 KiB
    at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
Total GC time             :    0.0 seconds
    at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
Avg GC time               :    NaN ms
    at com.datastax.shaded.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
StdDev GC time            :    0.0 ms
    at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
Total operation time      : 00:05:58
    at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)

    at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
END

Impact

Stress command is getting stuck

How frequently does it reproduce?

Seen once so far in the nightly runs.

Installation details

Kernel Version: 5.15.0-1049-aws
Scylla version (or git commit hash): 5.5.0~dev-20231113.7b08886e8dd8 with build-id 7548c48606c5b58f282fc2b596019226de0df6ed

Cluster size: 3 nodes (i4i.large)

Scylla Nodes used in this run:

OS / Image: ami-02ea62bee686ef03b (aws: undefined_region)

Test: longevity-5gb-1h-NodeTerminateAndReplace-aws-test
Test id: dfb19859-9fbb-4408-a2be-a42c7400f7cb
Test name: scylla-master/nemesis/longevity-5gb-1h-NodeTerminateAndReplace-aws-test
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor dfb19859-9fbb-4408-a2be-a42c7400f7cb`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=dfb19859-9fbb-4408-a2be-a42c7400f7cb)
- Show all stored logs command: `$ hydra investigate show-logs dfb19859-9fbb-4408-a2be-a42c7400f7cb`

Logs:

- **db-cluster-dfb19859.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/dfb19859-9fbb-4408-a2be-a42c7400f7cb/20231114_120128/db-cluster-dfb19859.tar.gz
- **sct-runner-events-dfb19859.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/dfb19859-9fbb-4408-a2be-a42c7400f7cb/20231114_120128/sct-runner-events-dfb19859.tar.gz
- **sct-dfb19859.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/dfb19859-9fbb-4408-a2be-a42c7400f7cb/20231114_120128/sct-dfb19859.log.tar.gz
- **loader-set-dfb19859.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/dfb19859-9fbb-4408-a2be-a42c7400f7cb/20231114_120128/loader-set-dfb19859.tar.gz
- **monitor-set-dfb19859.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/dfb19859-9fbb-4408-a2be-a42c7400f7cb/20231114_120128/monitor-set-dfb19859.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/nemesis/job/longevity-5gb-1h-NodeTerminateAndReplace-aws-test/83/)
[Argus](https://argus.scylladb.com/test/0af3b0dc-d7dd-4a83-b906-8fa3ebb0e676/runs?additionalRuns[]=dfb19859-9fbb-4408-a2be-a42c7400f7cb)
enaydanov commented 11 months ago

Got a similar problem after the DecommissionStreamingErr nemesis during longevity-10gb-3h-azure-test (10.0.0.5 is the target node for this nemesis):

TAG: loader_idx:1-cpu_idx:0-keyspace_idx:1
08:37:19,314 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Found resource [logback-tools.xml] at [file:/etc/scylla/cassandra/logback-tools.xml]
08:37:19,315 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback-tools.xml] occurs multiple times on the classpath.
08:37:19,315 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback-tools.xml] occurs at [file:/logback-tools.xml]
08:37:19,315 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback-tools.xml] occurs at [file:/etc/scylla/cassandra/logback-tools.xml]
08:37:19,412 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not set
08:37:19,413 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
08:37:19,420 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [STDERR]
08:37:19,428 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property
08:37:19,489 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - Setting level of ROOT logger to WARN
08:37:19,489 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [STDERR] to Logger[ROOT]
08:37:19,489 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - End of configuration.
08:37:19,490 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@2781e022 - Registering current configuration as safe fallback point

******************** Stress Settings ********************
Command:
  Type: write
  Count: -1
  Duration: 180 MINUTES

...

===== Using optimized driver!!! =====
Connected to cluster: longevity-10gb-3h-master-db-cluster-6c30fd46, max pending requests per connection null, max connections per host 8
Datatacenter: eastus; Host: /10.0.0.5; Rack: eastus
Datatacenter: eastus; Host: /10.0.0.6; Rack: eastus
Datatacenter: eastus; Host: /10.0.0.7; Rack: eastus
Datatacenter: eastus; Host: /10.0.0.8; Rack: eastus
Datatacenter: eastus; Host: /10.0.0.9; Rack: eastus
Datatacenter: eastus; Host: /10.0.0.10; Rack: eastus
Created keyspaces. Sleeping 6s for propagation.
Sleeping 2s...
Running WRITE with 1000 threads 180 minutes
type       total ops,    op/s,    pk/s,   row/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr, errors,  gc: #,  max ms,  sum ms,  sdv ms,      mb
Failed to connect over JMX; not collecting these stats
total,        132151,   26430,   26430,   26430,    33.3,    19.7,   109.8,   185.7,   335.8,   459.8,    5.0,  0.00000,      0,      0,       0,       0,       0,       0
total,        366429,   46856,   46856,   46856,    21.3,    17.7,    51.7,    73.6,   110.0,   154.9,   10.0,  0.17570,      0,      0,       0,       0,       0,       0

...

total,     662360477,   78539,   78539,   78539,    12.7,     6.7,    45.6,    75.8,   114.6,   161.6,10800.0,  0.00426,      0,      0,       0,       0,       0,       0
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
total,     662377070,   63469,   63469,   63469,    14.8,     9.3,    46.0,    68.8,    89.3,   100.1,10800.3,  0.00427,      0,      0,       0,       0,       0,       0
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response

com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response

com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Results:
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Op rate                   :   61,330 op/s  [WRITE: 61,330 op/s]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Partition rate            :   61,330 pk/s  [WRITE: 61,330 pk/s]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Row rate                  :   61,330 row/s [WRITE: 61,330 row/s]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Latency mean              :   16.3 ms [WRITE: 16.3 ms]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Latency median            :    6.8 ms [WRITE: 6.8 ms]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Latency 95th percentile   :   61.4 ms [WRITE: 61.4 ms]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Latency 99th percentile   :  149.6 ms [WRITE: 149.6 ms]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Latency 99.9th percentile :  488.1 ms [WRITE: 488.1 ms]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Latency max               : 42312.1 ms [WRITE: 42,312.1 ms]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Total partitions          : 662,377,070 [WRITE: 662,377,070]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Total errors              :          0 [WRITE: 0]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Total GC count            : 0
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Total GC memory           : 0.000 KiB
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Total GC time             :    0.0 seconds
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Avg GC time               :    NaN ms
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
StdDev GC time            :    0.0 ms
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Total operation time      : 03:00:00
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response

com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
END
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response

...

com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Connection has been closed

...

com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Connection has been closed

Installation details

Kernel Version: 5.15.0-1051-azure
Scylla version (or git commit hash): 5.5.0~dev-20231118.6bcf3ac86c64 with build-id 61307c96d481b470546e0fd8d94a69a544c4f05c

Cluster size: 6 nodes (Standard_L8s_v3)

Scylla Nodes used in this run:

OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/scylla-images/providers/Microsoft.Compute/images/scylla-5.5.0-dev-x86_64-2023-11-18T06-26-15 (azure: undefined_region)

Test: longevity-10gb-3h-azure-test
Test id: 6c30fd46-3432-426f-824f-362eb6107b4b
Test name: scylla-master/longevity/longevity-10gb-3h-azure-test
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 6c30fd46-3432-426f-824f-362eb6107b4b`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=6c30fd46-3432-426f-824f-362eb6107b4b)
- Show all stored logs command: `$ hydra investigate show-logs 6c30fd46-3432-426f-824f-362eb6107b4b`

Logs:

- **db-cluster-6c30fd46.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/6c30fd46-3432-426f-824f-362eb6107b4b/20231118_120057/db-cluster-6c30fd46.tar.gz
- **sct-runner-events-6c30fd46.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/6c30fd46-3432-426f-824f-362eb6107b4b/20231118_120057/sct-runner-events-6c30fd46.tar.gz
- **sct-6c30fd46.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/6c30fd46-3432-426f-824f-362eb6107b4b/20231118_120057/sct-6c30fd46.log.tar.gz
- **loader-set-6c30fd46.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/6c30fd46-3432-426f-824f-362eb6107b4b/20231118_120057/loader-set-6c30fd46.tar.gz
- **monitor-set-6c30fd46.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/6c30fd46-3432-426f-824f-362eb6107b4b/20231118_120057/monitor-set-6c30fd46.tar.gz
- **parallel-timelines-report-6c30fd46.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/6c30fd46-3432-426f-824f-362eb6107b4b/20231118_120057/parallel-timelines-report-6c30fd46.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/longevity/job/longevity-10gb-3h-azure-test/257/)
[Argus](https://argus.scylladb.com/test/aa05fe67-89bd-4e30-88db-5e2b2e3d986d/runs?additionalRuns[]=6c30fd46-3432-426f-824f-362eb6107b4b)

The job has been failing like this since at least build 244.

mykaul commented 10 months ago

@avelanarius - is anyone looking at this?

avelanarius commented 10 months ago

@avelanarius - is anyone looking at this?

Currently no one, with other tasks taking priority.

vponomaryov commented 10 months ago

cassandra-stress hangs after finishing its stress work:

total,     527518456,   55284,   55284,   55284,    18.3,    12.6,    55.1,    83.6,   105.1,   116.9,10800.9,  0.11788,      0,      0,       0,       0,       0,       0
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed

com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed

com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Results:
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Op rate                   :   48,840 op/s  [WRITE: 48,840 op/s]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Partition rate            :   48,840 pk/s  [WRITE: 48,840 pk/s]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Row rate                  :   48,840 row/s [WRITE: 48,840 row/s]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Latency mean              :   20.3 ms [WRITE: 20.3 ms]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Latency median            :   12.7 ms [WRITE: 12.7 ms]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Latency 95th percentile   :   55.2 ms [WRITE: 55.2 ms]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Latency 99th percentile   :   87.8 ms [WRITE: 87.8 ms]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Latency 99.9th percentile :  169.2 ms [WRITE: 169.2 ms]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Latency max               : 90664.1 ms [WRITE: 90,664.1 ms]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Total partitions          : 527,518,456 [WRITE: 527,518,456]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Total errors              :          0 [WRITE: 0]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Total GC count            : 0
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Total GC memory           : 0.000 KiB
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Total GC time             :    0.0 seconds
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Avg GC time               :    NaN ms
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
StdDev GC time            :    0.0 ms
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Total operation time      : 03:00:00
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed

com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
END
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
...

Installation details

Kernel Version: 5.15.0-1048-gcp
Scylla version (or git commit hash): 2024.1.0~rc2-20231217.f57117d9cfe3 with build-id 3a4d2dfe8ef4eef5454badb34d1710a5f36a859c

Cluster size: 6 nodes (n2-highmem-16)

Scylla Nodes used in this run:

OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/2925864426790911027 (gce: undefined_region)

Test: longevity-10gb-3h-gce-test
Test id: f4ac5c0d-4243-4afe-9c28-b9e5fe051b92
Test name: enterprise-2024.1/longevity/longevity-10gb-3h-gce-test
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor f4ac5c0d-4243-4afe-9c28-b9e5fe051b92`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=f4ac5c0d-4243-4afe-9c28-b9e5fe051b92)
- Show all stored logs command: `$ hydra investigate show-logs f4ac5c0d-4243-4afe-9c28-b9e5fe051b92`

Logs:

- **db-cluster-f4ac5c0d.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/f4ac5c0d-4243-4afe-9c28-b9e5fe051b92/20231218_000118/db-cluster-f4ac5c0d.tar.gz
- **sct-runner-events-f4ac5c0d.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/f4ac5c0d-4243-4afe-9c28-b9e5fe051b92/20231218_000118/sct-runner-events-f4ac5c0d.tar.gz
- **sct-f4ac5c0d.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/f4ac5c0d-4243-4afe-9c28-b9e5fe051b92/20231218_000118/sct-f4ac5c0d.log.tar.gz
- **loader-set-f4ac5c0d.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/f4ac5c0d-4243-4afe-9c28-b9e5fe051b92/20231218_000118/loader-set-f4ac5c0d.tar.gz
- **monitor-set-f4ac5c0d.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/f4ac5c0d-4243-4afe-9c28-b9e5fe051b92/20231218_000118/monitor-set-f4ac5c0d.tar.gz
- **parallel-timelines-report-f4ac5c0d.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/f4ac5c0d-4243-4afe-9c28-b9e5fe051b92/20231218_000118/parallel-timelines-report-f4ac5c0d.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/enterprise-2024.1/job/longevity/job/longevity-10gb-3h-gce-test/8/)
[Argus](https://argus.scylladb.com/test/e86009b6-9ee0-4615-b111-9ef255f2b518/runs?additionalRuns[]=f4ac5c0d-4243-4afe-9c28-b9e5fe051b92)
vponomaryov commented 10 months ago

One more repro here:

Installation details

Kernel Version: 5.15.0-1045-aws
Scylla version (or git commit hash): 2023.1.2-20231001.646df23cc4b3 with build-id 367fcf1672d44f5cbddc88f946cf272e2551b85a
Target Scylla version (or git commit hash): 2024.1.0~rc2-20231217.f57117d9cfe3 with build-id 3a4d2dfe8ef4eef5454badb34d1710a5f36a859c

Cluster size: 6 nodes (i4i.2xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0bca3ed0b1ef2f2fa ami-001b0989b7cf54ded (aws: undefined_region)

Test: rolling-upgrade-multidc-test
Test id: 66b220bf-623f-460a-bd36-d6fa1dc496ba
Test name: enterprise-2024.1/rolling-upgrade/rolling-upgrade-multidc-test
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 66b220bf-623f-460a-bd36-d6fa1dc496ba`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=66b220bf-623f-460a-bd36-d6fa1dc496ba)
- Show all stored logs command: `$ hydra investigate show-logs 66b220bf-623f-460a-bd36-d6fa1dc496ba`

Logs:

- **db-cluster-66b220bf.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/66b220bf-623f-460a-bd36-d6fa1dc496ba/20231219_175710/db-cluster-66b220bf.tar.gz
- **sct-runner-events-66b220bf.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/66b220bf-623f-460a-bd36-d6fa1dc496ba/20231219_175710/sct-runner-events-66b220bf.tar.gz
- **sct-66b220bf.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/66b220bf-623f-460a-bd36-d6fa1dc496ba/20231219_175710/sct-66b220bf.log.tar.gz
- **loader-set-66b220bf.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/66b220bf-623f-460a-bd36-d6fa1dc496ba/20231219_175710/loader-set-66b220bf.tar.gz
- **monitor-set-66b220bf.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/66b220bf-623f-460a-bd36-d6fa1dc496ba/20231219_175710/monitor-set-66b220bf.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/enterprise-2024.1/job/rolling-upgrade/job/rolling-upgrade-multidc-test/8/)
[Argus](https://argus.scylladb.com/test/1d5997c3-2ea5-47ca-bba5-c4edef06979a/runs?additionalRuns[]=66b220bf-623f-460a-bd36-d6fa1dc496ba)
fruch commented 10 months ago

The last report from @vponomaryov is something that is reproducible; running it now with c-s debug enabled:

https://argus.scylladb.com/test/7748483c-dd44-4306-9e69-97c97214f482/runs?additionalRuns[]=a242eed1-bb86-48e4-b27b-aec40cbb23db

roydahan commented 10 months ago

@avelanarius / @Bouncheck let's try to at least triage it ASAP. We have a fairly easy way to reproduce it; let's try to understand where the issue is and whether it's a recent regression or not.

Bouncheck commented 9 months ago

I haven't understood the root cause yet; however, I have a modification in mind that may help ensure that shutdown happens. I'll consult @avelanarius first.

avelanarius commented 9 months ago

I relaunched the Jenkins reproducer with DEBUG logs: https://jenkins.scylladb.com/job/enterprise-2024.1/job/reproducers/job/rolling-upgrade-multidc-test/5/. The run mentioned here:

The last report from @vponomaryov is something that is reproducible; running it now with c-s debug enabled: https://argus.scylladb.com/test/7748483c-dd44-4306-9e69-97c97214f482/runs?additionalRuns[]=a242eed1-bb86-48e4-b27b-aec40cbb23db

didn't yield useful logs (we're not even sure that the actual problem reproduced...)

avelanarius commented 9 months ago

@Bouncheck I see that the latest run reproduced the problem correctly and now there are good DEBUG logs to try to understand the situation - file cassandra-stress-read-l0-c0-k1-2494aaab-4173-4ff4-81e0-1b805462a533.log.

and as discussed with @roydahan, please try to arrange a quick call with @fruch to make sure we're on the same page (and that we know how to conveniently run the reproducer locally etc.)

avelanarius commented 9 months ago

(and as I understand it, even though this is not a P0, it affects a lot of Enterprise CI jobs, so it could become urgent soon)

roydahan commented 9 months ago

@Bouncheck any progress on this one?

Bouncheck commented 9 months ago

Nothing definitive yet. I'm currently considering two possibilities: something is not being closed properly, and/or waiting for all writes to finish (with failures) after initiating a graceful shutdown simply takes too long.

In the logs of the first example I've seen that one of the cassandra-stress runs never outputs the "END" or "FAILURE" string at any point, even though the logged times go past the set duration of the stress command. In the same run there is also a single NPE which happens when the driver tries to close one connection pool at the very start. It looks like a race condition, because the code it points to checks for null before calling the method. I don't yet know whether it causes anything to hang or is benign.

In the logs of the second example I can see that all the cassandra-stress runs technically end, but there is a lot of driver spam even after that. This is probably because all writes started before shutdown need to finish. In that specific case it's hard to tell whether it happens in a reasonable timeframe, due to the lack of timestamps. In some logs I've seen that the shutdown can take several (e.g. 5) minutes.
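For illustration of the graceful-vs-forced distinction above: a minimal sketch of how a client could bound the 3.x driver shutdown instead of waiting indefinitely for in-flight requests to drain. Only the `Cluster.closeAsync()` / `CloseFuture` calls are the real driver API; the helper class, method name and timeout handling are hypothetical, and this is not a claim about what cassandra-stress actually does today.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import com.datastax.driver.core.CloseFuture;
import com.datastax.driver.core.Cluster;

public final class BoundedShutdown {

    /**
     * Close the Cluster gracefully, but never wait longer than the given
     * timeout for in-flight requests to drain; after that, force-close.
     */
    public static void closeWithTimeout(Cluster cluster, long timeout, TimeUnit unit) {
        CloseFuture closeFuture = cluster.closeAsync(); // starts a graceful close
        try {
            closeFuture.get(timeout, unit);             // bounded wait for the drain
        } catch (TimeoutException e) {
            // Graceful close is taking too long (e.g. a dead node with pending
            // writes): force() aborts the remaining in-flight requests.
            closeFuture.force();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            closeFuture.force();
        } catch (ExecutionException e) {
            throw new RuntimeException("Cluster close failed", e);
        }
    }
}
```

A caller would then use something like `closeWithTimeout(cluster, 2, TimeUnit.MINUTES)` in place of a plain `cluster.close()`.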

In the logs of the last run I can see in the Events tab in Argus that the test failed due to a timeout:

2024-01-08 19:47:17.500: (TestTimeoutEvent Severity.CRITICAL) period_type=not-set event_id=6b125f8c-60e1-45dc-ad15-6c21b0d15b84, Test started at 2024-01-08 13:47:17, reached it's timeout (360 minute)

and, it seems, even a c-s-related error:

2024-01-08 19:48:09.758: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=c15c384e-18fb-4a72-acde-492d399ec0d1, source=UpgradeTest.test_generic_cluster_upgrade (upgrade_test.UpgradeTest)() message=Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/upgrade_test.py", line 985, in test_generic_cluster_upgrade
self._run_stress_workload("stress_after_cluster_upgrade", wait_for_finish=True)
File "/home/ubuntu/scylla-cluster-tests/upgrade_test.py", line 911, in _run_stress_workload
self.verify_stress_thread(thread_pool)
File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 2091, in verify_stress_thread
results, errors = cs_thread_pool.verify_results()
File "/home/ubuntu/scylla-cluster-tests/sdcm/stress_thread.py", line 379, in verify_results
results = super().get_results()
File "/home/ubuntu/scylla-cluster-tests/sdcm/stress/base.py", line 88, in get_results
for future in concurrent.futures.as_completed(self.results_futures, timeout=timeout):
File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 245, in as_completed
waiter.event.wait(wait_timeout)
File "/usr/local/lib/python3.10/threading.py", line 607, in wait
signaled = self._cond.wait(timeout)
File "/usr/local/lib/python3.10/threading.py", line 324, in wait
gotit = waiter.acquire(True, timeout)
File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 263, in critical_failure_handler
raise CriticalTestFailure("Critical Error has failed the test")  # pylint: disable=raise-missing-from
sdcm.tester.CriticalTestFailure: Critical Error has failed the test

But all the c-s logs from the loader-set do have the "END" message, and the timestamps do not match the timeout. There is one suspicious c-s log (cassandra-stress-read-l1-c0-k1-bffede14-8924-4da1-9644-3876601c74e9.log) which does not end with the "freed threads" lines. It looks like it ends abruptly, but its last timestamped line has timestamp 16:10:57,050, while the timeout reported in Argus happens at 19:47:17.500. I am assuming the clocks are not off by more than an hour, because the earliest stress logs I've found were within minutes of the start of the whole test. The times also match the read and write charts. I am not sure how to explain this one.

avelanarius commented 8 months ago

To expedite fixing the problem, @Bouncheck will try:

  • Rerunning the job with higher timeout (rather than the current +5%)
  • Maybe adding even more logging (with custom driver)
  • Maybe running tcpdump (in one run we saw one connection still having a lot of inflight requests which we don't know where they originated from)
  • Maybe running many runs in parallel to make sure our iteration time is faster and we hit the problem more often

@roydahan fyi: @Bouncheck was/is in contact with @fruch to figure out the problem and ask him about SCT/this issue

roydahan commented 8 months ago

@fruch / @Bouncheck I think that longevity-10gb-3h-gce-test is one of the jobs that reproduces it consistently. @fruch maybe we can try to reproduce it with a shorter duration.

tcpdump of the entire test is not feasible in terms of size, so we need to find the correct spot to start it and end it.

fruch commented 8 months ago

To expedite fixing the problem, @Bouncheck will try:

  • Rerunning the job with higher timeout (rather than the current +5%)

this can be changed in: https://github.com/scylladb/scylla-cluster-tests/blob/master/sdcm/stress_thread.py#L365

  • Maybe adding even more logging (with custom driver)

Supply us a fork/branch with those, and we can build a Docker image that we can use for reproducers of this issue (see also the QueryLogger sketch at the end of this comment).

  • Maybe running tcpdump (in one run we saw one connection still having a lot of inflight requests which we don't know where they originated from)
  • Maybe running many runs in parallel to make sure our iteration time is faster and we hit the problem more often

The cases where it happens the most are longevity-10gb-3h-azure-test and longevity-10gb-3h-gce-test.

I don't mind changing the c-s image in those two until we can figure this out.

@roydahan fyi: @Bouncheck was/is in contact with @fruch to figure out the problem and ask him about SCT/this issue
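As a possibly quicker complement to the extra logging in a custom driver build, the stock 3.x driver already ships a `QueryLogger` that can be registered on the `Cluster` to log slow and failed requests (under the `com.datastax.driver.core.QueryLogger` loggers, at DEBUG level). A small sketch; the helper class name and the thresholds are arbitrary, only the `QueryLogger` builder and `Cluster.register` calls are the stock API:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.QueryLogger;

public final class DriverDebugLogging {

    /**
     * Attach the driver's built-in QueryLogger so slow and failed requests are
     * logged (enable DEBUG for com.datastax.driver.core.QueryLogger.* in logback).
     */
    public static Cluster withQueryLogging(Cluster.Builder builder) {
        Cluster cluster = builder.build();
        QueryLogger queryLogger = QueryLogger.builder()
                .withConstantThreshold(500)       // queries slower than 500 ms count as "slow"
                .withMaxQueryStringLength(256)    // truncate long query strings in the log
                .build();
        cluster.register(queryLogger);            // QueryLogger is a LatencyTracker
        return cluster;
    }
}
```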

Bouncheck commented 8 months ago

I have trouble understanding some of the logs. For example, in this run: https://jenkins.scylladb.com/job/scylla-5.4/job/longevity/job/longevity-10gb-3h-gce-test/9/ in loader-set-3e9f3337/cassandra-stress-write-l0-c0-k1-d967395a-8701-4848-9d4c-5225146c1a96.log I can see the following lines:

19:51:31,810 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@76a3e297 - Registering current configuration as safe fallback point

******************** Stress Settings ********************
Command:
  Type: write
  Count: -1
  Duration: 180 MINUTES
  No Warmup: true
  Consistency Level: QUORUM
[...]
  WARN  19:53:44,446 Error creating netty channel to longevity-10gb-3h-5-4-db-node-3e9f3337-0-6.c.sct-project-1.internal/10.142.0.108:9042
[...]
  Total operation time      : 03:00:00
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-4-db-node-3e9f3337-0-1.c.sct-project-1.internal/10.142.0.82:9042] Connection has been closed

com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-4-db-node-3e9f3337-0-1.c.sct-project-1.internal/10.142.0.82:9042] Connection has been closed
END
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-4-db-node-3e9f3337-0-1.c.sct-project-1.internal/10.142.0.82:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-4-db-node-3e9f3337-0-1.c.sct-project-1.internal/10.142.0.82:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-4-db-node-3e9f3337-0-1.c.sct-project-1.internal/10.142.0.82:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-4-db-node-3e9f3337-0-1.c.sct-project-1.internal/10.142.0.82:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-4-db-node-3e9f3337-0-1.c.sct-project-1.internal/10.142.0.82:9042] Connection has been closed
WARN  20:42:09,958 Error creating netty channel to longevity-10gb-3h-5-4-db-node-3e9f3337-0-1.c.sct-project-1.internal/10.142.0.82:9042

[...] stands in for all the in-between logs. What confuses me about these logs is that the stress command, which seems to be configured for 180 minutes and reports a 3h total running time at the end, is enclosed between timestamps that are less than 1h apart, assuming the timestamps are correct. Is the total running time something different from how long the command runs? Is there an easy way to confirm at what times the cluster was under load?

fruch commented 8 months ago

I have trouble understanding some of the logs. For example, in this run: https://jenkins.scylladb.com/job/scylla-5.4/job/longevity/job/longevity-10gb-3h-gce-test/9/ [...] Is the total running time something different from how long the command runs? Is there an easy way to confirm at what times the cluster was under load?

Yes, downloading the monitoring data and looking at the metrics would help with that:

hydra investigate show-monitor f4ac5c0d-4243-4afe-9c28-b9e5fe051b92

You just need to find the relevant test_id for the run you want to check out

roydahan commented 6 months ago

@Bouncheck have you made any progress on that? Next week could be a good opportunity to sit together with someone from @fruch's team (e.g. @soyacz) and try debugging it together.

Bouncheck commented 6 months ago

Not really, but I'm not out of options yet. Sounds like a good idea, I'll reach out.

fruch commented 1 month ago

@dkropachev

Any idea how to push this forward? It's happening on an almost daily basis in the SCT short longevity runs.

dkropachev commented 1 month ago

@fruch, we will take care of it next week.

juliayakovlev commented 2 weeks ago

Reproduced with 2024.3.0~dev-20241007


Packages

Scylla version: 2024.3.0~dev-20241007.887201968d48 with build-id aadf81f92d9cf1e67f821457d46de618e97c8969

Kernel Version: 6.8.0-1016-aws


Installation details

Cluster size: 3 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-005d075bf3a6cdfd3 (aws: undefined_region)

Test: scylla-enterprise-perf-regression-predefined-throughput-steps-vnodes
Test id: 4a856b2e-8a6c-4195-93c3-b7ef2881fb1f
Test name: scylla-enterprise/perf-regression/scylla-enterprise-perf-regression-predefined-throughput-steps-vnodes
Test method: performance_regression_gradual_grow_throughput.PerformanceRegressionPredefinedStepsTest.test_mixed_gradual_increase_load
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 4a856b2e-8a6c-4195-93c3-b7ef2881fb1f`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=4a856b2e-8a6c-4195-93c3-b7ef2881fb1f)
- Show all stored logs command: `$ hydra investigate show-logs 4a856b2e-8a6c-4195-93c3-b7ef2881fb1f`

Logs:

- **db-cluster-4a856b2e.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/4a856b2e-8a6c-4195-93c3-b7ef2881fb1f/20241008_181537/db-cluster-4a856b2e.tar.gz
- **sct-runner-events-4a856b2e.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/4a856b2e-8a6c-4195-93c3-b7ef2881fb1f/20241008_181537/sct-runner-events-4a856b2e.tar.gz
- **sct-4a856b2e.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/4a856b2e-8a6c-4195-93c3-b7ef2881fb1f/20241008_181537/sct-4a856b2e.log.tar.gz
- **loader-set-4a856b2e.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/4a856b2e-8a6c-4195-93c3-b7ef2881fb1f/20241008_181537/loader-set-4a856b2e.tar.gz
- **monitor-set-4a856b2e.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/4a856b2e-8a6c-4195-93c3-b7ef2881fb1f/20241008_181537/monitor-set-4a856b2e.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-enterprise/job/perf-regression/job/scylla-enterprise-perf-regression-predefined-throughput-steps-vnodes/9/)
[Argus](https://argus.scylladb.com/test/5c8777b4-9bf1-49bc-8b96-fa3426b05e86/runs?additionalRuns[]=4a856b2e-8a6c-4195-93c3-b7ef2881fb1f)
roydahan commented 2 weeks ago

@Bouncheck do we have any progress with this issue? (I think we have 2 additional ones assigned to you that we believe are caused by the same root cause.)

Bouncheck commented 2 weeks ago

I've found that the variation where c-s stops producing load is reproducible locally. The variation where c-s fails to shut down correctly and hangs indefinitely I haven't managed to reproduce locally with the dtest that was reported to fail.

I'm proceeding with trying to solve the first variation, but I'm not sure the root cause will be the same. One of the scenarios under which writes stop is a situation where the cluster gets overloaded, causing writes to fail 10 times (where 10 is the default number of retries). The conditions are that 1 of 3 nodes is down and queries require quorum consistency, but a quorum is still alive. Even if the third node is brought back, the driver will reconnect to it, but c-s will still not produce queries, so I'm looking into c-s itself to see if there is a flaw in its error handling.
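To help tell driver-level retries apart from cassandra-stress's own application-level retries (the 10 mentioned above appears to be c-s's retry setting; the driver's DefaultRetryPolicy retries a given request at most once), one diagnostic option is a delegating retry policy that logs every decision the driver makes. A sketch against the 3.x driver's `RetryPolicy` interface; the wrapper class itself is hypothetical:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.WriteType;
import com.datastax.driver.core.exceptions.DriverException;
import com.datastax.driver.core.policies.RetryPolicy;

/** Delegating RetryPolicy that logs every retry decision the driver makes. */
public final class LoggingRetryPolicy implements RetryPolicy {

    private final RetryPolicy delegate;

    public LoggingRetryPolicy(RetryPolicy delegate) {
        this.delegate = delegate;
    }

    @Override
    public RetryDecision onReadTimeout(Statement stmt, ConsistencyLevel cl,
                                       int required, int received, boolean dataRetrieved, int nbRetry) {
        RetryDecision d = delegate.onReadTimeout(stmt, cl, required, received, dataRetrieved, nbRetry);
        System.err.printf("read timeout, cl=%s retry#%d -> %s%n", cl, nbRetry, d);
        return d;
    }

    @Override
    public RetryDecision onWriteTimeout(Statement stmt, ConsistencyLevel cl, WriteType writeType,
                                        int requiredAcks, int receivedAcks, int nbRetry) {
        RetryDecision d = delegate.onWriteTimeout(stmt, cl, writeType, requiredAcks, receivedAcks, nbRetry);
        System.err.printf("write timeout, cl=%s type=%s retry#%d -> %s%n", cl, writeType, nbRetry, d);
        return d;
    }

    @Override
    public RetryDecision onUnavailable(Statement stmt, ConsistencyLevel cl,
                                       int requiredReplica, int aliveReplica, int nbRetry) {
        RetryDecision d = delegate.onUnavailable(stmt, cl, requiredReplica, aliveReplica, nbRetry);
        System.err.printf("unavailable, cl=%s alive=%d/%d retry#%d -> %s%n",
                cl, aliveReplica, requiredReplica, nbRetry, d);
        return d;
    }

    @Override
    public RetryDecision onRequestError(Statement stmt, ConsistencyLevel cl, DriverException e, int nbRetry) {
        RetryDecision d = delegate.onRequestError(stmt, cl, e, nbRetry);
        System.err.printf("request error %s, retry#%d -> %s%n", e.getClass().getSimpleName(), nbRetry, d);
        return d;
    }

    @Override
    public void init(Cluster cluster) { delegate.init(cluster); }

    @Override
    public void close() { delegate.close(); }
}
```

It could be installed with `Cluster.builder()...withRetryPolicy(new LoggingRetryPolicy(DefaultRetryPolicy.INSTANCE))`.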

Bouncheck commented 1 day ago

I think this one is not actually caused by the java-driver but by something on the SCT side, unless I'm missing something. This would also explain why it seems impossible to reproduce locally. Since I couldn't reproduce it, I've been working with a copy of longevity-10gb-3h-gce-test on Jenkins, which seems to fail consistently. Every time I manually inspected the loader nodes, the stress process inside the Docker container would end successfully without much delay. The process would no longer be listed within a minute past the duration when running jps, ps aux as root, or top. I've added an echo of the exit code right after the stress command, and it would output 0 to a file every time. In the runs that fail, the Docker container remains alive after c-s finishes, whereas when the test ends successfully the Docker container is gone rather quickly. I'm not sure which component fails there (if any). I'm working on pinpointing the cause.

soyacz commented 15 hours ago

I tried to investigate with @Bouncheck and we came to these conclusions:

  1. SCT has a delay in returning from an ended process when there are a lot of lines in stderr (around 10 minutes), which is enough to fail with a soft timeout. E.g. in test 4f4905a4-35a0-4b29-b32b-e5b9b4a1eb95 we can see SCT got log lines:
    < t:2024-10-27 10:32:01,796 f:base.py         l:231  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.142.0.60>: WARN  10:21:19,793 Error creating netty channel to longevity-10gb-3h-6-2-modi-db-node-4f4905a4-0-5.c.sct-project-1.internal/10.142.0.55:9042

    It also looks like c-s printed these errors after the test ended, as the c-s logs from the loader also look delayed (a log line from 10:04 appeared after the END message, while c-s started around 10:02). We need to investigate this issue in the SCT code.

  2. c-s has a problem reconnecting to a node that went through multiple hard-reboot nemeses: it can't connect to it for the rest of the test afterwards. This will be investigated further by @Bouncheck (see the sketch below).
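Regarding point 2, the sketch referenced above: one way to see whether the driver ever marks the rebooted node as up again (while c-s still refuses to use it) is to log the driver's own view of host state transitions. A minimal sketch using the 3.x driver's `Host.StateListener`; the class name is made up, the listener interface and `Cluster.register` are the stock API:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Host;

/** Logs every host state change the driver observes, to correlate with nemesis reboots. */
public final class HostStateLogger implements Host.StateListener {

    @Override public void onAdd(Host host)    { log("ADD", host); }
    @Override public void onUp(Host host)     { log("UP", host); }
    @Override public void onDown(Host host)   { log("DOWN", host); }
    @Override public void onRemove(Host host) { log("REMOVE", host); }

    @Override public void onRegister(Cluster cluster)   { }
    @Override public void onUnregister(Cluster cluster) { }

    private static void log(String event, Host host) {
        System.err.printf("host %s -> %s (driver state: %s)%n",
                host.getSocketAddress(), event, host.getState());
    }
}
```

It would be registered once with `cluster.register(new HostStateLogger())` before starting the load.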
soyacz commented 14 hours ago

One of my culprit candidates is that there's too much load on the sct-runner and it can't keep up with log collection in real time ([CPU usage screenshot]). Basically, with Python bound to 1 CPU, on a 4-core machine we can reach at most 25%, and we are above that (maybe we can go a bit higher than 25% because we use more sub-processes for log parsing). In other tests the sct-runner is mostly idle ([example screenshot from the sanity 4h test]).

vponomaryov commented 13 hours ago

@soyacz

All of this sounds to me like we should have log processing in some performance-oriented language like Rust or Go.

Because this is not the first problem related to the high load of log processing in Python.

So: first step, look for existing solutions; second, try one out; and third, probably, build our own if none of the existing ones suits our needs.

@fruch, @roydahan WDYT?

fruch commented 12 hours ago

@soyacz

All of this sounds to me like we should have log processing in some performance-oriented language like Rust or Go. [...] @fruch, @roydahan WDYT?

I'm not yet convinced it's the processing at the end of the test; since we are running it via the docker command line over ssh, that might slow us down as well.

Anyhow, we don't expect the stress tool to be printing errors at that high a rate; it doesn't help us at all.

Also, once we are done with the stress and are stopping the session, we don't care about any of those errors anymore.