scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
GNU Affero General Public License v3.0

READ cs-workload didn't recover after rollback/upgrade completed: Failed to create client too many times #4933

Closed amoskong closed 5 years ago

amoskong commented 5 years ago

Installation details
- Scylla version (or git commit hash): from 3.0.10-0.20190815.b3bfd8c08 to 3.1.0.rc4-0.20190829.d70c2db09
- Cluster size: 4
- OS (RHEL/CentOS/Ubuntu/AWS AMI): Ubuntu 16.04

< t:2019-08-30 11:26:58,996 f:upgrade_test.py l:439  c:sdcm.tester          p:INFO  > Upgrade Node rolling-upgrade-upgrade--ubuntu-xen-db-node-aa815381-0-3 begin
< t:2019-08-30 11:31:06,920 f:upgrade_test.py l:441  c:sdcm.tester          p:INFO  > Upgrade Node rolling-upgrade-upgrade--ubuntu-xen-db-node-aa815381-0-3 ended
< t:2019-08-30 11:55:09,075 f:upgrade_test.py l:464  c:sdcm.tester          p:INFO  > Upgrade Node rolling-upgrade-upgrade--ubuntu-xen-db-node-aa815381-0-1 begin
< t:2019-08-30 12:00:15,083 f:upgrade_test.py l:466  c:sdcm.tester          p:INFO  > Upgrade Node rolling-upgrade-upgrade--ubuntu-xen-db-node-aa815381-0-1 ended
< t:2019-08-30 12:04:50,255 f:upgrade_test.py l:478  c:sdcm.tester          p:INFO  > Rollback Node rolling-upgrade-upgrade--ubuntu-xen-db-node-aa815381-0-1 begin

< t:2019-08-30 12:05:23,321 f:cluster.py      l:1186 c:sdcm.cluster         p:DEBUG > 2019-08-30T12:05:22+00:00  rolling-upgrade-upgrade--ubuntu-xen-db-node-aa815381-0-2 !INFO    | scylla:  [shard 0] rpc - client 10.142.0.13: fail to connect: Connection refused

com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.142.0.13:9042 (com.datastax.driver.core.exceptions.TransportException: [/10.142.0.13:9042] Cannot connect))

< t:2019-08-30 12:05:24,288 f:remote.py       l:662  c:sdcm.remote          p:INFO  > RemoteCmdRunner [scylla-test@10.142.0.36]: Failed to create client too many times

< t:2019-08-30 12:06:27,623 f:upgrade_test.py l:480  c:sdcm.tester          p:INFO  > Rollback Node rolling-upgrade-upgrade--ubuntu-xen-db-node-aa815381-0-1 ended
< t:2019-08-30 12:06:37,485 f:upgrade_test.py l:485  c:sdcm.tester          p:INFO  > Upgrade Node rolling-upgrade-upgrade--ubuntu-xen-db-node-aa815381-0-1 begin
< t:2019-08-30 12:10:03,422 f:upgrade_test.py l:487  c:sdcm.tester          p:INFO  > Upgrade Node rolling-upgrade-upgrade--ubuntu-xen-db-node-aa815381-0-1 ended
< t:2019-08-30 12:10:13,206 f:upgrade_test.py l:485  c:sdcm.tester          p:INFO  > Upgrade Node rolling-upgrade-upgrade--ubuntu-xen-db-node-aa815381-0-4 begin
< t:2019-08-30 12:15:11,351 f:upgrade_test.py l:487  c:sdcm.tester          p:INFO  > Upgrade Node rolling-upgrade-upgrade--ubuntu-xen-db-node-aa815381-0-4 ended
< t:2019-08-30 12:15:18,638 f:upgrade_test.py l:485  c:sdcm.tester          p:INFO  > Upgrade Node rolling-upgrade-upgrade--ubuntu-xen-db-node-aa815381-0-2 begin
< t:2019-08-30 12:20:19,739 f:upgrade_test.py l:487  c:sdcm.tester          p:INFO  > Upgrade Node rolling-upgrade-upgrade--ubuntu-xen-db-node-aa815381-0-2 ended

Results:
Op rate                   :        0 op/s  []
Partition rate            :        0 pk/s  []
Row rate                  :        0 row/s []
Latency mean              :    0.0 ms []
Latency median            :    0.0 ms []
Latency 95th percentile   :    0.0 ms []
Latency 99th percentile   :    0.0 ms []
Latency 99.9th percentile :    0.0 ms []
Latency max               :    0.0 ms []
Total partitions          :          0 []
Total errors              :          0 []
Total GC count            : 0
Total GC memory           : 0.000 KiB
Total GC time             :    0.0 seconds
Avg GC time               :    NaN ms
StdDev GC time            :    0.0 ms
Total operation time      : 00:20:00

2019-08-30 12:25:26,489  FAILURE
java.lang.RuntimeException: Failed to execute stress action
        at org.apache.cassandra.stress.StressAction.run(StressAction.java:100)
        at org.apache.cassandra.stress.Stress.run(Stress.java:143)
        at org.apache.cassandra.stress.Stress.main(Stress.java:62)
amoskong commented 5 years ago

The seed node was drained for the rollback, and the READ workload was started at almost the same time.

Key timestamps:

2019-08-30 12:04:50,428  [scylla-test@10.142.0.13]: Running command "nodetool -u cassandra -pw cassandra  drain "...
- drain the seed node (which will be rolled back soon)

2019-08-30 12:04:51,608 f:sct_events.py   l:452  c:sdcm.sct_events      p:INFO  > stress_cmd=cassandra-stress read no-warmup cl=QUORUM duration=20m -schema keyspace=keyspace1 'replication(factor=3) compression=LZ4Compressor' -port jmx=6868 -mode cql3 native compression=lz4  user=cassandra password=cassandra -rate threads=1000 -pop seq=1..10000000 -log interval=5 -node 10.142.0.13
- starting read workload

2019-08-30 12:05:05,191   Running READ with 1000 threads 20 minutes
- workload started

2019-08-30 12:05:22,149  [scylla-test@10.142.0.13]  "sudo systemctl start scylla-server.service"...
- start the rolled-back node (node1, seed)
amoskong commented 5 years ago

Database log of the seed node (I didn't find any special error/exception on the seed node):

amoskong commented 5 years ago

The issue also occurred on Debian 9.

Scylla version: 3.1.0.rc5-0.20190902.623ea5e3d

roydahan commented 5 years ago

@amoskong does this happen only when the seed node is the one being rolled back and upgraded?

Maybe it's related to the fact that we run the c-s command with --node [IP_OF_SEED_NODE] and this node is the one being drained at the same time... The c-s needs this node up and running at least until it starts doing I/O.
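For illustration only (the extra IPs below are hypothetical, not taken from this run): cassandra-stress accepts a comma-separated host list in -node, so giving it more than one contact point would let the driver connect even while the seed is draining:

cassandra-stress read no-warmup cl=QUORUM duration=20m -mode cql3 native user=cassandra password=cassandra -rate threads=1000 -node 10.142.0.13,10.142.0.14,10.142.0.15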

amoskong commented 5 years ago

On Tue, Sep 3, 2019 at 9:16 PM Roy Dahan notifications@github.com wrote:

> @amoskong does this happen only when the seed node is the one being rolled back and upgraded?

Yes.

> Maybe it's related to the fact that we run the c-s command with --node [IP_OF_SEED_NODE] and this node is the one being drained at the same time...

Yes.

> The c-s needs this node up and running at least until it starts doing I/O.

Is it a real issue? Do we need to fix our test to avoid this situation?

roydahan commented 5 years ago

If this is the case, it's not a real issue. You need to wait for the c-s to start and only then start the rollback process (including the drain). If this solves the issue, we can close this one.
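A minimal sketch of that ordering, assuming a hypothetical helper that polls the cassandra-stress log for its "Running READ" banner (start_stress and rollback_node below are placeholders, not actual SCT functions):

import re
import time

# Banner cassandra-stress prints once its client is connected and issuing I/O,
# e.g. "Running READ with 1000 threads 20 minutes" (seen at 12:05:05 above).
STRESS_STARTED = re.compile(r"Running READ with \d+ threads")

def wait_for_stress_start(stress_log_path, timeout=300, poll_interval=5):
    """Block until the cassandra-stress log shows the READ workload has started."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with open(stress_log_path) as log:
                if STRESS_STARTED.search(log.read()):
                    return
        except FileNotFoundError:
            pass  # stress has not created its log file yet
        time.sleep(poll_interval)
    raise TimeoutError("cassandra-stress did not start within %d seconds" % timeout)

# Suggested test ordering (start_stress / rollback_node are hypothetical placeholders):
#   stress = start_stress(node="10.142.0.13")     # launch the READ workload
#   wait_for_stress_start(stress.log_path)        # wait until it is really doing I/O
#   rollback_node(seed_node)                      # only then drain and roll back the seed

Gating the drain on the stress log (rather than a fixed sleep) keeps the test from racing the driver's connection bootstrap.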

slivne commented 5 years ago

@roydahan / @amoskong can we close this?

amoskong commented 5 years ago

I couldn't reproduce it after waiting a while before starting the rollback, so we can close this ticket. I can reopen it if I see it again in the future.