scylladb / scylla-cluster-tests

Tests for Scylla Clusters

AddRemoveDc caused a cassandra stress timeout #4300

Closed: ShlomiBalalis closed this issue 2 years ago

ShlomiBalalis commented 2 years ago

Prerequisites

Versions

Description

At 2021-12-31 16:34:38.952 an AddRemoveDc nemesis began. At first, there were no apparent issues. At 2021-12-31 16:57:47,997, as part of the nemesis, a repair began on each of the nodes:

2021-12-31 16:57:47,997 f:remote_base.py  l:520  c:RemoteCmdRunner      p:DEBUG > Running command "/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair -pr "...
2021-12-31 16:59:59,880 f:base.py         l:142  c:RemoteCmdRunner      p:DEBUG > Command "/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair -pr " finished with status 0
2021-12-31 16:59:59,882 f:cluster.py      l:2565 c:sdcm.cluster_aws     p:DEBUG > Node longevity-200gb-48h-verify-limited--db-node-e3f95d0c-1 [18.202.236.169 | 10.0.0.242] (seed: True): Command '/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair -pr ' duration -> 131.8823548979999 s

...

2021-12-31 16:59:59,891 f:remote_base.py  l:520  c:RemoteCmdRunner      p:DEBUG > Running command "/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair -pr "...
2021-12-31 17:01:56,094 f:base.py         l:142  c:RemoteCmdRunner      p:DEBUG > Command "/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair -pr " finished with status 0
2021-12-31 17:01:56,096 f:cluster.py      l:2565 c:sdcm.cluster_aws     p:DEBUG > Node longevity-200gb-48h-verify-limited--db-node-e3f95d0c-2 [3.251.68.137 | 10.0.3.96] (seed: False): Command '/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair -pr ' duration -> 116.20317327899829 s

...

2021-12-31 17:01:56,097 f:remote_base.py  l:520  c:RemoteCmdRunner      p:DEBUG > Running command "/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair -pr "...
2021-12-31 17:03:33,073 f:base.py         l:142  c:RemoteCmdRunner      p:DEBUG > Command "/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair -pr " finished with status 0
2021-12-31 17:03:33,074 f:cluster.py      l:2565 c:sdcm.cluster_aws     p:DEBUG > Node longevity-200gb-48h-verify-limited--db-node-e3f95d0c-4 [54.195.151.123 | 10.0.1.17] (seed: False): Command '/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair -pr ' duration -> 96.9755406300028 s

...

2021-12-31 17:03:33,082 f:remote_base.py  l:520  c:RemoteCmdRunner      p:DEBUG > Running command "/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair -pr "...
2021-12-31 17:04:30,245 f:base.py         l:142  c:RemoteCmdRunner      p:DEBUG > Command "/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair -pr " finished with status 0
2021-12-31 17:04:30,245 f:cluster.py      l:2565 c:sdcm.cluster_aws     p:DEBUG > Node longevity-200gb-48h-verify-limited--db-node-e3f95d0c-5 [3.250.5.131 | 10.0.2.150] (seed: False): Command '/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair -pr ' duration -> 57.072066680004355 s

...

2021-12-31 17:04:30,345 f:remote_base.py  l:520  c:RemoteCmdRunner      p:DEBUG > Running command "/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair -pr "...
2021-12-31 17:06:14,366 f:base.py         l:142  c:RemoteCmdRunner      p:DEBUG > Command "/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair -pr " finished with status 0
2021-12-31 17:06:14,367 f:cluster.py      l:2565 c:sdcm.cluster_aws     p:DEBUG > Node longevity-200gb-48h-verify-limited--db-node-e3f95d0c-6 [54.171.151.39 | 10.0.3.38] (seed: False): Command '/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair -pr ' duration -> 104.01899398000387 s

During the repair of node #5, the longevity test's cassandra-stress thread started to time out, with no apparent cause visible in the nodes' logs:

2021-12-31 17:03:22.883: (CassandraStressLogEvent Severity.CRITICAL) period_type=one-time event_id=fbf613be-0782-46a1-9acf-cf08500e3784: type=OperationOnKey regex=Operation x10 on key\(s\) \[ line_number=451101 node=Node longevity-200gb-48h-verify-limited--loader-node-e3f95d0c-1 [3.250.35.142 | 10.0.3.216] (seed: False)
java.io.IOException: Operation x10 on key(s) [30353232394d4f354c30]: Error executing: (OperationTimedOutException): [10.0.0.242/10.0.0.242:9042] Timed out waiting for server response
2021-12-31 17:03:24.481: (CassandraStressLogEvent Severity.CRITICAL) period_type=one-time event_id=fbf613be-0782-46a1-9acf-cf08500e3784: type=OperationOnKey regex=Operation x10 on key\(s\) \[ line_number=451147 node=Node longevity-200gb-48h-verify-limited--loader-node-e3f95d0c-1 [3.250.35.142 | 10.0.3.216] (seed: False)
java.io.IOException: Operation x10 on key(s) [4d4d3032313450363630]: Error executing: (OperationTimedOutException): [10.0.0.242/10.0.0.242:9042] Timed out waiting for server response
2021-12-31 17:03:24.691: (CassandraStressEvent Severity.CRITICAL) period_type=end event_id=faeef4ed-b324-41e7-9e1b-579b1452f95e duration=11h6m29s: node=Node longevity-200gb-48h-verify-limited--loader-node-e3f95d0c-1 [3.250.35.142 | 10.0.3.216] (seed: False)
stress_cmd=cassandra-stress write cl=QUORUM duration=2860m  -schema 'replication(factor=3) compaction(strategy=LeveledCompactionStrategy)' -port jmx=6868 -mode cql3 native -rate threads=250  -col 'size=FIXED(1024) n=FIXED(1)' -pop seq=400200300..600200300 -log interval=15
errors:

Stress command completed with bad status -1: Failed to connect over JMX; not collecting these stats
Failed to connect over JMX; not collecting th
2021-12-31 17:03:24.692: (CassandraStressLogEvent Severity.CRITICAL) period_type=one-time event_id=fbf613be-0782-46a1-9acf-cf08500e3784: type=OperationOnKey regex=Operation x10 on key\(s\) \[ line_number=451157 node=Node longevity-200gb-48h-verify-limited--loader-node-e3f95d0c-1 [3.250.35.142 | 10.0.3.216] (seed: False)
java.io.IOException: Operation x10 on key(s) [394e5032363534323431]: Error executing: (OperationTimedOutException): [10.0.0.242/10.0.0.242:9042] Timed out waiting for server response
2021-12-31 17:03:56.790: (CassandraStressLogEvent Severity.CRITICAL) period_type=one-time event_id=fbf613be-0782-46a1-9acf-cf08500e3784: type=OperationOnKey regex=Operation x10 on key\(s\) \[ line_number=451994 node=Node longevity-200gb-48h-verify-limited--loader-node-e3f95d0c-1 [3.250.35.142 | 10.0.3.216] (seed: False)
java.io.IOException: Operation x10 on key(s) [3138313430384b503330]: Error executing: (OperationTimedOutException): [10.0.3.96/10.0.3.96:9042] Timed out waiting for server response
2021-12-31 17:03:56.895: (CassandraStressLogEvent Severity.CRITICAL) period_type=one-time event_id=fbf613be-0782-46a1-9acf-cf08500e3784: type=OperationOnKey regex=Operation x10 on key\(s\) \[ line_number=452002 node=Node longevity-200gb-48h-verify-limited--loader-node-e3f95d0c-1 [3.250.35.142 | 10.0.3.216] (seed: False)
java.io.IOException: Operation x10 on key(s) [39314e31323438374f30]: Error executing: (OperationTimedOutException): [/10.0.2.150:9042] Timed out waiting for server response
2021-12-31 17:03:56.902: (CassandraStressLogEvent Severity.CRITICAL) period_type=one-time event_id=fbf613be-0782-46a1-9acf-cf08500e3784: type=OperationOnKey regex=Operation x10 on key\(s\) \[ line_number=452010 node=Node longevity-200gb-48h-verify-limited--loader-node-e3f95d0c-1 [3.250.35.142 | 10.0.3.216] (seed: False)
java.io.IOException: Operation x10 on key(s) [374b313436364f363330]: Error executing: (OperationTimedOutException): [10.0.3.96/10.0.3.96:9042] Timed out waiting for server response
2021-12-31 17:03:57.198: (CassandraStressLogEvent Severity.CRITICAL) period_type=one-time event_id=fbf613be-0782-46a1-9acf-cf08500e3784: type=OperationOnKey regex=Operation x10 on key\(s\) \[ line_number=452021 node=Node longevity-200gb-48h-verify-limited--loader-node-e3f95d0c-1 [3.250.35.142 | 10.0.3.216] (seed: False)
java.io.IOException: Operation x10 on key(s) [4f30374b34384e4f3430]: Error executing: (OperationTimedOutException): [10.0.1.17/10.0.1.17:9042] Timed out waiting for server response
2021-12-31 17:04:15.652: (CassandraStressLogEvent Severity.CRITICAL) period_type=one-time event_id=fbf613be-0782-46a1-9acf-cf08500e3784: type=OperationOnKey regex=Operation x10 on key\(s\) \[ line_number=452514 node=Node longevity-200gb-48h-verify-limited--loader-node-e3f95d0c-1 [3.250.35.142 | 10.0.3.216] (seed: False)
java.io.IOException: Operation x10 on key(s) [4b3334343931334e3930]: Error executing: (OperationTimedOutException): [/10.0.2.150:9042] Timed out waiting for server response
2021-12-31 17:04:18.651: (CassandraStressLogEvent Severity.CRITICAL) period_type=one-time event_id=fbf613be-0782-46a1-9acf-cf08500e3784: type=OperationOnKey regex=Operation x10 on key\(s\) \[ line_number=452623 node=Node longevity-200gb-48h-verify-limited--loader-node-e3f95d0c-1 [3.250.35.142 | 10.0.3.216] (seed: False)
java.io.IOException: Operation x10 on key(s) [3232364e373030374e30]: Error executing: (TransportException): [10.0.3.96/10.0.3.96:9042] Connection has been closed
2021-12-31 17:04:18.853: (CassandraStressLogEvent Severity.CRITICAL) period_type=one-time event_id=fbf613be-0782-46a1-9acf-cf08500e3784: type=OperationOnKey regex=Operation x10 on key\(s\) \[ line_number=452631 node=Node longevity-200gb-48h-verify-limited--loader-node-e3f95d0c-1 [3.250.35.142 | 10.0.3.216] (seed: False)
java.io.IOException: Operation x10 on key(s) [503235334c33354b4c30]: Error executing: (TransportException): [10.0.3.96/10.0.3.96:9042] Connection has been closed
2021-12-31 17:04:18.856: (CassandraStressLogEvent Severity.CRITICAL) period_type=one-time event_id=fbf613be-0782-46a1-9acf-cf08500e3784: type=OperationOnKey regex=Operation x10 on key\(s\) \[ line_number=452639 node=Node longevity-200gb-48h-verify-limited--loader-node-e3f95d0c-1 [3.250.35.142 | 10.0.3.216] (seed: False)
soyacz commented 2 years ago

@ShlomiBalalis I don't see a problem with this nemesis. To me it looks like some bug in Scylla, similar to: https://github.com/scylladb/scylla-manager/issues/2830. In the loader logs, the errors start at 17:00:17.123, so there is about a 3-minute delay between the SCT logs and the loader logs. I've also looked at the db logs - strangely, at the given times there's no information about a repair happening on nodes 1 and 3 (I only checked 1, 2 and 3) - shouldn't there be some sign of the repair?

dkropachev commented 2 years ago

This is not a Scylla issue; it is the result of the following:

< t:2022-01-11 10:12:08,615 f:base.py         l:142  c:RemoteCmdRunner      p:DEBUG > Command "cqlsh --no-color   --request-timeout=120 --connect-timeout=60  -e "describe system_auth" 10.0.0.159 9042" finished with status 0
< t:2022-01-11 10:12:08,615 f:remote_base.py  l:520  c:RemoteCmdRunner      p:DEBUG > Running command "cqlsh --no-color   --request-timeout=120 --connect-timeout=60  -e "ALTER KEYSPACE system_auth WITH replication = {'class': 'NetworkTopologyStrategy', 'Using /etc/scylla/scylla.yaml as the config file': 0, 'eu-west': 3}" 10.0.0.159 9042"...
< t:2022-01-11 10:12:09,061 f:events_processes.py l:146  c:sdcm.sct_events.events_processes p:DEBUG > Get process `MainDevice' from EventsProcessesRegistry[lod_dir=/home/jenkins/slave/workspace/can_scylla-cluster-tests_PR-4310/scylla-cluster-tests/20220111-094703-906137,id=0x7fb99b458610,default=True]

Which is the result of garbage leaking out through get_nodetool_status.
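
To make the failure mode concrete, here is a minimal, hypothetical sketch (not the actual get_nodetool_status implementation) of how an unfiltered warning line in the nodetool status output, such as "Using /etc/scylla/scylla.yaml as the config file", can leak into the parsed datacenter list; the sample output and helper names are invented for illustration:

# Sketch only: demonstrates the parsing pitfall described above.
SAMPLE_NODETOOL_STATUS = """\
Using /etc/scylla/scylla.yaml as the config file
Datacenter: eu-west
====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens  Owns  Host ID                               Rack
UN  10.0.0.242  150.2 GB   256     ?     8d5ed9f4-7764-4dbd-bad8-43fddce94b7c  1a
"""

def naive_parse(output: str) -> dict:
    """Naive parse: any unrecognized line before the first 'Datacenter:' header
    is mistaken for a datacenter name, which is how the warning can end up
    inside the replication options of the ALTER KEYSPACE statement above."""
    datacenters, current_dc = {}, None
    for line in output.splitlines():
        if line.startswith("Datacenter:"):
            current_dc = line.split(":", 1)[1].strip()
            datacenters.setdefault(current_dc, [])
        elif current_dc is None and line.strip():
            datacenters.setdefault(line.strip(), [])   # <-- the bug
        elif line[:2] in ("UN", "DN", "UJ", "DJ", "UL", "DL") and current_dc:
            datacenters[current_dc].append(line.split()[1])
    return datacenters

def filtered_parse(output: str) -> dict:
    """Defensive parse: only 'Datacenter:' headers may introduce a datacenter."""
    datacenters, current_dc = {}, None
    for line in output.splitlines():
        if line.startswith("Datacenter:"):
            current_dc = line.split(":", 1)[1].strip()
            datacenters.setdefault(current_dc, [])
        elif line[:2] in ("UN", "DN", "UJ", "DJ", "UL", "DL") and current_dc:
            datacenters[current_dc].append(line.split()[1])
    return datacenters

print(naive_parse(SAMPLE_NODETOOL_STATUS))
# {'Using /etc/scylla/scylla.yaml as the config file': [], 'eu-west': ['10.0.0.242']}
print(filtered_parse(SAMPLE_NODETOOL_STATUS))
# {'eu-west': ['10.0.0.242']}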

soyacz commented 2 years ago

"ALTER KEYSPACE system_auth WITH replication = {'class': 'NetworkTopologyStrategy', 'Using /etc/scylla/scylla.yaml as the config file': 0, 'eu-west': 3}"

Thanks @dkropachev, this is very helpful - now I see the issue. But it's strange that it passed Scylla's validation.
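
For illustration only (this is not the merged fix): a hypothetical pre-flight check on the SCT side that rejects replication options naming a datacenter the cluster does not actually have, so a leaked warning line could never reach the ALTER KEYSPACE statement. The helper name is invented:

# Sketch only; validate_replication_options is a hypothetical helper.
def validate_replication_options(replication: dict, known_datacenters: set) -> None:
    bogus = [dc for dc in replication if dc != "class" and dc not in known_datacenters]
    if bogus:
        raise ValueError(f"Unknown datacenter name(s) in replication options: {bogus}")

try:
    # The garbled options seen in the log above would be rejected here.
    validate_replication_options(
        {"class": "NetworkTopologyStrategy",
         "Using /etc/scylla/scylla.yaml as the config file": 0,
         "eu-west": 3},
        known_datacenters={"eu-west"},
    )
except ValueError as exc:
    print(exc)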

fruch commented 2 years ago

I thought they had removed those prints as part of fixing: https://github.com/scylladb/scylla-tools-java/issues/213

soyacz commented 2 years ago

My investigation led to the conclusion that when a keyspace such as system_traces uses SimpleStrategy and we switch it to NetworkTopologyStrategy with a changed replication factor, we must then run a full repair (as stated in https://docs.scylladb.com/operating-scylla/procedures/cluster-management/update-topology-strategy-from-simple-to-network/#nodes-are-on-different-racks), which the nemesis is not doing. Fixing.
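
A rough sketch of that order of operations, under the assumptions described above (run_on_node is a hypothetical stand-in for SCT's remote command runner; the keyspace, datacenter name and replication factor are illustrative):

from typing import Callable, Iterable

def switch_keyspace_to_network_topology(run_on_node: Callable[[str, str], None],
                                         nodes: Iterable[str],
                                         keyspace: str = "system_traces",
                                         dc_name: str = "eu-west",
                                         replication_factor: int = 3) -> None:
    nodes = list(nodes)
    alter = (f"ALTER KEYSPACE {keyspace} WITH replication = "
             f"{{'class': 'NetworkTopologyStrategy', '{dc_name}': {replication_factor}}}")
    # The ALTER itself only needs to run once, against any live node.
    run_on_node(nodes[0], f'cqlsh -e "{alter}" {nodes[0]} 9042')
    # Per the linked procedure, a repair must then be run on every node,
    # which is the step the nemesis was missing.
    for node in nodes:
        run_on_node(node, f"nodetool repair {keyspace}")

if __name__ == "__main__":
    # Toy runner that only prints the commands it would execute.
    switch_keyspace_to_network_topology(lambda node, cmd: print(f"[{node}] {cmd}"),
                                        nodes=["10.0.0.242", "10.0.3.96", "10.0.1.17"])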

fgelcer commented 2 years ago

@ShlomiBalalis, the fix was merged - could you please confirm that the issue is now fixed (and close this issue)?

fgelcer commented 2 years ago

ping @ShlomiBalalis

fgelcer commented 2 years ago

@ShlomiBalalis ?

ShlomiBalalis commented 2 years ago

Yeah, the nemesis passes now. Closing.