@ShlomiBalalis I don't see a problem with this nemesis. To me it looks like a bug in Scylla, similar to: https://github.com/scylladb/scylla-manager/issues/2830
In the loader logs, the errors start at 17:00:17.123, so there is about a 3-minute delay between the SCT logs and the loader logs.
I've also looked at the db logs. Strangely, at the given times there is no information about a repair happening on nodes 1 and 3 (I've only checked nodes 1, 2 and 3) - shouldn't there be some sign of the repair?
This is not a Scylla issue; it is the result of the following:
< t:2022-01-11 10:12:08,615 f:base.py l:142 c:RemoteCmdRunner p:DEBUG > Command "cqlsh --no-color --request-timeout=120 --connect-timeout=60 -e "describe system_auth" 10.0.0.159 9042" finished with status 0
< t:2022-01-11 10:12:08,615 f:remote_base.py l:520 c:RemoteCmdRunner p:DEBUG > Running command "cqlsh --no-color --request-timeout=120 --connect-timeout=60 -e "ALTER KEYSPACE system_auth WITH replication = {'class': 'NetworkTopologyStrategy', 'Using /etc/scylla/scylla.yaml as the config file': 0, 'eu-west': 3}" 10.0.0.159 9042"...
< t:2022-01-11 10:12:09,061 f:events_processes.py l:146 c:sdcm.sct_events.events_processes p:DEBUG > Get process `MainDevice' from EventsProcessesRegistry[lod_dir=/home/jenkins/slave/workspace/can_scylla-cluster-tests_PR-4310/scylla-cluster-tests/20220111-094703-906137,id=0x7fb99b458610,default=True]
Which is the result of garbage leaking out through get_nodetool_status:
"ALTER KEYSPACE system_auth WITH replication = {'class': 'NetworkTopologyStrategy', 'Using /etc/scylla/scylla.yaml as the config file': 0, 'eu-west': 3}"
Thanks @dkropachev, this is very helpful - now I see the issue. But it is strange that it passed Scylla's validation.
I thought they removed those prints as part of fixing: https://github.com/scylladb/scylla-tools-java/issues/213
My investigation led to the conclusion that when a keyspace such as system_traces uses SimpleStrategy and we switch it to NetworkTopologyStrategy with a changed replication factor, we must run a full repair (as stated in https://docs.scylladb.com/operating-scylla/procedures/cluster-management/update-topology-strategy-from-simple-to-network/#nodes-are-on-different-racks), which the nemesis is not doing. Fixing.
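For context, a minimal sketch of the procedure from the linked docs, assuming hypothetical node addresses and plain subprocess calls rather than the SCT remoter: after changing the replication strategy once, a full (not incremental) repair of the keyspace is run on every node.

```python
import subprocess

NODES = ["10.0.0.159", "10.0.1.12", "10.0.2.7"]  # hypothetical node IPs
ALTER = ("ALTER KEYSPACE system_auth WITH replication = "
         "{'class': 'NetworkTopologyStrategy', 'eu-west': 3}")

def run(cmd: list[str]) -> None:
    # Echo the command, then fail loudly if it returns a non-zero status.
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Change the replication strategy once, via any live node.
run(["cqlsh", NODES[0], "9042", "-e", ALTER])

# 2. Run a full repair of the keyspace on every node in the cluster.
for node in NODES:
    run(["nodetool", "-h", node, "repair", "-full", "system_auth"])
```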
@ShlomiBalalis, the fix was merged. Could you please confirm whether the issue is now fixed (and close this issue)?
ping @ShlomiBalalis
@ShlomiBalalis ?
Yeah, the nemesis passes now. Closing.
Prerequisites
Versions
SCT: master
scylla: master
test_id: e3f95d0c-bd9d-4e68-887e-3fb2374c597a
https://jenkins.scylladb.com/view/master/job/scylla-master/job/longevity/job/longevity-200gb-48h/158/
Logs:
Description
At 2021-12-31 16:34:38.952, an AddRemoveDc nemesis began. At first, there were no apparent issues. At 2021-12-31 16:57:47,997, as part of the nemesis, a repair began on each of the nodes:
...
...
...
...
During the repair of node#5, the longevity test's cassandra-stress thread started to time out, with no apparent cause visible in the nodes' logs: