Closed fruch closed 11 months ago
@fruch is there a priority label or triage label in this repo?
No, but any labels you want can be added.
Have you seen this happening in more places?
No, just wondered if this is high priority issue or not (since no one really tracks the issues in this repo for now).
That code has been there for ages. I've seen it once on an SCT run; I'm guessing it happens more often, but we didn't notice or dig into the actual reason.
Technically, it might happen in any query that changes the schema.
seen it again in SCT:
2023-09-30 03:57:53.448: (DisruptionEvent Severity.ERROR) period_type=end event_id=61561cd1-abb2-4ff9-8b5d-f09bdb4574bf duration=19s: nemesis_name=ToggleTableGcMode target_node=Node longevity-parallel-topology-schema--db-node-b4eba44d-2 [54.74.133.187 | 10.4.11.167] (seed: True) errors=Encountered a bad command exit code!
Command: '/usr/bin/cqlsh --no-color --request-timeout=120 --connect-timeout=60 -e "ALTER TABLE scylla_bench.test WITH tombstone_gc = {\'mode\': \'repair\'};" 10.4.11.167 9042'
Exit code: 2
Stdout:
Stderr:
<stdin>:1:Warning: schema version mismatch detected; check the schema versions of your nodes in system.local and system.peers.
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5029, in wrapper
result = method(*args[1:], **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2722, in disrupt_toggle_table_gc_mode
self.toggle_table_gc_mode()
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2421, in toggle_table_gc_mode
self.target_node.run_cqlsh(cmd)
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2852, in run_cqlsh
cqlsh_out = self.remoter.run(cmd, timeout=timeout + 120, # we give 30 seconds to cqlsh timeout mechanism to work
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 613, in run
result = _run()
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 70, in inner
return func(*args, **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 604, in _run
return self._run_execute(cmd, timeout, ignore_status, verbose, new_session, watchers)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 537, in _run_execute
result = connection.run(**command_kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 620, in run
return self._complete_run(channel, exception, timeout_reached, timeout, result, warn, stdout, stderr)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 655, in _complete_run
raise UnexpectedExit(result)
sdcm.remote.libssh2_client.exceptions.UnexpectedExit: Encountered a bad command exit code!
Command: '/usr/bin/cqlsh --no-color --request-timeout=120 --connect-timeout=60 -e "ALTER TABLE scylla_bench.test WITH tombstone_gc = {\'mode\': \'repair\'};" 10.4.11.167 9042'
Exit code: 2
Stdout:
Stderr:
<stdin>:1:Warning: schema version mismatch detected; check the schema versions of your nodes in system.local and system.peers.
Kernel Version: 5.15.0-1045-aws
Scylla version (or git commit hash): 5.4.0~dev-20230927.0f22e8d196af
with build-id 2c911e6e2b12c7d0c19f67f76711e0c1adfea3cb
Cluster size: 5 nodes (i4i.2xlarge)
Scylla Nodes used in this run:
OS / Image: ami-0c25786faf310fa10
(aws: undefined_region)
Test: longevity-schema-topology-changes-12h-test
Test id: b4eba44d-f499-4758-b9c6-2c7f9304b2df
Test name: scylla-master/longevity/longevity-schema-topology-changes-12h-test
Test config file(s):
In SCT we might have schema changes during other nemeses that use cqlsh, and it's fine not to have schema agreement at that point.
If that is the case, then while schema changes are in progress a filter should be enabled to ignore these CQL command failures (and perhaps add retries). But the commands themselves should still fail if we hit these schema disagreements when no schema changes are happening.
cqlsh can warn about it, but it shouldn't fail.
SCT shouldn't be failing because of warnings.
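One way to express "don't fail on warnings" on the SCT side is sketched below. This is only an illustration, not the actual SCT fix: `CommandResult`, `is_schema_mismatch_warning_only` and `check_cqlsh_result` are hypothetical names, and the sketch assumes the statement itself was applied whenever the schema mismatch warning is the only thing on stderr.

```python
from dataclasses import dataclass

# Text taken from the stderr output in the reports above.
SCHEMA_MISMATCH_WARNING = "Warning: schema version mismatch detected"


@dataclass
class CommandResult:
    """Hypothetical stand-in for the result of a remote cqlsh invocation."""
    exit_code: int
    stdout: str
    stderr: str


def is_schema_mismatch_warning_only(result: CommandResult) -> bool:
    """True if stderr is non-empty and every line is the schema mismatch warning."""
    lines = [line.strip() for line in result.stderr.splitlines() if line.strip()]
    return bool(lines) and all(SCHEMA_MISMATCH_WARNING in line for line in lines)


def check_cqlsh_result(result: CommandResult) -> None:
    """Raise on real failures, but tolerate a warning-only non-zero exit."""
    if result.exit_code == 0:
        return
    if is_schema_mismatch_warning_only(result):
        # Assumption: the statement ran and only schema agreement was slow.
        return
    raise RuntimeError(f"cqlsh failed with exit code {result.exit_code}: {result.stderr}")
```

Any other stderr content, or an empty stderr with a non-zero exit code, is still treated as a failure, so real errors are not masked.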
@fruch I just saw you have a PR for fixing it.
2023.1.4
Nemesis Information
Class: Sisyphus
Name: disrupt_toggle_table_ics
Status: Failed
Failure reason
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5062, in wrapper
result = method(*args[1:], **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2784, in disrupt_toggle_table_ics
self.toggle_table_ics()
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2524, in toggle_table_ics
raise unexpected_exit
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2519, in toggle_table_ics
self.target_node.run_cqlsh(cmd)
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2744, in run_cqlsh
cqlsh_out = self.remoter.run(cmd, timeout=timeout + 120, # we give 30 seconds to cqlsh timeout mechanism to work
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 614, in run
result = _run()
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 70, in inner
return func(*args, **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 605, in _run
return self._run_execute(cmd, timeout, ignore_status, verbose, new_session, watchers)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 538, in _run_execute
result = connection.run(**command_kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 620, in run
return self._complete_run(channel, exception, timeout_reached, timeout, result, warn, stdout, stderr)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 655, in _complete_run
raise UnexpectedExit(result)
sdcm.remote.libssh2_client.exceptions.UnexpectedExit: Encountered a bad command exit code!
Command: '/usr/bin/cqlsh --no-color --request-timeout=120 --connect-timeout=60 -e "ALTER TABLE scylla_bench.test WITH compaction = {\'class\': \'IncrementalCompactionStrategy\', \'bucket_high\': 1.5, \'bucket_low\': 0.42, \'min_sstable_size\': 422, \'min_threshold\': 5, \'max_threshold\': 35, \'sstable_size_in_mb\': 352};" 10.12.11.74'
Exit code: 2
Stdout:
Stderr:
<stdin>:1:Warning: schema version mismatch detected; check the schema versions of your nodes in system.local and system.peers.
Kernel Version: 5.15.0-1051-aws
Scylla version (or git commit hash): 2023.1.4-20240116.bedad080681e
with build-id c0c543b5c81473e26e5111d8f379a77b786bc450
Cluster size: 5 nodes (i4i.2xlarge)
Scylla Nodes used in this run:
OS / Image: ami-001477f22cadc6d9c
(aws: undefined_region)
Test: longevity-schema-topology-changes-12h-test
Test id: 73ec8004-3881-4cbb-b37b-11c75c1c1126
Test name: enterprise-2023.1/longevity/longevity-schema-topology-changes-12h-test
Test config file(s):
Whenever a statement returns with is_schema_agreed == False, the code waits a hardcoded 5 seconds for schema agreement and uses self.printerr, which changes the return code of the whole cqlsh command to failure (2).
In this situation, every time the cluster is in disagreement, automation using this command assumes it's a failure. In any case, the code does a refresh without a timeout right after the failure, so there isn't much point in setting the error code.
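The effect described above can be illustrated with a toy model. This is not the cqlsh source: `MiniShell`, `printwarn` and the `warn_only` switch are hypothetical, and the only details taken from this thread are that reporting through printerr ends in exit code 2 and the text of the warning.

```python
class MiniShell:
    """Toy model of a shell that derives its exit code from an error flag."""

    def __init__(self):
        self.statement_error = False
        self.messages = []

    def printerr(self, text):
        # Reporting through the error path marks the whole statement as failed.
        self.statement_error = True
        self.messages.append(text)

    def printwarn(self, text):
        # Hypothetical warning path: report the message without marking failure.
        self.messages.append(text)

    def run_statement(self, schema_agreed, warn_only):
        """Return the exit code for a statement that itself succeeded."""
        self.statement_error = False
        if not schema_agreed:
            report = self.printwarn if warn_only else self.printerr
            report("Warning: schema version mismatch detected")
        return 2 if self.statement_error else 0
```

With `warn_only=False` a schema disagreement turns a successful statement into exit code 2, which is what the automation sees as a failure. With `warn_only=True` the message is still reported but the exit code stays 0.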