scylladb / scylla-cqlsh

A fork of the cqlsh code
Apache License 2.0

cqlsh shouldn't be returning failure on cases of schema disagreement #34

Closed fruch closed 11 months ago

fruch commented 1 year ago

Whenever a statement returns with is_schema_agreed == False, the code waits a hardcoded 5 seconds for schema agreement and then uses self.printerr, which changes the return code of the whole cqlsh command to failure (2).

        # Even if statement failed we try to refresh schema if not agreed (see CASSANDRA-9689)
        if not future.is_schema_agreed:
            try:
                self.conn.refresh_schema_metadata(5)  # will throw exception if there is a schema mismatch
            except Exception:
                self.printerr("Warning: schema version mismatch detected; check the schema versions of your "
                              "nodes in system.local and system.peers.")
                self.conn.refresh_schema_metadata(-1)

In this situation, every time the cluster is in disagreement, automation using this command assumes it's a failure. In any case, the code refreshes without a timeout right after the failure, so there's not much point in setting the error code.
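One possible direction, sketched below under stated assumptions: print the same warning to stderr but leave the error flag untouched. `ShellSketch`, `printwarn`, `FakeConn`, and `FakeFuture` are hypothetical names used only for illustration, and the assumption that a `statement_error` flag set by `printerr` drives exit code 2 is a simplified model, not a copy of the real cqlsh code.

```python
import io
import sys


class FakeFuture:
    """Hypothetical stand-in for a driver response future."""

    def __init__(self, is_schema_agreed):
        self.is_schema_agreed = is_schema_agreed


class FakeConn:
    """Hypothetical connection: a bounded wait for agreement always fails."""

    def __init__(self):
        self.calls = []

    def refresh_schema_metadata(self, max_wait):
        self.calls.append(max_wait)
        if max_wait > 0:
            raise Exception("schema agreement not reached")


class ShellSketch:
    """Simplified model of the shell; not the real cqlsh Shell class."""

    def __init__(self, conn, err=None):
        self.conn = conn
        self.err = err if err is not None else sys.stderr
        # Assumption: in batch mode this flag decides whether to exit with code 2.
        self.statement_error = False

    def printerr(self, text):
        # Assumption: printing an error also marks the statement as failed.
        self.statement_error = True
        print(text, file=self.err)

    def printwarn(self, text):
        # Hypothetical helper: same output, but the error flag is left alone.
        print(text, file=self.err)

    def refresh_schema_if_needed(self, future):
        if future.is_schema_agreed:
            return
        try:
            self.conn.refresh_schema_metadata(5)
        except Exception:
            self.printwarn("Warning: schema version mismatch detected; check the "
                           "schema versions of your nodes in system.local and "
                           "system.peers.")
            self.conn.refresh_schema_metadata(-1)


if __name__ == "__main__":
    shell = ShellSketch(FakeConn())
    shell.refresh_schema_if_needed(FakeFuture(is_schema_agreed=False))
    print("statement_error:", shell.statement_error)
```

With this shape the warning still reaches stderr, so automation can log it, while the exit status reflects only real statement failures.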

roydahan commented 1 year ago

@fruch is there a priority label or triage label in this repo?

fruch commented 1 year ago

No, but any labels you want can be added.

fruch commented 1 year ago

> @fruch is there a priority label or triage label in this repo?

Have you seen this happening in more places?

roydahan commented 1 year ago

No, I just wondered whether this is a high-priority issue or not (since no one really tracks the issues in this repo for now).

fruch commented 1 year ago

> No, just wondered if this is high priority issue or not (since no one really tracks the issues in this repo for now).

That code has been there for ages. I've seen it once on an SCT run; I'm guessing it happens more often, but we didn't notice or dig into the actual reason.

Technically, it might happen in any query that changes the schema.

fruch commented 1 year ago

seen it again in SCT:

2023-09-30 03:57:53.448: (DisruptionEvent Severity.ERROR) period_type=end event_id=61561cd1-abb2-4ff9-8b5d-f09bdb4574bf duration=19s: nemesis_name=ToggleTableGcMode target_node=Node longevity-parallel-topology-schema--db-node-b4eba44d-2 [54.74.133.187 | 10.4.11.167] (seed: True) errors=Encountered a bad command exit code!
Command: '/usr/bin/cqlsh --no-color   --request-timeout=120 --connect-timeout=60  -e "ALTER TABLE  scylla_bench.test WITH tombstone_gc = {\'mode\': \'repair\'};" 10.4.11.167 9042'
Exit code: 2
Stdout:
Stderr:
<stdin>:1:Warning: schema version mismatch detected; check the schema versions of your nodes in system.local and system.peers.
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5029, in wrapper
result = method(*args[1:], **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2722, in disrupt_toggle_table_gc_mode
self.toggle_table_gc_mode()
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2421, in toggle_table_gc_mode
self.target_node.run_cqlsh(cmd)
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2852, in run_cqlsh
cqlsh_out = self.remoter.run(cmd, timeout=timeout + 120,  # we give 30 seconds to cqlsh timeout mechanism to work
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 613, in run
result = _run()
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 70, in inner
return func(*args, **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 604, in _run
return self._run_execute(cmd, timeout, ignore_status, verbose, new_session, watchers)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 537, in _run_execute
result = connection.run(**command_kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 620, in run
return self._complete_run(channel, exception, timeout_reached, timeout, result, warn, stdout, stderr)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 655, in _complete_run
raise UnexpectedExit(result)
sdcm.remote.libssh2_client.exceptions.UnexpectedExit: Encountered a bad command exit code!
Command: '/usr/bin/cqlsh --no-color   --request-timeout=120 --connect-timeout=60  -e "ALTER TABLE  scylla_bench.test WITH tombstone_gc = {\'mode\': \'repair\'};" 10.4.11.167 9042'
Exit code: 2
Stdout:
Stderr:
<stdin>:1:Warning: schema version mismatch detected; check the schema versions of your nodes in system.local and system.peers.

Installation details

Kernel Version: 5.15.0-1045-aws
Scylla version (or git commit hash): 5.4.0~dev-20230927.0f22e8d196af with build-id 2c911e6e2b12c7d0c19f67f76711e0c1adfea3cb

Cluster size: 5 nodes (i4i.2xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0c25786faf310fa10 (aws: undefined_region)

Test: longevity-schema-topology-changes-12h-test
Test id: b4eba44d-f499-4758-b9c6-2c7f9304b2df
Test name: scylla-master/longevity/longevity-schema-topology-changes-12h-test
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor b4eba44d-f499-4758-b9c6-2c7f9304b2df`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=b4eba44d-f499-4758-b9c6-2c7f9304b2df)
- Show all stored logs command: `$ hydra investigate show-logs b4eba44d-f499-4758-b9c6-2c7f9304b2df`

## Logs:

- **db-cluster-b4eba44d.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/b4eba44d-f499-4758-b9c6-2c7f9304b2df/20230930_083412/db-cluster-b4eba44d.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/b4eba44d-f499-4758-b9c6-2c7f9304b2df/20230930_083412/db-cluster-b4eba44d.tar.gz)
- **sct-runner-events-b4eba44d.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/b4eba44d-f499-4758-b9c6-2c7f9304b2df/20230930_083412/sct-runner-events-b4eba44d.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/b4eba44d-f499-4758-b9c6-2c7f9304b2df/20230930_083412/sct-runner-events-b4eba44d.tar.gz)
- **sct-b4eba44d.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/b4eba44d-f499-4758-b9c6-2c7f9304b2df/20230930_083412/sct-b4eba44d.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/b4eba44d-f499-4758-b9c6-2c7f9304b2df/20230930_083412/sct-b4eba44d.log.tar.gz)
- **loader-set-b4eba44d.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/b4eba44d-f499-4758-b9c6-2c7f9304b2df/20230930_083412/loader-set-b4eba44d.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/b4eba44d-f499-4758-b9c6-2c7f9304b2df/20230930_083412/loader-set-b4eba44d.tar.gz)
- **monitor-set-b4eba44d.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/b4eba44d-f499-4758-b9c6-2c7f9304b2df/20230930_083412/monitor-set-b4eba44d.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/b4eba44d-f499-4758-b9c6-2c7f9304b2df/20230930_083412/monitor-set-b4eba44d.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/longevity/job/longevity-schema-topology-changes-12h-test/16/)
[Argus](https://argus.scylladb.com/test/024e78ba-3671-4fd0-bafc-b48b2a978f59/runs?additionalRuns[]=b4eba44d-f499-4758-b9c6-2c7f9304b2df)

fruch commented 1 year ago

In SCT we might have schema changes during other nemeses that use cqlsh, and it's fine not to have schema agreement at that point.

fgelcer commented 1 year ago

> in SCT we might have schema changes during other nemesis that uses cqlsh, and it's fine to not have schema agreement at that point

If that is the case, then when we do have schema changes, SCT should enable a filter to ignore these CQL command failures (and perhaps add retries). But the commands themselves should still fail if we hit these schema disagreements without any schema changes.
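A minimal sketch of such a filter, assuming the caller already has the exit code and stderr of the cqlsh run and knows whether a schema change is in progress. The function names and the `schema_change_in_progress` parameter are hypothetical and are not part of SCT.

```python
# Prefix of the warning text seen in the stderr output quoted in this issue.
SCHEMA_MISMATCH_WARNING = "Warning: schema version mismatch detected"


def is_schema_mismatch_only(exit_code: int, stderr: str) -> bool:
    """True when exit code 2 comes with nothing but the schema-mismatch warning."""
    if exit_code != 2:
        return False
    lines = [line for line in stderr.splitlines() if line.strip()]
    # Every non-empty stderr line must be the warning; any other error disqualifies.
    return bool(lines) and all(SCHEMA_MISMATCH_WARNING in line for line in lines)


def should_ignore_failure(exit_code: int, stderr: str,
                          schema_change_in_progress: bool) -> bool:
    """Ignore the failure only while a schema change is known to be running."""
    return schema_change_in_progress and is_schema_mismatch_only(exit_code, stderr)


if __name__ == "__main__":
    sample = ("<stdin>:1:Warning: schema version mismatch detected; check the "
              "schema versions of your nodes in system.local and system.peers.\n")
    print(should_ignore_failure(2, sample, schema_change_in_progress=True))
    print(should_ignore_failure(2, sample, schema_change_in_progress=False))
```

Matching on the warning text is brittle if the message ever changes, which is one argument for fixing the exit code in cqlsh itself rather than filtering on the caller's side.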

fruch commented 1 year ago

cqlsh can warn about it, but it shouldn't fail.

SCT shouldn't be failing because of warnings.

roydahan commented 11 months ago

@fruch I just saw you have a PR for fixing it.

juliayakovlev commented 10 months ago

2023.1.4

Nemesis Information
Class: Sisyphus
Name: disrupt_toggle_table_ics
Status: Failed
Failure reason
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5062, in wrapper
    result = method(*args[1:], **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2784, in disrupt_toggle_table_ics
    self.toggle_table_ics()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2524, in toggle_table_ics
    raise unexpected_exit
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2519, in toggle_table_ics
    self.target_node.run_cqlsh(cmd)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2744, in run_cqlsh
    cqlsh_out = self.remoter.run(cmd, timeout=timeout + 120,  # we give 30 seconds to cqlsh timeout mechanism to work
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 614, in run
    result = _run()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 70, in inner
    return func(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 605, in _run
    return self._run_execute(cmd, timeout, ignore_status, verbose, new_session, watchers)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 538, in _run_execute
    result = connection.run(**command_kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 620, in run
    return self._complete_run(channel, exception, timeout_reached, timeout, result, warn, stdout, stderr)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 655, in _complete_run
    raise UnexpectedExit(result)
sdcm.remote.libssh2_client.exceptions.UnexpectedExit: Encountered a bad command exit code!

Command: '/usr/bin/cqlsh --no-color   --request-timeout=120 --connect-timeout=60  -e "ALTER TABLE  scylla_bench.test WITH compaction = {\'class\': \'IncrementalCompactionStrategy\', \'bucket_high\': 1.5, \'bucket_low\': 0.42, \'min_sstable_size\': 422, \'min_threshold\': 5, \'max_threshold\': 35, \'sstable_size_in_mb\': 352};" 10.12.11.74'

Exit code: 2

Stdout:

Stderr:

<stdin>:1:Warning: schema version mismatch detected; check the schema versions of your nodes in system.local and system.peers.

Installation details

Kernel Version: 5.15.0-1051-aws
Scylla version (or git commit hash): 2023.1.4-20240116.bedad080681e with build-id c0c543b5c81473e26e5111d8f379a77b786bc450

Cluster size: 5 nodes (i4i.2xlarge)

Scylla Nodes used in this run:

OS / Image: ami-001477f22cadc6d9c (aws: undefined_region)

Test: longevity-schema-topology-changes-12h-test
Test id: 73ec8004-3881-4cbb-b37b-11c75c1c1126
Test name: enterprise-2023.1/longevity/longevity-schema-topology-changes-12h-test
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 73ec8004-3881-4cbb-b37b-11c75c1c1126`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=73ec8004-3881-4cbb-b37b-11c75c1c1126)
- Show all stored logs command: `$ hydra investigate show-logs 73ec8004-3881-4cbb-b37b-11c75c1c1126`

## Logs:

- **db-cluster-73ec8004.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/73ec8004-3881-4cbb-b37b-11c75c1c1126/20240117_051521/db-cluster-73ec8004.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/73ec8004-3881-4cbb-b37b-11c75c1c1126/20240117_051521/db-cluster-73ec8004.tar.gz)
- **sct-runner-events-73ec8004.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/73ec8004-3881-4cbb-b37b-11c75c1c1126/20240117_051521/sct-runner-events-73ec8004.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/73ec8004-3881-4cbb-b37b-11c75c1c1126/20240117_051521/sct-runner-events-73ec8004.tar.gz)
- **sct-73ec8004.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/73ec8004-3881-4cbb-b37b-11c75c1c1126/20240117_051521/sct-73ec8004.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/73ec8004-3881-4cbb-b37b-11c75c1c1126/20240117_051521/sct-73ec8004.log.tar.gz)
- **loader-set-73ec8004.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/73ec8004-3881-4cbb-b37b-11c75c1c1126/20240117_051521/loader-set-73ec8004.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/73ec8004-3881-4cbb-b37b-11c75c1c1126/20240117_051521/loader-set-73ec8004.tar.gz)
- **monitor-set-73ec8004.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/73ec8004-3881-4cbb-b37b-11c75c1c1126/20240117_051521/monitor-set-73ec8004.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/73ec8004-3881-4cbb-b37b-11c75c1c1126/20240117_051521/monitor-set-73ec8004.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/enterprise-2023.1/job/longevity/job/longevity-schema-topology-changes-12h-test/11/)
[Argus](https://argus.scylladb.com/test/079cddb4-4eac-4594-a2f0-8452dab5eb40/runs?additionalRuns[]=73ec8004-3881-4cbb-b37b-11c75c1c1126)