scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

When starting a longevity with a scylla_repo installation instead of scylla AMI, scylla is not installed on any node that is added by nemeses #6591

Open ShlomiBalalis opened 1 year ago

ShlomiBalalis commented 1 year ago

Description

This longevity test, unlike most, does not use a Scylla image. Instead, it starts from a clean Ubuntu image and installs Scylla on top of it. The cluster starts fine and the first few nemesis runs pass, but as soon as a nemesis tries to add a new node, SCT does not install Scylla on that node, and the nemesis fails:

< t:2023-09-10 22:13:25,279 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "/usr/bin/scylla --version"...
< t:2023-09-10 22:13:25,780 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > bash: /usr/bin/scylla: No such file or directory
< t:2023-09-10 22:13:25,780 f:cluster.py      l:2122 c:sdcm.cluster_aws     p:DEBUG > Node longevity-fips-2023-1-db-node-86ad51f9-7 [18.202.219.215 | 10.4.2.30] (seed: False): Unable to get ScyllaDB version using `/usr/bin/scylla --version':
< t:2023-09-10 22:13:25,780 f:cluster.py      l:2122 c:sdcm.cluster_aws     p:DEBUG > 
< t:2023-09-10 22:13:25,780 f:cluster.py      l:2122 c:sdcm.cluster_aws     p:DEBUG > bash: /usr/bin/scylla: No such file or directory
< t:2023-09-10 22:13:25,780 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "dpkg-query --show --showformat '${Version}' scylla"...
< t:2023-09-10 22:13:26,007 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > total,      77247749,    9185,    9185,    9185,     4.3,     3.3,    10.5,    14.7,    31.9,    40.6, 8345.0,  0.00345,      0,      0,       0,       0,       0,       0
< t:2023-09-10 22:13:26,014 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > total,      79588055,   10020,   10020,   10020,     3.9,     3.2,     8.6,    13.3,    23.8,    39.7, 8080.0,  0.00063,      0,      0,       0,       0,       0,       0
< t:2023-09-10 22:13:26,282 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > dpkg-query: no packages found matching scylla
< t:2023-09-10 22:13:26,282 f:cluster.py      l:2134 c:sdcm.cluster_aws     p:DEBUG > Node longevity-fips-2023-1-db-node-86ad51f9-7 [18.202.219.215 | 10.4.2.30] (seed: False): Unable to get ScyllaDB version using `dpkg-query --show --showformat '${Version}' scylla':
< t:2023-09-10 22:13:26,282 f:cluster.py      l:2134 c:sdcm.cluster_aws     p:DEBUG > 
< t:2023-09-10 22:13:26,282 f:cluster.py      l:2134 c:sdcm.cluster_aws     p:DEBUG > dpkg-query: no packages found matching scylla
< t:2023-09-10 22:13:26,282 f:cluster.py      l:2144 c:sdcm.cluster_aws     p:WARNING > Node longevity-fips-2023-1-db-node-86ad51f9-7 [18.202.219.215 | 10.4.2.30] (seed: False): All attempts to get ScyllaDB version failed. Looks like there is no ScyllaDB installed.
< t:2023-09-10 22:13:26,285 f:nemesis.py      l:4490 c:sdcm.nemesis         p:ERROR > sdcm.nemesis.SisyphusMonkey: Unhandled exception in method <bound method Nemesis.disrupt_grow_shrink_cluster of <sdcm.nemesis.SisyphusMonkey object at 0x7f7a4c0f2770>>
< t:2023-09-10 22:13:26,285 f:nemesis.py      l:4490 c:sdcm.nemesis         p:ERROR > sdcm.nemesis.SisyphusMonkey: Unhandled exception in method <bound method Nemesis.disrupt_grow_shrink_cluster of <sdcm.nemesis.SisyphusMonkey object at 0x7f7a4c0f2770>>
< t:2023-09-10 22:13:26,285 f:nemesis.py      l:4490 c:sdcm.nemesis         p:ERROR > Traceback (most recent call last):
< t:2023-09-10 22:13:26,285 f:nemesis.py      l:4490 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4479, in wrapper
< t:2023-09-10 22:13:26,285 f:nemesis.py      l:4490 c:sdcm.nemesis         p:ERROR >     result = method(*args[1:], **kwargs)
< t:2023-09-10 22:13:26,285 f:nemesis.py      l:4490 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3725, in disrupt_grow_shrink_cluster
< t:2023-09-10 22:13:26,285 f:nemesis.py      l:4490 c:sdcm.nemesis         p:ERROR >     self._grow_cluster(rack=None)
< t:2023-09-10 22:13:26,285 f:nemesis.py      l:4490 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3746, in _grow_cluster
< t:2023-09-10 22:13:26,285 f:nemesis.py      l:4490 c:sdcm.nemesis         p:ERROR >     added_node = self.add_new_node(rack=rack_idx)
< t:2023-09-10 22:13:26,285 f:nemesis.py      l:4490 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 171, in wrapped
< t:2023-09-10 22:13:26,285 f:nemesis.py      l:4490 c:sdcm.nemesis         p:ERROR >     res = func(*args, **kwargs)
< t:2023-09-10 22:13:26,285 f:nemesis.py      l:4490 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3696, in add_new_node
< t:2023-09-10 22:13:26,285 f:nemesis.py      l:4490 c:sdcm.nemesis         p:ERROR >     return self._add_and_init_new_cluster_node(rack=rack)
< t:2023-09-10 22:13:26,285 f:nemesis.py      l:4490 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1211, in _add_and_init_new_cluster_node
< t:2023-09-10 22:13:26,285 f:nemesis.py      l:4490 c:sdcm.nemesis         p:ERROR >     if new_node.is_replacement_by_host_id_supported:
< t:2023-09-10 22:13:26,285 f:nemesis.py      l:4490 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2774, in is_replacement_by_host_id_supported
< t:2023-09-10 22:13:26,285 f:nemesis.py      l:4490 c:sdcm.nemesis         p:ERROR >     return ComparableScyllaVersion(self.scylla_version) > '5.2.0~dev'
< t:2023-09-10 22:13:26,285 f:nemesis.py      l:4490 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/version_utils.py", line 123, in __init__
< t:2023-09-10 22:13:26,285 f:nemesis.py      l:4490 c:sdcm.nemesis         p:ERROR >     parsed_version = self.parse(version_string)
< t:2023-09-10 22:13:26,285 f:nemesis.py      l:4490 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/version_utils.py", line 165, in parse
< t:2023-09-10 22:13:26,285 f:nemesis.py      l:4490 c:sdcm.nemesis         p:ERROR >     raise ValueError(
< t:2023-09-10 22:13:26,285 f:nemesis.py      l:4490 c:sdcm.nemesis         p:ERROR > ValueError: Cannot parse provided 'None' scylla_version for the comparison. Transformed scylla_version: 
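
The traceback ends in a version comparison that receives `None` because no Scylla version could be detected. A minimal sketch of a guard that fails with a clearer error before the comparison is attempted follows. Everything here is illustrative: `parse_version` is a simplified stand-in rather than SCT's `ComparableScyllaVersion`, `ScyllaNotInstalledError` is a hypothetical name, and the `>= (5, 2, 0)` threshold only approximates the original `> '5.2.0~dev'` check.

```python
import re
from typing import Optional, Tuple


class ScyllaNotInstalledError(RuntimeError):
    """Hypothetical error type for a node with no detectable Scylla version."""


def parse_version(version_string: str) -> Tuple[int, int, int]:
    """Simplified stand-in parser; NOT SCT's ComparableScyllaVersion."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version_string)
    if match is None:
        raise ValueError(f"Cannot parse {version_string!r} as a version")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def is_replacement_by_host_id_supported(scylla_version: Optional[str]) -> bool:
    """Check the version only after confirming that one was detected."""
    if scylla_version is None:
        # Version detection returned nothing, as in the log above.
        raise ScyllaNotInstalledError("no ScyllaDB version detected on node")
    # Assumption: approximates the original "> '5.2.0~dev'" comparison.
    return parse_version(scylla_version) >= (5, 2, 0)
```

A guard like this would only make the failure easier to read. The underlying problem, that Scylla was never installed on the new node, would remain.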

Steps to Reproduce

  1. Start a longevity test that uses a clean machine image and installs Scylla on top of it (scylla_repo installation).
  2. Start a nemesis that adds a new node.
  3. The nemesis fails, since Scylla is not installed on the new node.

Expected behavior: scylla will be installed on any new node

Actual behavior: scylla is not installed
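
The expected behavior could be sketched roughly as follows: before a newly added node is used, check whether Scylla is present and install it from the configured repository if it is not. This is a sketch under assumptions, not SCT code. `FakeNode`, `ensure_scylla_installed`, and the `scylla_installed` flag are hypothetical names, and a real implementation would run package-manager commands on the node instead of recording a string.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FakeNode:
    """Hypothetical stand-in for a cluster node; not an SCT class."""
    name: str
    scylla_installed: bool = False
    actions: List[str] = field(default_factory=list)

    def install_scylla(self, repo_url: str) -> None:
        # A real node would run package-manager commands over SSH here.
        self.actions.append(f"install scylla from {repo_url}")
        self.scylla_installed = True


def ensure_scylla_installed(node: FakeNode, scylla_repo: Optional[str]) -> None:
    """Install Scylla on a freshly added node when the image does not ship it."""
    if node.scylla_installed:
        return  # images that already contain Scylla need no extra step
    if not scylla_repo:
        raise RuntimeError(f"{node.name}: Scylla missing and no scylla_repo configured")
    node.install_scylla(scylla_repo)
```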

Installation details

Kernel Version: 5.4.0-1021-aws-fips

Scylla version (or git commit hash): 2023.1.1-20230906.f4633ec973b0 with build-id b454e7a22f80cf71a33b2f39e47127225e8fbc13

Cluster size: 6 nodes (i3.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-014603057f9da7d50 (aws: eu-west-1)

Test: longevity-100gb-4h-fips-test

Test id: 86ad51f9-6a75-466b-8a07-da7553a7ac48

Test name: enterprise-2023.1/SCT_Enterprise_Features/FIPS/longevity-100gb-4h-fips-test

Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 86ad51f9-6a75-466b-8a07-da7553a7ac48`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=86ad51f9-6a75-466b-8a07-da7553a7ac48)
- Show all stored logs command: `$ hydra investigate show-logs 86ad51f9-6a75-466b-8a07-da7553a7ac48`

Logs:

- **db-cluster-86ad51f9.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/86ad51f9-6a75-466b-8a07-da7553a7ac48/20230911_001642/db-cluster-86ad51f9.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/86ad51f9-6a75-466b-8a07-da7553a7ac48/20230911_001642/db-cluster-86ad51f9.tar.gz)
- **sct-runner-events-86ad51f9.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/86ad51f9-6a75-466b-8a07-da7553a7ac48/20230911_001642/sct-runner-events-86ad51f9.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/86ad51f9-6a75-466b-8a07-da7553a7ac48/20230911_001642/sct-runner-events-86ad51f9.tar.gz)
- **sct-86ad51f9.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/86ad51f9-6a75-466b-8a07-da7553a7ac48/20230911_001642/sct-86ad51f9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/86ad51f9-6a75-466b-8a07-da7553a7ac48/20230911_001642/sct-86ad51f9.log.tar.gz)
- **monitor-set-86ad51f9.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/86ad51f9-6a75-466b-8a07-da7553a7ac48/20230911_001642/monitor-set-86ad51f9.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/86ad51f9-6a75-466b-8a07-da7553a7ac48/20230911_001642/monitor-set-86ad51f9.tar.gz)
- **loader-set-86ad51f9.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/86ad51f9-6a75-466b-8a07-da7553a7ac48/20230911_001642/loader-set-86ad51f9.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/86ad51f9-6a75-466b-8a07-da7553a7ac48/20230911_001642/loader-set-86ad51f9.tar.gz)
- **parallel-timelines-report-86ad51f9.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/86ad51f9-6a75-466b-8a07-da7553a7ac48/20230911_001642/parallel-timelines-report-86ad51f9.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/86ad51f9-6a75-466b-8a07-da7553a7ac48/20230911_001642/parallel-timelines-report-86ad51f9.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/enterprise-2023.1/job/SCT_Enterprise_Features/job/FIPS/job/longevity-100gb-4h-fips-test/3/)

[Argus](https://argus.scylladb.com/test/0acad768-a634-439e-81c7-24002913e027/runs?additionalRuns[]=86ad51f9-6a75-466b-8a07-da7553a7ac48)

fruch commented 10 months ago

as noted in https://github.com/scylladb/scylla-cluster-tests/pull/6596#issuecomment-1718335013

this is a can of worms: none of the nemesis code handles the case where Scylla has to be installed on a newly added node, so there is much more work left before this scylla_repo-based longevity can work

roydahan commented 10 months ago

Let's change this longevity to run only "non-disruptive" nemeses. There is nothing special here that requires testing topology changes or anything like that.
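
The suggestion above amounts to filtering the nemesis pool by a flag. A minimal sketch follows, assuming each nemesis class carries a boolean `disruptive` attribute. The class names and the `select_non_disruptive` helper are hypothetical, and the actual selection mechanism in SCT may differ.

```python
from typing import List, Type


class Nemesis:
    """Hypothetical base class; the `disruptive` flag is an assumption."""
    disruptive: bool = False


class GrowShrinkCluster(Nemesis):
    disruptive = True  # adds and removes nodes, so it changes topology


class NoOp(Nemesis):
    disruptive = False  # leaves the cluster untouched


def select_non_disruptive(candidates: List[Type[Nemesis]]) -> List[Type[Nemesis]]:
    """Keep only nemesis classes that are not flagged as disruptive."""
    return [candidate for candidate in candidates if not candidate.disruptive]
```

With a filter like this, a nemesis that adds nodes would never be scheduled, so the missing installation step would not be reached.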