scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

Upgrade test may use latest available version as base one #6640

Open vponomaryov opened 11 months ago

vponomaryov commented 11 months ago

Issue description

The 5.2 minor upgrade job failed with the following error:

2023-09-28 16:25:43,867 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR > 2023-09-28 16:25:43.865: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=e4eb6112-3c7d-4737-8e68-3221cfe805c0, source=UpgradeTest.test_rolling_upgrade (upgrade_test.UpgradeTest)() message=Traceback (most recent call last):
2023-09-28 16:25:43,867 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/upgrade_test.py", line 594, in test_rolling_upgrade
2023-09-28 16:25:43,867 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >     self.upgrade_node(self.db_cluster.node_to_upgrade)
2023-09-28 16:25:43,867 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/upgrade_test.py", line 60, in inner
2023-09-28 16:25:43,867 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >     func_result = func(self, *args, **kwargs)
2023-09-28 16:25:43,867 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/sct_events/group_common_events.py", line 256, in inner_func
2023-09-28 16:25:43,867 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >     return func(*args, **kwargs)
2023-09-28 16:25:43,867 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/upgrade_test.py", line 263, in upgrade_node
2023-09-28 16:25:43,867 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >     assert self.orig_ver != new_ver, "scylla-server version isn't changed"
2023-09-28 16:25:43,867 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR > AssertionError: scylla-server version isn't changed

The root cause is that the base version installed was already the latest one:

2023-09-28 14:00:09,483 f:cluster.py      l:2148 c:sdcm.cluster_gce     p:INFO  > Node rolling-upgrade--debian-buster-db-node-cd81cf9b-0-1 [35.227.75.195 | 10.142.0.68] (seed: True): Found ScyllaDB version with details: 5.2.9-0.20230920.5709d0043978 with build-id 686601fd1656c6724f7f042163b9285bf3efd582

But it should have been 5.2.8.
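
For illustration, a minimal sketch of the check that trips in `upgrade_node()` (upgrade_test.py, line 263 in this run), with the version strings taken from the log above: because the base install already picked up the freshly promoted 5.2.9 build, the "upgrade" step reinstalls the same version and the assertion fails.

```python
# Minimal sketch of the failing check; both values are the ones observed in this run.
orig_ver = "5.2.9-0.20230920.5709d0043978"  # version found right after the base install
new_ver = "5.2.9-0.20230920.5709d0043978"   # version found after the "upgrade" step
assert orig_ver != new_ver, "scylla-server version isn't changed"  # raises AssertionError
```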

Impact

Upgrade job fails.

How frequently does it reproduce?

Observed first time.

Installation details

Kernel Version: 4.19.0-25-cloud-amd64
Scylla version (or git commit hash): 5.2.9-20230920.5709d0043978 with build-id 686601fd1656c6724f7f042163b9285bf3efd582

Cluster size: 4 nodes (n1-highmem-8)

Scylla Nodes used in this run:

OS / Image: https://www.googleapis.com/compute/v1/projects/debian-cloud/global/images/family/debian-10 (gce: us-east1)

Test: rolling-upgrade-debian10-test
Test id: cd81cf9b-710b-4210-98a2-6b15057a34a9
Test name: scylla-5.2/rolling-upgrade/rolling-upgrade-debian10-test
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor cd81cf9b-710b-4210-98a2-6b15057a34a9`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=cd81cf9b-710b-4210-98a2-6b15057a34a9)
- Show all stored logs command: `$ hydra investigate show-logs cd81cf9b-710b-4210-98a2-6b15057a34a9`

## Logs:

- **db-cluster-cd81cf9b.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/cd81cf9b-710b-4210-98a2-6b15057a34a9/20230928_164017/db-cluster-cd81cf9b.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/cd81cf9b-710b-4210-98a2-6b15057a34a9/20230928_164017/db-cluster-cd81cf9b.tar.gz)
- **sct-runner-events-cd81cf9b.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/cd81cf9b-710b-4210-98a2-6b15057a34a9/20230928_164017/sct-runner-events-cd81cf9b.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/cd81cf9b-710b-4210-98a2-6b15057a34a9/20230928_164017/sct-runner-events-cd81cf9b.tar.gz)
- **sct-cd81cf9b.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/cd81cf9b-710b-4210-98a2-6b15057a34a9/20230928_164017/sct-cd81cf9b.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/cd81cf9b-710b-4210-98a2-6b15057a34a9/20230928_164017/sct-cd81cf9b.log.tar.gz)
- **monitor-set-cd81cf9b.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/cd81cf9b-710b-4210-98a2-6b15057a34a9/20230928_164017/monitor-set-cd81cf9b.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/cd81cf9b-710b-4210-98a2-6b15057a34a9/20230928_164017/monitor-set-cd81cf9b.tar.gz)
- **loader-set-cd81cf9b.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/cd81cf9b-710b-4210-98a2-6b15057a34a9/20230928_164017/loader-set-cd81cf9b.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/cd81cf9b-710b-4210-98a2-6b15057a34a9/20230928_164017/loader-set-cd81cf9b.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-5.2/job/rolling-upgrade/job/rolling-upgrade-debian10-test/27/)
[Argus](https://argus.scylladb.com/test/778af571-ea22-4c8f-ab32-34774c8fbac0/runs?additionalRuns[]=cd81cf9b-710b-4210-98a2-6b15057a34a9)
fgelcer commented 11 months ago

> But it should have been 5.2.8.

Why? If 5.2.9 was already promoted at the time the test ran, then the latest available base version is 5.2.9, not 5.2.8.

Note that, for any run, we can set the base version[s] explicitly to what you expect, if you know what to use.

fruch commented 11 months ago

> But it should have been 5.2.8.
>
> Why? If 5.2.9 was already promoted at the time the test ran, then the latest available base version is 5.2.9, not 5.2.8.
>
> Note that, for any run, we can set the base version[s] explicitly to what you expect, if you know what to use.

I think these happen on reruns after the new version was already promoted.

vponomaryov commented 11 months ago

@fgelcer , @fruch Do I understand it correctly that I needed to update the `new_scylla_repo` option value before rerunning? If so, then yes, it is a configuration issue and this bug report may be closed.

fruch commented 11 months ago

> @fgelcer , @fruch Do I understand it correctly that I needed to update the `new_scylla_repo` option value before rerunning? If so, then yes, it is a configuration issue and this bug report may be closed.

We can close it, but on the other hand, this confusion happens again and again (it has for years now).

Maybe we should minimize the time it takes to figure this out and notify the user about it a bit more clearly.

We could do that before we even start the whole upgrade process; the check would take a matter of seconds instead of the hour+ it takes now.
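
A minimal sketch of that kind of fail-fast check, purely illustrative and not an existing SCT API; how `base_version` and `target_version` get resolved (from the node and from `new_scylla_repo`) is left out here:

```python
# Hypothetical pre-flight check run before the rolling upgrade starts, so a
# base-equals-target misconfiguration is reported in seconds, not after an hour.
def preflight_upgrade_check(base_version: str, target_version: str) -> None:
    """Fail fast, with a clear message, when the upgrade would be a no-op."""
    if base_version == target_version:
        raise RuntimeError(
            f"Base version {base_version} is already the target version "
            f"{target_version}. The base repo probably points at a build promoted "
            "after this job was configured; pin the base version or update "
            "new_scylla_repo, then rerun."
        )

# The situation from this run would be rejected immediately:
# preflight_upgrade_check("5.2.9-0.20230920.5709d0043978",
#                         "5.2.9-0.20230920.5709d0043978")  # -> RuntimeError
```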

fgelcer commented 11 months ago

> @fgelcer , @fruch Do I understand it correctly that I needed to update the `new_scylla_repo` option value before rerunning? If so, then yes, it is a configuration issue and this bug report may be closed.

Well, it depends on what you plan to do... let's say you are testing 5.2.9: when you use the repo URL that points to the latest, you install the latest promoted release, which in a regular flow (before we release patch releases) would be 5.2.8.

If you want to rebuild the exact same scenario (upgrade from 5.2.8 to 5.2.9), you need to set the base version to exactly 5.2.8.

The mechanism that calculates the base version checks what version we have in the files at the repo URL and looks at the latest releases to achieve that (in the case of 5.2, it should end up with 2 base versions, 5.1.x and 5.2.y).

If you already have a new 5.2.Z (built to debug something, for example) and you want to upgrade from the latest released 5.2, then you only need to set the new repo (`new_scylla_repo`) pointing to that new URL.
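
To make that selection concrete, here is a rough standalone illustration (not the actual SCT code) of picking base versions for a target release from a list of released versions:

```python
# Rough illustration of the base-version selection described above: for a target
# like 5.2.9 the expected bases are the latest release of the previous branch
# (5.1.x) and the latest same-branch release that is older than the target (5.2.8).
def pick_base_versions(target: str, released: list[str]) -> list[str]:
    as_tuple = lambda v: tuple(int(p) for p in v.split("."))
    t_major, t_minor, t_patch = as_tuple(target)
    prev_branch = [v for v in released if as_tuple(v)[:2] < (t_major, t_minor)]
    same_branch = [v for v in released
                   if as_tuple(v)[:2] == (t_major, t_minor) and as_tuple(v)[2] < t_patch]
    return [max(group, key=as_tuple) for group in (prev_branch, same_branch) if group]

print(pick_base_versions("5.2.9", ["5.1.14", "5.1.15", "5.2.7", "5.2.8", "5.2.9"]))
# -> ['5.1.15', '5.2.8']; if the base node already runs 5.2.9, the upgrade is a no-op.
```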

fgelcer commented 11 months ago

> @fgelcer , @fruch Do I understand it correctly that I needed to update the `new_scylla_repo` option value before rerunning? If so, then yes, it is a configuration issue and this bug report may be closed.
>
> We can close it, but on the other hand, this confusion happens again and again (it has for years now).
>
> Maybe we should minimize the time it takes to figure this out and notify the user about it a bit more clearly.
>
> We could do that before we even start the whole upgrade process; the check would take a matter of seconds instead of the hour+ it takes now.

We could, but I am not sure it is easy to do, as we depend on files the OS downloads from the repo installed on it, so you only have that information after installing the packages during the upgrade.

fruch commented 11 months ago

> @fgelcer , @fruch Do I understand it correctly that I needed to update the `new_scylla_repo` option value before rerunning? If so, then yes, it is a configuration issue and this bug report may be closed.
>
> We can close it, but on the other hand, this confusion happens again and again (it has for years now).
>
> Maybe we should minimize the time it takes to figure this out and notify the user about it a bit more clearly.
>
> We could do that before we even start the whole upgrade process; the check would take a matter of seconds instead of the hour+ it takes now.
>
> We could, but I am not sure it is easy to do, as we depend on files the OS downloads from the repo installed on it, so you only have that information after installing the packages during the upgrade.

If the base is installed from a repo, and not from images, it's easy to check, and we should already have the code to do so without installation.

Anyhow, at the very least the failure should mention this possible reason, or reference an issue like this one.

fgelcer commented 11 months ago

> @fgelcer , @fruch Do I understand it correctly that I needed to update the `new_scylla_repo` option value before rerunning? If so, then yes, it is a configuration issue and this bug report may be closed.
>
> We can close it, but on the other hand, this confusion happens again and again (it has for years now). Maybe we should minimize the time it takes to figure this out and notify the user about it a bit more clearly. We could do that before we even start the whole upgrade process; the check would take a matter of seconds instead of the hour+ it takes now.
>
> We could, but I am not sure it is easy to do, as we depend on files the OS downloads from the repo installed on it, so you only have that information after installing the packages during the upgrade.
>
> If the base is installed from a repo, and not from images, it's easy to check, and we should already have the code to do so without installation.

Barely, because if we are not using pre-installed images, we have to install the base version first and only then the "upgrade" version.

> Anyhow, at the very least the failure should mention this possible reason, or reference an issue like this one.

Agree with that, but what is failing now is the upgrade function, which finds out that the version did not change.

fruch commented 11 months ago

> @fgelcer , @fruch Do I understand it correctly that I needed to update the `new_scylla_repo` option value before rerunning? If so, then yes, it is a configuration issue and this bug report may be closed.
>
> We can close it, but on the other hand, this confusion happens again and again (it has for years now). Maybe we should minimize the time it takes to figure this out and notify the user about it a bit more clearly. We could do that before we even start the whole upgrade process; the check would take a matter of seconds instead of the hour+ it takes now.
>
> We could, but I am not sure it is easy to do, as we depend on files the OS downloads from the repo installed on it, so you only have that information after installing the packages during the upgrade.
>
> If the base is installed from a repo, and not from images, it's easy to check, and we should already have the code to do so without installation.
>
> Barely, because if we are not using pre-installed images, we have to install the base version first and only then the "upgrade" version.

The `get_branch_version` function should be able to do that without any installation or upgrade.
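
A hedged sketch of what that could look like, assuming SCT's `get_branch_version()` helper (in `sdcm/utils/version_utils.py`) accepts a repo URL and returns the Scylla version it advertises; the exact signature may differ, and the assertion message is only an example:

```python
# Sketch only: assumes get_branch_version(repo_url) resolves the scylla-server
# version behind a repo URL without installing anything on a node.
from sdcm.utils.version_utils import get_branch_version


def assert_upgrade_would_change_version(base_repo_url: str, new_repo_url: str) -> None:
    base_ver = get_branch_version(base_repo_url)
    new_ver = get_branch_version(new_repo_url)
    assert base_ver != new_ver, (
        f"new_scylla_repo resolves to {new_ver}, which is already the base version "
        f"({base_ver}); the base was probably promoted after this job was configured. "
        "Update new_scylla_repo (or pin the base version) and rerun."
    )
```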

> Anyhow, at the very least the failure should mention this possible reason, or reference an issue like this one.
>
> Agree with that, but what is failing now is the upgrade function, which finds out that the version did not change.