scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

Upgrade db packages stops all nodes when growing cluster by 3 in parallel (custom db packages) #8551

Open · soyacz opened 2 weeks ago

soyacz commented 2 weeks ago

In a test with custom Scylla db packages (the `update_db_packages` param is set), when growing the cluster by 3 nodes in parallel, SCT stops all the nodes instead of only the newly added ones. Culprit line: https://github.com/scylladb/scylla-cluster-tests/blob/066dd0231cd80ccb29e9d503c4f22cc6221a912a/sdcm/cluster.py#L4220
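
For illustration only, here is a simplified sketch (not the actual code behind the culprit line) of the kind of shape that produces this symptom: the package-update path restarts every node in the cluster rather than only the nodes being added. The `install_packages` helper and the node stop/start method names are assumptions made for this sketch.

```python
# Illustrative sketch only -- NOT the actual sdcm/cluster.py logic.
# `install_packages` is a hypothetical helper; node stop/start method names are assumed.

def update_db_packages(cluster, added_nodes, packages_path):
    # Buggy shape: when custom packages are set, every node in the cluster is
    # stopped so the nodes can be restarted in a fixed order, which also takes
    # down the nodes that are still serving c-s traffic.
    for node in cluster.nodes:
        node.stop_scylla_server()
    for node in cluster.nodes:
        install_packages(node, packages_path)
        node.start_scylla_server()
```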

Impact

The test fails due to c-s errors when all the nodes are stopped.

How frequently does it reproduce?

Always, when growing the cluster in parallel and using custom db packages.

Installation details

Cluster size: 3 nodes (i4i.2xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0415b87a177bf40a6 (aws: undefined_region)

Test: scylla-enterprise-perf-regression-latency-650gb-elasticity
Test id: bc75f3a1-389f-4c3e-a84f-ef388d9bd03c
Test name: scylla-staging/lukasz/scylla-enterprise-perf-regression-latency-650gb-elasticity
Test method: performance_regression_test.PerformanceRegressionTest.test_latency_mixed_with_nemesis
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor bc75f3a1-389f-4c3e-a84f-ef388d9bd03c`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=bc75f3a1-389f-4c3e-a84f-ef388d9bd03c)
- Show all stored logs command: `$ hydra investigate show-logs bc75f3a1-389f-4c3e-a84f-ef388d9bd03c`

## Logs:

- **db-cluster-bc75f3a1.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/bc75f3a1-389f-4c3e-a84f-ef388d9bd03c/20240903_130445/db-cluster-bc75f3a1.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/bc75f3a1-389f-4c3e-a84f-ef388d9bd03c/20240903_130445/db-cluster-bc75f3a1.tar.gz)
- **sct-runner-events-bc75f3a1.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/bc75f3a1-389f-4c3e-a84f-ef388d9bd03c/20240903_130445/sct-runner-events-bc75f3a1.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/bc75f3a1-389f-4c3e-a84f-ef388d9bd03c/20240903_130445/sct-runner-events-bc75f3a1.tar.gz)
- **sct-bc75f3a1.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/bc75f3a1-389f-4c3e-a84f-ef388d9bd03c/20240903_130445/sct-bc75f3a1.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/bc75f3a1-389f-4c3e-a84f-ef388d9bd03c/20240903_130445/sct-bc75f3a1.log.tar.gz)
- **loader-set-bc75f3a1.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/bc75f3a1-389f-4c3e-a84f-ef388d9bd03c/20240903_130445/loader-set-bc75f3a1.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/bc75f3a1-389f-4c3e-a84f-ef388d9bd03c/20240903_130445/loader-set-bc75f3a1.tar.gz)
- **monitor-set-bc75f3a1.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/bc75f3a1-389f-4c3e-a84f-ef388d9bd03c/20240903_130445/monitor-set-bc75f3a1.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/bc75f3a1-389f-4c3e-a84f-ef388d9bd03c/20240903_130445/monitor-set-bc75f3a1.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-staging/job/lukasz/job/scylla-enterprise-perf-regression-latency-650gb-elasticity/2/)
[Argus](https://argus.scylladb.com/test/e1f8e3fc-cc80-48b7-bd28-39e08bd53d2c/runs?additionalRuns[]=bc75f3a1-389f-4c3e-a84f-ef388d9bd03c)
fruch commented 1 week ago

@soyacz this logic can go, we don't care about the ordering of starting nodes anymore, so we can remove that `if` and remove the whole `else` branch

we should just stop/start the nodes we are asked to; we shouldn't touch any other nodes at that point, it's a mistake
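
(For illustration, a minimal sketch of the simplification being suggested, assuming a hypothetical `install_packages` helper and assumed per-node stop/start method names; only the requested nodes are touched:)

```python
# Minimal sketch of the suggested shape -- not a drop-in patch.
# `install_packages` is a hypothetical helper; stop/start method names are assumed.

def update_db_packages(target_nodes, packages_path):
    # No ordering `if`/`else` over the whole cluster: only the nodes we were
    # asked to update are stopped, updated, and started again.
    for node in target_nodes:
        node.stop_scylla_server()
        install_packages(node, packages_path)
        node.start_scylla_server()
```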

soyacz commented 1 week ago

> @soyacz this logic can go, we don't care about the ordering of starting nodes anymore, so we can remove that `if` and remove the whole `else` branch
>
> we should just stop/start the nodes we are asked to; we shouldn't touch any other nodes at that point, it's a mistake

Yes, shouldn't be hard to fix, let's plan it for this sprint.