scylladb / scylla-cluster-tests

Tests for Scylla Clusters

Add testcase for scaling-in with 4 node cluster at 67% #9165

Open Lakshmipathi opened 1 week ago

Lakshmipathi commented 1 week ago

Starting with a 4-node cluster.

The cluster reached 67% disk usage, then the test waited for 30 minutes with no reads or writes.

[2024-11-09T08:32:41.257Z] < t:2024-11-09 08:32:41,038 f:full_storage_utilization_test.py l:157  c:FullStorageUtilizationTest p:INFO  > Current max disk usage after writing to keyspace12: 67% (292 GB / 292.12 GB)
[2024-11-09T08:32:43.574Z] < t:2024-11-09 08:32:43,091 f:full_storage_utilization_test.py l:125  c:FullStorageUtilizationTest p:INFO  > Wait for 1800 seconds

After the 30-minute idle period, a throttled write started:

[2024-11-09T09:04:17.601Z] < t:2024-11-09 09:04:16,771 f:stress_thread.py l:325  c:sdcm.stress_thread   p:INFO  > cassandra-stress write no-warmup duration=30m -rate threads=10 "throttle=1400/s" -mode cql3 native -pop seq=1..5000000 -col "size=FIXED(10240) n=FIXED(1)" -schema keyspace=keyspace1 "replication(strategy=NetworkTopologyStrategy,replication_factor=3)" -node 10.4.0.147,10.4.1.54,10.4.1.222,10.4.3.28 -errors skip-unsupported-columns
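
For a rough sense of how much data this throttled write adds, the flags above can be turned into a back-of-envelope estimate. This is only a sketch: it assumes the 1400/s throttle is sustained for the full 30 minutes and ignores compression, compaction, and per-row overhead. The function name is made up for illustration and is not part of the test code.

```python
def estimate_write_volume(ops_per_sec, duration_sec, value_bytes, replication_factor):
    """Return (rows written, raw payload bytes, bytes after replication).

    Hypothetical helper: a plain multiplication of the cassandra-stress flags,
    not a measurement of what actually lands on disk.
    """
    rows = ops_per_sec * duration_sec
    raw_bytes = rows * value_bytes
    return rows, raw_bytes, raw_bytes * replication_factor


# Values taken from the command line above:
# throttle=1400/s, duration=30m, size=FIXED(10240), replication_factor=3
rows, raw_bytes, replicated_bytes = estimate_write_volume(1400, 30 * 60, 10240, 3)

print(f"rows written:      {rows:,}")
print(f"raw payload:       {raw_bytes / 1e9:.1f} GB")
print(f"after replication: {replicated_bytes / 1e9:.1f} GB cluster-wide")
```

The row count stays inside the `seq=1..5000000` population range. If those keys already exist in keyspace1, the writes are overwrites and the net disk growth after compaction would be smaller than this estimate.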

A few minutes later, a node was removed from the 4-node cluster:

[2024-11-09T09:08:27.538Z] < t:2024-11-09 09:08:12,621 f:full_storage_utilization_test.py l:169  c:FullStorageUtilizationTest p:INFO  > Removing a second node from the cluster
[2024-11-09T09:25:51.985Z] < t:2024-11-09 09:25:51,026 f:common.py       l:43   c:sdcm.utils.tablets.common p:INFO  > Waiting for tablets to be balanced
[2024-11-09T09:26:55.605Z] < t:2024-11-09 09:26:53,243 f:common.py       l:48   c:sdcm.utils.tablets.common p:INFO  > Tablets are balanced
[2024-11-09T09:26:55.605Z] < t:2024-11-09 09:26:53,243 f:full_storage_utilization_test.py l:64   c:FullStorageUtilizationTest p:INFO  > Removing a node finished with time: 1120.621938943863
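
As a cross-check, the reported removal time can be recomputed from the `t:` timestamps in the log lines above. This is a minimal sketch; the format string is my reading of the log timestamp layout, not something taken from the test code.

```python
from datetime import datetime

# Assumed layout of the "t:" field in the SCT log lines (comma before milliseconds).
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def seconds_between(start, end):
    """Seconds elapsed between two log timestamps given as strings."""
    t0 = datetime.strptime(start, LOG_TIME_FORMAT)
    t1 = datetime.strptime(end, LOG_TIME_FORMAT)
    return (t1 - t0).total_seconds()


# "Removing a second node" -> "Removing a node finished"
removal_seconds = seconds_between("2024-11-09 09:08:12,621", "2024-11-09 09:26:53,243")

# "Waiting for tablets to be balanced" -> "Tablets are balanced"
balance_seconds = seconds_between("2024-11-09 09:25:51,026", "2024-11-09 09:26:53,243")

print(f"node removal:     {removal_seconds:.3f} s")
print(f"tablet balancing: {balance_seconds:.3f} s")
```

The first value agrees with the logged 1120.62 s, so most of the time went into the removal itself rather than the final tablet-balance wait.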

The final 3-node cluster has disk usage of 88%, 88%, and 89%.
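
These numbers are close to what simple redistribution would predict. A rough sketch, assuming all nodes have equal disk capacity, data is evenly balanced before and after, and ignoring the data added by the throttled write during the removal (the helper name is hypothetical):

```python
def expected_usage_after_scale_in(usage_pct, nodes_before, nodes_after):
    """Per-node disk usage (%) if the same data is spread over fewer equal-sized nodes."""
    return usage_pct * nodes_before / nodes_after


# 67% on each of 4 nodes, redistributed over 3 nodes
predicted = expected_usage_after_scale_in(67, 4, 3)
print(f"predicted per-node usage: {predicted:.1f}%")
```

The estimate lands at roughly 89%, in line with the observed 88% to 89%.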


https://argus.scylladb.com/tests/scylla-cluster-tests/b028c83a-47cf-404c-a649-2b714acba766
https://jenkins.scylladb.com/job/scylla-staging/job/LakshmipathiGanapathi/job/byo-longevity-test/209/consoleText

Lakshmipathi commented 1 week ago

@paszkow @pehala @swasik, in the description we mentioned 30 minutes of read-only time before the topology change, but the script currently performs neither reads nor writes during this period. Is that fine, or do we need read operations during the 30 minutes before scale-in?

swasik commented 1 week ago

I think that in this case it is fine. The objective of this one is to be a simple correctness test, so the absence of reads is not a problem.