scylladb / scylla-cluster-tests


Add testcase for scaling-in while 3-node cluster having 90% storage utilization #9131

Open pehala opened 3 weeks ago

pehala commented 3 weeks ago
paszkow commented 3 weeks ago

@pehala This scenario seems to be incorrect. Without deletes you will hit an out-of-space error once you scale in. I think we should rather aim at having 5 nodes at ~72% disk utilization and then scale in. As a result you will end up with 4 nodes at ~90%.
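
(For reference, the suggested numbers follow from simple redistribution arithmetic; the check below assumes perfectly even tablet balancing and identical disk sizes.)

```python
# Back-of-envelope check of the suggested variant: data from 5 nodes at ~72%
# utilization, redistributed over the remaining 4 nodes after scale-in.
nodes_before, nodes_after, util_before = 5, 4, 0.72
util_after = util_before * nodes_before / nodes_after
print(f"expected utilization after scale-in: {util_after:.0%}")  # -> 90%
```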

swasik commented 3 weeks ago

> @pehala This scenario seems to be incorrect. Without deletes you will hit an out-of-space error once you scale in. I think we should rather aim at having 5 nodes at ~72% disk utilization and then scale in. As a result you will end up with 4 nodes at ~90%.

But in this scenario you perform a scale-out before the scale-in. So, if I understand correctly, it is: add node 4, then remove node 3, so in practice node 3 is swapped for node 4.

Lakshmipathi commented 2 weeks ago

I updated this description a bit. Based on the suggestion in the test plan document, we have two variants for scale-in: a) 3-node cluster scale-in at 90%; b) 4-node cluster scale-in at 67%.

For the 3-node cluster scale-in at 90%: add a new node and, once tablet migration has completed, drop 20% of the data from the cluster, then scale in by removing a node. For the 4-node cluster scale-in at 67%: we scale in by removing a node; after tablet migration, the cluster will be at around 90% storage utilization.
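
As a back-of-envelope model of what these two variants should produce (an editorial sketch that assumes perfectly even tablet distribution, identical disks, and ignores any writes issued during the operations), the expected per-node utilization works out roughly as follows:

```python
def redistribute(util: float, nodes_before: int, nodes_after: int) -> float:
    """Average per-node utilization after the same data set is spread over nodes_after nodes."""
    return util * nodes_before / nodes_after

# Variant (a): 3 nodes at 90% -> add a node -> drop 20% of the data -> remove a node.
u = redistribute(0.90, 3, 4)   # ~0.68 after scale-out
u *= 0.80                      # ~0.54 after dropping 20% of the data
u = redistribute(u, 4, 3)      # ~0.72 after scale-in back to 3 nodes
print(f"variant (a) expected final utilization: {u:.0%}")

# Variant (b): 4 nodes at 67% -> remove a node.
print(f"variant (b) expected final utilization: {redistribute(0.67, 4, 3):.0%}")  # ~89%
```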

Lakshmipathi commented 2 weeks ago

Reached 92% disk usage and started waiting for 30 minutes, with no writes or reads.

< t:2024-11-05 09:36:49,314 f:full_storage_utilization_test.py l:121  c:FullStorageUtilizationTest p:INFO  > Current max disk usage after writing to keyspace10: 92% (398 GB / 392.40000000000003 GB)
< t:2024-11-05 09:36:50,342 f:full_storage_utilization_test.py l:87   c:FullStorageUtilizationTest p:INFO  > Wait for 1800 seconds
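
For context, the reported figure is the maximum usage across the cluster's nodes. A minimal sketch of how such a value can be derived is shown below; `ssh_run` is a hypothetical remote-execution helper, not the actual SCT API.

```python
def max_disk_usage_percent(nodes, ssh_run, data_dir="/var/lib/scylla"):
    """Return the highest data-directory filesystem usage (in %) across the given nodes."""
    usages = []
    for node in nodes:
        # `df --output=pcent <dir>` prints a header line and then e.g. " 92%".
        out = ssh_run(node, f"df --output=pcent {data_dir} | tail -n 1")
        usages.append(int(out.strip().rstrip("%")))
    return max(usages)
```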

After the 30-minute idle time, started a throttled write:

< t:2024-11-05 10:08:01,521 f:stress_thread.py l:325  c:sdcm.stress_thread   p:INFO  > cassandra-stress write no-warmup duration=30m -rate threads=10 "throttle=1400/s" -mode cql3 native -pop seq=1..5000000 -col "size=FIXED(10240) n=FIXED(1)" -schema keyspace=keyspace1 "replication(strategy=NetworkTopologyStrategy,replication_factor=3)" -node 10.4.1.62,10.4.3.97,10.4.1.100 -errors skip-unsupported-columns

Scale-out by adding a new node at 90%:

< t:2024-11-05 10:09:57,086 f:full_storage_utilization_test.py l:35   c:FullStorageUtilizationTest p:INFO  > Adding a new node
< t:2024-11-05 10:12:55,534 f:common.py       l:43   c:sdcm.utils.tablets.common p:INFO  > Waiting for tablets to be balanced
< t:2024-11-05 10:40:55,031 f:common.py       l:48   c:sdcm.utils.tablets.common p:INFO  > Tablets are balanced
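
The balancing wait above took roughly 28 minutes. As a generic illustration (not the actual implementation in sdcm.utils.tablets.common), such a wait is usually a poll-with-timeout loop over some balance predicate:

```python
import time

def wait_for_tablets_balanced(is_balanced, timeout_s=3600, poll_s=60):
    """Poll the `is_balanced` predicate (hypothetical) until it returns True or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if is_balanced():
            return
        time.sleep(poll_s)
    raise TimeoutError("tablets were not balanced within the timeout")
```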

Later, dropping some data before the scale-in:

< t:2024-11-05 10:40:55,031 f:full_storage_utilization_test.py l:48   c:FullStorageUtilizationTest p:INFO  > Dropping some data
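
A hedged sketch of what "dropping some data" can look like: the logs suggest the fill data is spread across similarly sized keyspaces (keyspace1 … keyspace10), so dropping two of ten removes roughly 20%. `session` is assumed to be a cassandra-driver Session; the real test may do this differently.

```python
def drop_data_fraction(session, keyspaces, fraction=0.20):
    """Drop enough whole keyspaces to remove roughly `fraction` of the data set."""
    to_drop = keyspaces[: max(1, round(len(keyspaces) * fraction))]
    for ks in to_drop:
        session.execute(f"DROP KEYSPACE IF EXISTS {ks}")
    return to_drop
```

Note that if auto_snapshot is enabled, the dropped sstables remain on disk inside snapshots until those are cleared, so disk usage may not fall immediately.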

A few minutes later, removing a node to return to a 3-node cluster:

< t:2024-11-05 10:41:00,079 f:full_storage_utilization_test.py l:40   c:FullStorageUtilizationTest p:INFO  > Removing a node
< t:2024-11-05 10:41:00,080 f:full_storage_utilization_test.py l:133  c:FullStorageUtilizationTest p:INFO  > Removing a second node from the cluster
< t:2024-11-05 10:41:00,080 f:full_storage_utilization_test.py l:135  c:FullStorageUtilizationTest p:INFO  > Node to be removed: df-test-master-db-node-1ffa6d64-2
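
For reference, the scale-in itself is typically a decommission of the chosen node, which streams its tablets to the remaining replicas before the node leaves the cluster. A minimal sketch, reusing the hypothetical `ssh_run` helper from above (not the exact SCT call):

```python
def scale_in(node, remaining_nodes, ssh_run):
    """Decommission `node` (its data is streamed off first), then report
    disk usage on the nodes that stay in the cluster."""
    ssh_run(node, "nodetool decommission")
    return {n: ssh_run(n, "df --output=pcent /var/lib/scylla | tail -n 1").strip()
            for n in remaining_nodes}
```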

Tablet migration over time [image]

Max/avg disk utilization [image]

99th percentile write and read latency by cluster (max, at 90% disk utilization):

| operation | p99 latency |
| --- | --- |
| write | 1.79 ms |
| read | 3.58 ms |

The final 3-node cluster has disk usage of 92%, 91%, and 87%.

https://argus.scylladb.com/tests/scylla-cluster-tests/1ffa6d64-004a-4443-a3c9-d52a18ea08e1

swasik commented 2 weeks ago

> The final 3-node cluster has disk usage of 92%, 91%, and 87%.

But if we drop 20% of the data as suggested in the test plan, shouldn't we get ca. 70% here? It was incorrectly stated in the doc - I fixed it. The idea behind it is to simulate a scenario where we lose plenty of data and, because of it, we can scale in to save resources.

pehala commented 1 day ago

> But if we drop 20% of the data as suggested in the test plan, shouldn't we get ca. 70% here? It was incorrectly stated in the doc - I fixed it. The idea behind it is to simulate a scenario where we lose plenty of data and, because of it, we can scale in to save resources.

@Lakshmipathi ping

Lakshmipathi commented 19 hours ago

> But if we drop 20% of the data as suggested in the test plan, shouldn't we get ca. 70% here?

@swasik, here is the flow for this case:

  1. In a 3-node cluster, we reached 92% disk usage.
  2. Wait for 30 minutes.
  3. Start a throttled write.
  4. Add a new node at 90%; total nodes in the cluster = 4.
  5. From the graph, we can see average disk usage drop after this operation.
  6. Wait for 30 minutes.
  7. Drop 20% of the data.
  8. Start a throttled write.
  9. Perform the scale-in.

If I'm not wrong, the throttled writes we run during the scaling operations (steps 3 and 8) contribute to additional disk usage. Let me add more graphs to this issue.
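
As a rough sanity check of that point, the extra data from a single throttled stress run can be estimated directly from the cassandra-stress parameters shown earlier (1400 ops/s, 10 KiB rows, 30 minutes, RF=3), ignoring compression and key/metadata overhead:

```python
# Back-of-envelope estimate of the data added by one 30-minute throttled stress run.
ops_per_sec, row_bytes, duration_s, rf = 1400, 10240, 1800, 3
extra_bytes = ops_per_sec * row_bytes * duration_s * rf
print(f"~{extra_bytes / 1e9:.0f} GB of extra data cluster-wide per run")  # ~77 GB
```

If each node has roughly 430 GB of usable space (the "92% (398 GB …)" log line above implies about that), two such runs add on the order of 10% per node, which would account for a good part of the gap between the ~72% suggested by pure redistribution arithmetic and the observed 87-92%.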