3-node cluster (instance type: i4i.large), scale-out at 90% disk utilization.
The cluster reached 91% disk usage and then waited idle for 30 minutes, with no writes or reads.
```
< t:2024-11-03 07:10:58,323 f:full_storage_utilization_test.py l:93 c:FullStorageUtilizationTest p:INFO > Current max disk usage after writing to keyspace10: 91% (396 GB / 392.40000000000003 GB)
< t:2024-11-03 07:10:59,353 f:full_storage_utilization_test.py l:58 c:FullStorageUtilizationTest p:INFO > Wait for 1800 seconds
```
After the 30-minute idle period, a throttled write was started:
```
< t:2024-11-03 07:42:10,941 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > stress_cmd=cassandra-stress write duration=30m -rate threads=10 "throttle=1400/s" -mode cql3 native -pop seq=1..5000000 -col "size=FIXED(10240) n=FIXED(1)" -schema "replication(strategy=NetworkTopologyStrategy,replication_factor=3)"
```
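For context, a rough back-of-the-envelope estimate of how much data this throttled phase adds, based only on the cassandra-stress parameters above (it ignores compression, commitlog and sstable overhead, so treat it as an approximation):

```python
# Rough estimate of the data volume added by the throttled write phase.
# Based on the cassandra-stress parameters above; ignores compression and
# sstable/commitlog overhead, so real on-disk growth will differ.
rate_ops_per_s = 1400        # throttle=1400/s
row_payload_bytes = 10240    # -col "size=FIXED(10240) n=FIXED(1)"
duration_s = 30 * 60         # duration=30m
replication_factor = 3       # NetworkTopologyStrategy, RF=3

payload_gb = rate_ops_per_s * row_payload_bytes * duration_s / 1e9
cluster_gb = payload_gb * replication_factor

print(f"payload written: ~{payload_gb:.0f} GB")             # ~26 GB
print(f"replicated across the cluster: ~{cluster_gb:.0f} GB")  # ~77 GB
```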
Scale-out by adding a new node at 90% disk utilization:
```
< t:2024-11-03 07:44:05,075 f:full_storage_utilization_test.py l:41 c:FullStorageUtilizationTest p:INFO > Adding a new node
```
After 30 minutes, the scaled-out (3 -> 4) cluster has disk usage of 75%, 74%, 75%, and 70%.
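A quick sanity check of those numbers, assuming tablets redistribute the existing data roughly evenly across identical nodes (a simplification that ignores compaction and the data still being written):

```python
# Sanity check: expected per-node usage after a 3 -> 4 scale-out,
# assuming roughly even tablet redistribution and identical node capacity.
# Ignores compaction and the extra data from the throttled write phase.
usage_before = 0.91          # per-node usage when the new node was added
nodes_before, nodes_after = 3, 4

expected_after = usage_before * nodes_before / nodes_after
print(f"expected per-node usage after rebalance: ~{expected_after:.0%}")  # ~68%
# Observed 70-75%: slightly higher, which is consistent with the throttled
# write continuing to add data while tablets migrate.
```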
[Chart: tablet migration over time]
[Chart: max/avg disk utilization]
99th percentile write and read latency by cluster (max, at 90% disk utilization):

operation | p99 latency |
---|---|
write | 3.07 ms |
read | 1.79 ms |
https://argus.scylladb.com/tests/scylla-cluster-tests/c5de2f39-770c-4cf3-8d8c-66fef9d91d87
> After 30 minutes, the scaled-out (3 -> 4) cluster has disk usage of 75%, 74%, 75%, and 70%.
But I see that the chart presents average disk usage. This should change quickly as we add more disk space, even if the new space is not used yet. Could you also add a picture of the maximal disk usage across all nodes?
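To illustrate the average-vs-maximum point: the moment an empty node joins, the average drops even though no data has moved yet, while the maximum only falls as tablets actually migrate. A toy example (the first two rows use made-up numbers, the last row uses the values reported in this test):

```python
# Toy illustration: average disk usage drops as soon as an empty node
# joins, while the maximum only drops once data actually migrates.
before = [91, 91, 91]               # 3 nodes near the 90% threshold (made up)
just_after_add = [91, 91, 91, 0]    # new node joined, no tablets moved yet (made up)
after_migration = [75, 74, 75, 70]  # values reported in this test

for label, usage in [("before", before),
                     ("just after add", just_after_add),
                     ("after migration", after_migration)]:
    print(f"{label:16s} avg={sum(usage) / len(usage):5.1f}%  max={max(usage)}%")
```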
The interesting fact is that after migration we have the same number of tablets on every node, but on the new node the disk utilization is ca. 5% lower. Maybe something has not been cleaned up yet. Could we wait a bit longer to see if the utilization evens out in the end?
Started a new job with a 1-hour wait just before the test ends. Will check and update whether the 5% lower disk usage persists.
@swasik After the scale-out, waited for 40 minutes and ensured there is 0% load on all nodes. Final disk usage is 66%, 69%, 71%, and 73%, so on average the newly added node has about 5% less disk usage than the other three nodes.
Could it be due to tablet imbalance?
> Could it be due to tablet imbalance?
I thought so too, but we have exactly the same number of tablets on each node and probably a near-linear distribution of the keyspace.
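One way to double-check the balance claim is to count tablet replicas per host directly. A minimal sketch, assuming the cluster exposes the `system.tablets` table with a `replicas` column of `(host_id, shard)` pairs (the column layout may differ between ScyllaDB versions, so adjust the query accordingly):

```python
# Hypothetical check: count tablet replicas per host from system.tablets.
# Assumes a `replicas` column holding (host_id, shard) pairs; adjust to the
# schema of your ScyllaDB version.
from collections import Counter
from cassandra.cluster import Cluster

cluster = Cluster(["<contact-point>"])  # replace with any node address
session = cluster.connect()

per_host = Counter()
for row in session.execute("SELECT replicas FROM system.tablets"):
    for host_id, _shard in row.replicas:
        per_host[host_id] += 1

for host_id, count in per_host.most_common():
    print(host_id, count)
```

If the counts come out equal, the remaining ~5% gap would point at on-disk artifacts (e.g. data not yet compacted or cleaned up) rather than tablet placement.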