scylladb / scylla-cluster-tests

Tests for Scylla Clusters

Decommission operation timeout is too small for DB cluster with enabled tablets feature #8855

Open vponomaryov opened 3 days ago

vponomaryov commented 3 days ago

Issue description

The disrupt_decommission_streaming_err nemesis times out in a test with tablets enabled:

2024-09-21 07:32:46.414: (ClusterHealthValidatorEvent Severity.CRITICAL) period_type=one-time \
    event_id=2d16b283-0ee6-4651-9862-d60efd70c2d3: type=NodeStatus \
    node=longevity-large-partitions-200k-pks-db-node-b866c292-0-2 \
    error=Current node Node longevity-large-partitions-200k-pks-db-node-b866c292-0-2 [34.148.211.120 | 10.142.0.128]. \
    Node Node longevity-large-partitions-200k-pks-db-node-b866c292-0-5 [35.185.64.144 | 10.142.0.139] \
    (DecommissionStreamingErr nemesis target node) status is UL

Argus: Screenshot from 2024-09-26 22-22-11

Judging by the cluster state, the decommission was progressing normally the whole time, right up until the timeout was reached.

So, I increased the timeout for it and ran another test here: scylla-staging/valerii/vp-longevity-large-partition-200k-pks-4days-gce-test#4

After the timeout increase, the nemesis passed: Screenshot from 2024-09-26 22-23-17

Note that in the passing run, the add-node operation that follows the decommission started at 11:39. So the decommission under heavy write load took about 2h15m, whereas the current timeout is 1h20m.
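To make the gap concrete, here is a small sketch of the arithmetic using the two durations quoted above. The helper name parse_duration and the SAFETY_MARGIN value are illustrative assumptions, not SCT code or an agreed-upon margin.

```python
import re
from datetime import timedelta


def parse_duration(text: str) -> timedelta:
    """Parse durations written like '2h15m' or '1h20m' (hypothetical helper)."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?", text)
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    hours, minutes = (int(group) if group else 0 for group in match.groups())
    return timedelta(hours=hours, minutes=minutes)


# Values taken from the issue description.
current_timeout = parse_duration("1h20m")
observed_duration = parse_duration("2h15m")

# How far the observed decommission overran the current timeout.
shortfall = observed_duration - current_timeout
ratio = observed_duration / current_timeout

# ASSUMPTION: a 1.5x margin over the observed duration; pick whatever
# margin fits the real workload.
SAFETY_MARGIN = 1.5
suggested_timeout = observed_duration * SAFETY_MARGIN

print(f"shortfall: {shortfall}, ratio: {ratio:.2f}, suggested: {suggested_timeout}")
```

With these inputs the decommission overran the timeout by 55 minutes, i.e. it needed roughly 1.7 times the currently allowed time.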

Steps to Reproduce

  1. Set up a DB cluster with tablets
  2. Run heavy write load with large partitions
  3. Run disrupt_decommission_streaming_err nemesis

Expected behavior: SCT waits long enough for the decommission to complete

Actual behavior: SCT raises a timeout error too early
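The expected behavior can be illustrated with a generic polling loop that keeps waiting while the node still reports UL (Up/Leaving) and only fails once a caller-supplied timeout expires. This is a minimal sketch, not SCT's implementation: wait_until_not_leaving, get_status, and the injectable clock and sleep parameters are all hypothetical names introduced for illustration.

```python
import time
from typing import Callable


def wait_until_not_leaving(
    get_status: Callable[[], str],
    timeout: float,
    poll_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll until the node no longer reports 'UL'; return elapsed seconds.

    get_status is a hypothetical callback returning the node's current
    status string. Raises TimeoutError if the node is still 'UL' once
    `timeout` seconds have passed.
    """
    start = clock()
    deadline = start + timeout
    while True:
        if get_status() != "UL":
            return clock() - start
        if clock() >= deadline:
            raise TimeoutError(f"node still in UL state after {timeout} seconds")
        sleep(poll_interval)
```

Passing the timeout as a parameter lets the caller scale it, for example for clusters with tablets enabled, instead of relying on a single fixed value.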

Impact

False negative

How frequently does it reproduce?

100%

Installation details

SCT Version: master
Scylla version (or git commit hash): master/6.3

Logs

fruch commented 14 hours ago

@aleksbykov

Isn't that something you already identified during this case? Has there been any work on fixing it?