scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

[RAFT] A new node init does not complete (test timeout after waiting for 1 hour) #7448

Closed: yarongilor closed this issue 4 months ago

yarongilor commented 5 months ago

Packages

Scylla version: 5.5.0~dev-20240515.7b41bb601c53 with build-id 411dbd445918e9235d0bd963183ec77729d09e12

Kernel Version: 5.15.0-1060-aws

Issue description

When a node bootstrap is interrupted under Raft topology, the node becomes banned and cannot rejoin the cluster until a cleanup is run (removing its host ID).
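The recovery described above (removing the banned node's stale host ID so it can rejoin) can be sketched roughly as follows. This is an illustrative helper, not SCT code: the function name and the command-runner callable are hypothetical, though `nodetool removenode` and `nodetool status` are real nodetool subcommands.

```python
# Hypothetical sketch of the cleanup flow: after an aborted bootstrap, the
# half-joined node's host ID must be removed before a retry can succeed.
# The `run` callable (executes one shell command on a live node) is an
# assumption of this sketch, not part of SCT or nodetool.
from typing import Callable, List


def recover_banned_node(run: Callable[[str], object], banned_host_id: str) -> List[str]:
    """Issue the cleanup commands via `run` and return them for inspection."""
    commands = [
        # Drop the stale identity of the aborted node from the cluster.
        f"nodetool removenode {banned_host_id}",
        # Verify the host ID is gone before retrying the bootstrap.
        "nodetool status",
    ]
    for cmd in commands:
        run(cmd)
    return commands
```

In a test, `run` can be a stub that only records the commands instead of executing them against a cluster.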

2024-05-16 10:28:33.282: (SoftTimeoutEvent Severity.ERROR) period_type=one-time event_id=875c02fd-91e3-4ed8-b64c-852b78b8ef47 during_nemesis=BootstrapStreamingError, source=SoftTimeout message=operation 'NEW_NODE' took 3619.353437900543s and soft-timeout was set to 3600s
Traceback (most recent call last):
File "/usr/local/lib/python3.10/threading.py", line 973, in _bootstrap
self._bootstrap_inner()
File "/usr/local/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/sct_events/decorators.py", line 26, in wrapper
return func(*args, **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 423, in run
self.disrupt()
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 6306, in disrupt
self.call_next_nemesis()
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1916, in call_next_nemesis
self.execute_disrupt_method(disrupt_method=next(self.disruptions_cycle))
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1842, in execute_disrupt_method
disrupt_method()
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5117, in wrapper
result = method(*args[1:], **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4973, in disrupt_bootstrap_streaming_error
bootstrapabortmanager.clean_and_restart_bootstrap_after_abort()
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/raft/common.py", line 145, in clean_and_restart_bootstrap_after_abort
with adaptive_timeout(operation=Operations.NEW_NODE, node=self.verification_node, timeout=3600) as bootstrap_timeout:
File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
self.gen.throw(typ, value, traceback)
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/adaptive_timeouts/__init__.py", line 112, in adaptive_timeout
SoftTimeoutEvent(operation=operation.name, soft_timeout=timeout, duration=duration).publish_or_dump()
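For context, a soft timeout of the kind reported above can be implemented as a context manager that measures wall time and publishes an event on exit, without aborting the operation itself. The following is a minimal sketch under that assumption, not SCT's actual `adaptive_timeout`; the `events` list stands in for the event-publishing machinery.

```python
# Minimal soft-timeout sketch (illustrative, not SCT code): measure wall
# time around a block and report an event if the block outlived the
# threshold. The block is never interrupted -- the timeout is "soft".
import time
from contextlib import contextmanager


@contextmanager
def soft_timeout(operation: str, timeout: float, events: list):
    start = time.monotonic()
    try:
        yield timeout  # callers may reuse the value as a hard ceiling elsewhere
    finally:
        # Runs whether the body finished or raised (cf. gen.throw in the trace).
        duration = time.monotonic() - start
        if duration > timeout:
            events.append(
                f"operation '{operation}' took {duration}s "
                f"and soft-timeout was set to {timeout}s")
```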

Impact


How frequently does it reproduce?


Installation details

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-03e1418be972f5f58 (aws: undefined_region)

Test: longevity-100gb-4h-test Test id: d6d9cb1d-7343-4283-b3b6-9e7e9dcf0cc3 Test name: scylla-master/tablets/longevity-100gb-4h-test Test config file(s):

Logs and commands - Restore Monitor Stack command: `$ hydra investigate show-monitor d6d9cb1d-7343-4283-b3b6-9e7e9dcf0cc3` - Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=d6d9cb1d-7343-4283-b3b6-9e7e9dcf0cc3) - Show all stored logs command: `$ hydra investigate show-logs d6d9cb1d-7343-4283-b3b6-9e7e9dcf0cc3` ## Logs: - **db-cluster-d6d9cb1d.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/d6d9cb1d-7343-4283-b3b6-9e7e9dcf0cc3/20240516_134322/db-cluster-d6d9cb1d.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/d6d9cb1d-7343-4283-b3b6-9e7e9dcf0cc3/20240516_134322/db-cluster-d6d9cb1d.tar.gz) - **sct-runner-events-d6d9cb1d.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/d6d9cb1d-7343-4283-b3b6-9e7e9dcf0cc3/20240516_134322/sct-runner-events-d6d9cb1d.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/d6d9cb1d-7343-4283-b3b6-9e7e9dcf0cc3/20240516_134322/sct-runner-events-d6d9cb1d.tar.gz) - **sct-d6d9cb1d.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/d6d9cb1d-7343-4283-b3b6-9e7e9dcf0cc3/20240516_134322/sct-d6d9cb1d.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/d6d9cb1d-7343-4283-b3b6-9e7e9dcf0cc3/20240516_134322/sct-d6d9cb1d.log.tar.gz) - **loader-set-d6d9cb1d.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/d6d9cb1d-7343-4283-b3b6-9e7e9dcf0cc3/20240516_134322/loader-set-d6d9cb1d.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/d6d9cb1d-7343-4283-b3b6-9e7e9dcf0cc3/20240516_134322/loader-set-d6d9cb1d.tar.gz) - **monitor-set-d6d9cb1d.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/d6d9cb1d-7343-4283-b3b6-9e7e9dcf0cc3/20240516_134322/monitor-set-d6d9cb1d.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/d6d9cb1d-7343-4283-b3b6-9e7e9dcf0cc3/20240516_134322/monitor-set-d6d9cb1d.tar.gz) - **parallel-timelines-report-d6d9cb1d.tar.gz** - 
[https://cloudius-jenkins-test.s3.amazonaws.com/d6d9cb1d-7343-4283-b3b6-9e7e9dcf0cc3/20240516_134322/parallel-timelines-report-d6d9cb1d.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/d6d9cb1d-7343-4283-b3b6-9e7e9dcf0cc3/20240516_134322/parallel-timelines-report-d6d9cb1d.tar.gz) [Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/tablets/job/longevity-100gb-4h-test/13/) [Argus](https://argus.scylladb.com/test/bfb2ef96-b741-43ea-9e7d-82e652c0ebdc/runs?additionalRuns[]=d6d9cb1d-7343-4283-b3b6-9e7e9dcf0cc3)
yarongilor commented 5 months ago

@fruch , @bhalevy , any suggestion on how to address this? The question is whether to increase the add-node timeout by a few minutes, or to take some other approach.

fruch commented 5 months ago

> @fruch , @bhalevy , any suggestion how to address that? it is about either increase add-node timeout in few minutes or any other alternative conception.

First things first: before anyone can say we need to change SCT timing for this operation, we need more information about when it's happening. I see this case is using tablets, so did you compare it to cases without tablets? Maybe it's some regression related to tablets (or something else)?

We need to see when this started, and what the actual root cause is. In this case we have 100 GB of data; I don't see why adding a new node would take a whole hour.

fruch commented 5 months ago

this is the dashboard of those operations in the last 90 days: https://70f106c98484448dbc4705050eb3f7e9.us-east-1.aws.found.io:9243/goto/3194a650-15c9-11ef-81c7-3986d18dafd5

I think most of them map to those tablets runs.

bhalevy commented 5 months ago

Which node was it? What does clean_and_restart_bootstrap_after_abort do exactly? Maybe it has pending migrations in progress that linger after the abort. If the new node restarts with a new identity, we should remove the previous identity with `nodetool removenode`. Cc @tgrabiec

yarongilor commented 5 months ago

> this is the dashboard of those operations in the last 90 days: https://70f106c98484448dbc4705050eb3f7e9.us-east-1.aws.found.io:9243/goto/3194a650-15c9-11ef-81c7-3986d18dafd5
>
> I think most of them maps to those tablets runs

Thanks @fruch , some outputs: the statistics show that the add-node average duration for this test is 200-400 seconds. There seems to be a significant degradation on runs with tablets during roughly the last month: 10 runs exceeded a duration of 3600 seconds:

operation | duration (s) | scylla_version
-- | -- | --
[past versions worst results]
NEW_NODE | 1317.537 | 5.3.0~dev-0.20230405.f5de0582c8f8 with build-id 2fb65b5e36aee672ae31718200203f2f9a78eb6b
NEW_NODE | 1212.539 | 5.3.0~dev-0.20230407.c65bd01174c1 with build-id 30cf2f853017f7b98744f6ca55e1df0779aa6835
NEW_NODE | 1386.438 | 2023.2.0~dev-0.20230404.85ad4482e7af with build-id c0ff5a80b9d10766ba9d6d681fc9893503c8f3f9
NEW_NODE | 1027.701 | 5.3.0~dev-0.20230405.f5de0582c8f8 with build-id 2fb65b5e36aee672ae31718200203f2f9a78eb6b
NEW_NODE | 1434.919 | 5.3.0~dev-0.20230407.c65bd01174c1 with build-id 30cf2f853017f7b98744f6ca55e1df0779aa6835
NEW_NODE | 1501.489 | 5.3.0~dev-0.20230407.c65bd01174c1 with build-id 30cf2f853017f7b98744f6ca55e1df0779aa6835
NEW_NODE | 1620.543 | 5.3.0~dev-0.20230415.1da02706ddb8 with build-id f7ac5cd90e63ace5065c583d6d1d9c381f39b5c2

[last month worst results]
NEW_NODE | 3620.362 | 5.5.0~dev-0.20240419.a5dae74aee4b with build-id 13e359437a7035b8b40bcd744a70b6abddbc491b
NEW_NODE | 3626.187 | 5.5.0~dev-0.20240321.900b56b117a1 with build-id b3a288bc50af2ce9c89155e61be692df05c17bf1
NEW_NODE | 3620.474 | 2024.2.0~dev-0.20240311.e59273035524 with build-id 9fb40ab71eca07f7202bb20528de3e0b83eaa0fa
NEW_NODE | 3620.25 | 5.5.0~dev-0.20240324.6bdb456fadfe with build-id 61f172ff915d5dd702951a0b7c0e83ac33712635
NEW_NODE | 3624.812 | 5.5.0~dev-0.20240329.885cb2af07b8 with build-id 4d1fc3fe8868b3d00a42f2b0b51f9953e9fa7346
NEW_NODE | 3624.183 | 5.5.0~dev-0.20240405.0c74c2c12fa4 with build-id ff77798ae700ea4f0d1bc44532624b330b89009a
NEW_NODE | 3617.742 | 5.5.0~dev-0.20240411.0be61e51d329 with build-id 54e018bb43eb2ae707aac127097747cd39532532
NEW_NODE | 3620.277 | 5.5.0~dev-0.20240501.af5674211dd4 with build-id aade02eeb28f730331d6170eaadf364591e0a779
NEW_NODE | 3619.354 | 5.5.0~dev-0.20240515.7b41bb601c53 with build-id 411dbd445918e9235d0bd963183ec77729d09e12
NEW_NODE | 3623.813 | 5.5.0~dev-0.20240517.db70e8dd5f6d with build-id 0a72505be9d1e39577898f683b53c462d3844ad8

Is this a known issue? Should I check whether something changed in SCT, or report a Scylla issue?

fruch commented 5 months ago

> thanks @fruch , some outputs: basically, the statistics shows add-node average duration for this test is 200 - 400 seconds. [...]
>
> Is it a known issue? should i check if something is changed in SCT? or report a scylla issue?

Cross-check that those problematic runs are with tablets enabled; if so, then yes, a Scylla issue should be raised for it.

yarongilor commented 5 months ago

> cross check if those problematic runs are with tablets enabled, so yes should raise a scylla issue for it.

The issue reproduced in the exact same scenario but without tablets. Opened https://github.com/scylladb/scylladb/issues/18802

yarongilor commented 5 months ago

from 18802:

> Opened documentation issue: #18850
>
> Other than that, the SCT nemeses will have to be adjusted to account for node ban. @yarongilor @fruch
>
> We similarly adjusted dtests, see for example: #17385, scylladb/scylla-dtest#3959

@aleksbykov , please mention your SCT fix PR for this issue.

yarongilor commented 5 months ago

@fruch , isn't this message wrong, and misleading in ES as well: `message=operation 'NEW_NODE' took 3619.353437900543s and soft-timeout was set to 3600s`? Do we want to fix this as well, so that ES would not be falsely reported for it?

fruch commented 5 months ago

> @fruch , isn't this message wrong and misleading ES as well: message=operation 'NEW_NODE' took 3619.353437900543s and soft-timeout was set to 3600s ? do we want to fix this as well? so that ES would not be falsely reported for that?

What's misleading here, exactly?

yarongilor commented 5 months ago

> @fruch , isn't this message wrong and misleading ES as well: message=operation 'NEW_NODE' took 3619.353437900543s and soft-timeout was set to 3600s ? do we want to fix this as well? so that ES would not be falsely reported for that?
>
> What's misleading here exactly ?

`operation 'NEW_NODE' took 3619.353437900543s` is not really true (or is meaningless): the bootstrap never ended.
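One way to see why the reported number is only a waiting time: if the timer's exit path runs in a `finally` block, a duration is recorded even when the wrapped operation raises (compare the `gen.throw` frame in the traceback above). A minimal illustration under that assumption, not SCT code:

```python
# Demonstrates that a finally-based timer records elapsed time even when
# the wrapped operation fails, so "took N seconds" can mean "was still
# unfinished after N seconds". Illustrative sketch, not SCT code.
import time
from contextlib import contextmanager


@contextmanager
def timed(records: list):
    start = time.monotonic()
    try:
        yield
    finally:
        # Recorded even if the body raised: the number measures time until
        # the failure, not a completed operation.
        records.append(time.monotonic() - start)


records = []
try:
    with timed(records):
        time.sleep(0.02)
        raise RuntimeError("bootstrap aborted")  # the operation never completes
except RuntimeError:
    pass
```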

yarongilor commented 4 months ago

Assumed to be fixed by Alex.

yarongilor commented 3 months ago

Still reproduced (the SCT fix PR is not merged yet):

2024-07-23 19:51:16.889: (SoftTimeoutEvent Severity.ERROR) period_type=one-time event_id=b2006713-cdb2-486f-a2de-4b3d1bcd3ab9 during_nemesis=BootstrapStreamingError, source=SoftTimeout message=operation 'NEW_NODE' took 3606.5069193840027s and soft-timeout was set to 3600s
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/threading.py", line 973, in _bootstrap
    self._bootstrap_inner()
  File "/usr/local/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/sct_events/decorators.py", line 26, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 425, in run
    self.disrupt()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 6373, in disrupt
    self.call_next_nemesis()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1953, in call_next_nemesis
    self.execute_disrupt_method(disrupt_method=next(self.disruptions_cycle))
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1879, in execute_disrupt_method
    disrupt_method()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5213, in wrapper
    result = method(*args[1:], **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5045, in disrupt_bootstrap_streaming_error
    bootstrapabortmanager.clean_and_restart_bootstrap_after_abort()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/raft/common.py", line 146, in clean_and_restart_bootstrap_after_abort
    with ignore_raft_topology_cmd_failing(), ignore_raft_transport_failing(), \
  File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/adaptive_timeouts/__init__.py", line 112, in adaptive_timeout
    SoftTimeoutEvent(operation=operation.name, soft_timeout=timeout, duration=duration).publish_or_dump()

Packages

Scylla version: 2024.2.0~rc1-20240715.9bbb8c1483d7 with build-id 963ed7f524fe09c74a07461967e282e2798783af

Kernel Version: 5.15.0-1064-aws

Issue description


Impact


How frequently does it reproduce?


Installation details

Cluster size: 4 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0ec01ef14177e4d38 (aws: undefined_region)

Test: longevity-alternator-1h-scan-12h-ttl-no-lwt-2h-grace-4loaders-sisyphus-test Test id: 0e9dbb39-d5cb-45d8-8d84-97b1dc077838 Test name: enterprise-2024.2/alternator_tablets/longevity-alternator-1h-scan-12h-ttl-no-lwt-2h-grace-4loaders-sisyphus-test Test config file(s):

Logs and commands - Restore Monitor Stack command: `$ hydra investigate show-monitor 0e9dbb39-d5cb-45d8-8d84-97b1dc077838` - Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=0e9dbb39-d5cb-45d8-8d84-97b1dc077838) - Show all stored logs command: `$ hydra investigate show-logs 0e9dbb39-d5cb-45d8-8d84-97b1dc077838` ## Logs: - **db-cluster-0e9dbb39.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/0e9dbb39-d5cb-45d8-8d84-97b1dc077838/20240724_100515/db-cluster-0e9dbb39.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0e9dbb39-d5cb-45d8-8d84-97b1dc077838/20240724_100515/db-cluster-0e9dbb39.tar.gz) - **email_data-0e9dbb39.json.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/0e9dbb39-d5cb-45d8-8d84-97b1dc077838/20240724_100515/email_data-0e9dbb39.json.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0e9dbb39-d5cb-45d8-8d84-97b1dc077838/20240724_100515/email_data-0e9dbb39.json.tar.gz) - **output-0e9dbb39.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/0e9dbb39-d5cb-45d8-8d84-97b1dc077838/20240724_100515/output-0e9dbb39.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0e9dbb39-d5cb-45d8-8d84-97b1dc077838/20240724_100515/output-0e9dbb39.log.tar.gz) - **debug-0e9dbb39.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/0e9dbb39-d5cb-45d8-8d84-97b1dc077838/20240724_100515/debug-0e9dbb39.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0e9dbb39-d5cb-45d8-8d84-97b1dc077838/20240724_100515/debug-0e9dbb39.log.tar.gz) - **events-0e9dbb39.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/0e9dbb39-d5cb-45d8-8d84-97b1dc077838/20240724_100515/events-0e9dbb39.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0e9dbb39-d5cb-45d8-8d84-97b1dc077838/20240724_100515/events-0e9dbb39.log.tar.gz) - **normal-0e9dbb39.log.tar.gz** - 
[https://cloudius-jenkins-test.s3.amazonaws.com/0e9dbb39-d5cb-45d8-8d84-97b1dc077838/20240724_100515/normal-0e9dbb39.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0e9dbb39-d5cb-45d8-8d84-97b1dc077838/20240724_100515/normal-0e9dbb39.log.tar.gz) - **argus-0e9dbb39.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/0e9dbb39-d5cb-45d8-8d84-97b1dc077838/20240724_100515/argus-0e9dbb39.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0e9dbb39-d5cb-45d8-8d84-97b1dc077838/20240724_100515/argus-0e9dbb39.log.tar.gz) - **raw_events-0e9dbb39.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/0e9dbb39-d5cb-45d8-8d84-97b1dc077838/20240724_100515/raw_events-0e9dbb39.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0e9dbb39-d5cb-45d8-8d84-97b1dc077838/20240724_100515/raw_events-0e9dbb39.log.tar.gz) - **critical-0e9dbb39.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/0e9dbb39-d5cb-45d8-8d84-97b1dc077838/20240724_100515/critical-0e9dbb39.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0e9dbb39-d5cb-45d8-8d84-97b1dc077838/20240724_100515/critical-0e9dbb39.log.tar.gz) - **warning-0e9dbb39.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/0e9dbb39-d5cb-45d8-8d84-97b1dc077838/20240724_100515/warning-0e9dbb39.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0e9dbb39-d5cb-45d8-8d84-97b1dc077838/20240724_100515/warning-0e9dbb39.log.tar.gz) - **summary-0e9dbb39.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/0e9dbb39-d5cb-45d8-8d84-97b1dc077838/20240724_100515/summary-0e9dbb39.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0e9dbb39-d5cb-45d8-8d84-97b1dc077838/20240724_100515/summary-0e9dbb39.log.tar.gz) - **left_processes-0e9dbb39.log.tar.gz** - 
[https://cloudius-jenkins-test.s3.amazonaws.com/0e9dbb39-d5cb-45d8-8d84-97b1dc077838/20240724_100515/left_processes-0e9dbb39.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0e9dbb39-d5cb-45d8-8d84-97b1dc077838/20240724_100515/left_processes-0e9dbb39.log.tar.gz) - **error-0e9dbb39.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/0e9dbb39-d5cb-45d8-8d84-97b1dc077838/20240724_100515/error-0e9dbb39.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0e9dbb39-d5cb-45d8-8d84-97b1dc077838/20240724_100515/error-0e9dbb39.log.tar.gz) - **sct-0e9dbb39.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/0e9dbb39-d5cb-45d8-8d84-97b1dc077838/20240724_100515/sct-0e9dbb39.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0e9dbb39-d5cb-45d8-8d84-97b1dc077838/20240724_100515/sct-0e9dbb39.log.tar.gz) - **loader-set-0e9dbb39.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/0e9dbb39-d5cb-45d8-8d84-97b1dc077838/20240724_100515/loader-set-0e9dbb39.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0e9dbb39-d5cb-45d8-8d84-97b1dc077838/20240724_100515/loader-set-0e9dbb39.tar.gz) - **monitor-set-0e9dbb39.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/0e9dbb39-d5cb-45d8-8d84-97b1dc077838/20240724_100515/monitor-set-0e9dbb39.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0e9dbb39-d5cb-45d8-8d84-97b1dc077838/20240724_100515/monitor-set-0e9dbb39.tar.gz) [Jenkins job URL](https://jenkins.scylladb.com/job/enterprise-2024.2/job/alternator_tablets/job/longevity-alternator-1h-scan-12h-ttl-no-lwt-2h-grace-4loaders-sisyphus-test/2/) [Argus](https://argus.scylladb.com/test/cc3a6940-a949-4765-b703-d1e31fa5b208/runs?additionalRuns[]=0e9dbb39-d5cb-45d8-8d84-97b1dc077838)