scylladb / scylla-manager

The Scylla Manager
https://manager.docs.scylladb.com/stable/

Investigate ubuntu22-sanity job instability #3865

Closed mikliapko closed 4 weeks ago

mikliapko commented 1 month ago

Several timeout failures were observed in the latest runs, all in checks that wait for a backup task to reach a given status:

https://jenkins.scylladb.com/view/scylla-manager/job/manager-master/job/ubuntu22-sanity-test/446/

22:10:20  sdcm.exceptions.WaitForTimeoutError: Wait for: Waiting until task: backup/2660600c-0166-448a-a0aa-3e3b7315cc37 reaches status of: ['STOPPED']: timeout - 60 seconds - expired
22:10:20  The above exception was the direct cause of the following exception:
22:10:20  Traceback (most recent call last):
22:10:20  File "/home/ubuntu/scylla-cluster-tests/mgmt_cli_test.py", line 1067, in test_suspend_and_resume
22:10:20  self._template_suspend_with_on_resume_start_tasks_flag(wait_for_duration=True)
22:10:20  File "/home/ubuntu/scylla-cluster-tests/mgmt_cli_test.py", line 1110, in _template_suspend_with_on_resume_start_tasks_flag
22:10:20  assert suspendable_task.wait_for_status(list_status=[TaskStatus.STOPPED], timeout=60, step=2), \
22:10:20  File "/home/ubuntu/scylla-cluster-tests/sdcm/mgmt/cli.py", line 405, in wait_for_status
22:10:20  raise WaitForTimeoutError(
22:10:20  sdcm.exceptions.WaitForTimeoutError: Failed on waiting until task: backup/2660600c-0166-448a-a0aa-3e3b7315cc37 reaches status of ['STOPPED']: current task status DONE: Wait for: Waiting until task: backup/2660600c-0166-448a-a0aa-3e3b7315cc37 reaches status of: ['STOPPED']: timeout - 60 seconds - expired

https://jenkins.scylladb.com/view/scylla-manager/job/manager-master/job/ubuntu22-sanity-test/449/

22:19:28  sdcm.exceptions.WaitForTimeoutError: Wait for: Waiting until task: backup/ce2e4509-146d-4a20-ac4a-7c77b1b09606 reaches status of: ['RUNNING']: timeout - 300 seconds - expired
22:19:28  The above exception was the direct cause of the following exception:
22:19:28  Traceback (most recent call last):
22:19:28  File "/home/ubuntu/scylla-cluster-tests/mgmt_cli_test.py", line 1067, in test_suspend_and_resume
22:19:28  self._template_suspend_with_on_resume_start_tasks_flag(wait_for_duration=True)
22:19:28  File "/home/ubuntu/scylla-cluster-tests/mgmt_cli_test.py", line 1106, in _template_suspend_with_on_resume_start_tasks_flag
22:19:28  assert suspendable_task.wait_for_status(list_status=[TaskStatus.RUNNING], timeout=300, step=5), \
22:19:28  File "/home/ubuntu/scylla-cluster-tests/sdcm/mgmt/cli.py", line 405, in wait_for_status
22:19:28  raise WaitForTimeoutError(
22:19:28  sdcm.exceptions.WaitForTimeoutError: Failed on waiting until task: backup/ce2e4509-146d-4a20-ac4a-7c77b1b09606 reaches status of ['RUNNING']: current task status DONE: Wait for: Waiting until task: backup/ce2e4509-146d-4a20-ac4a-7c77b1b09606 reaches status of: ['RUNNING']: timeout - 300 seconds - expired

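In both runs the backup task had already reached DONE when the waiter timed out, so the expected status (STOPPED in run 446, RUNNING in run 449) was never observed. The tests should be fixed to eliminate these failures.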
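Both tracebacks above report `current task status DONE`, which suggests the waiter kept polling for up to 60 or 300 seconds for a status the task could no longer reach. A minimal sketch of a waiter that fails fast in that situation is shown below. It is not the `sdcm.mgmt.cli` implementation: the function signature, the `UnexpectedTerminalStatus` exception, and the `TERMINAL_STATUSES` set are all hypothetical names introduced for illustration, and the set of terminal statuses is an assumption.

```python
import time
from typing import Callable, Iterable


class UnexpectedTerminalStatus(Exception):
    """Raised when the task finishes in a status the caller was not waiting for."""


# Assumption: statuses from which the task will not move on by itself.
# The real set should come from Scylla Manager / sdcm, not from this sketch.
TERMINAL_STATUSES = frozenset({"DONE", "ERROR"})


def wait_for_status(
    get_status: Callable[[], str],
    expected: Iterable[str],
    timeout: float,
    step: float,
    terminal: Iterable[str] = TERMINAL_STATUSES,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_status() until it returns one of `expected`.

    Raises UnexpectedTerminalStatus as soon as the task reaches a terminal
    status that is not expected, instead of waiting for the full timeout.
    Raises TimeoutError if the deadline passes first.
    """
    expected = set(expected)
    terminal = set(terminal)
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status in expected:
            return status
        if status in terminal:
            raise UnexpectedTerminalStatus(
                f"task reached {status} while waiting for {sorted(expected)}"
            )
        if clock() >= deadline:
            raise TimeoutError(
                f"timeout of {timeout} seconds expired while waiting for "
                f"{sorted(expected)}; last status was {status}"
            )
        sleep(step)
```

A waiter like this would not remove the underlying race (the backup finishing before the test suspends or observes it), but it would report the real cause immediately rather than after the full timeout.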

mikliapko commented 1 month ago

4 of the 5 latest jobs failed, so I think we need to look into this ASAP: https://jenkins.scylladb.com/view/scylla-manager/job/manager-master/job/ubuntu22-sanity-test/