manager backup nemesis [timeout - 54000 seconds - expired]

roydahan commented 2 years ago

@ShlomiBalalis , can you please advice - i see this error now in master 200gb longevity:


MgmtBackup | 1 | nodestartendduration | node | start | end | duration | nodestartenddurationerrorNode longevity-200gb-48h-verify-limited--db-node-22e086a5-4 [13.49.65.248 \| 10.0.0.221] (seed: False)2021-11-09 11:44:342021-11-10 02:45:0054025RetryError[Wait for: Waiting until task: backup/ca00591c-3359-4dc2-bfb1-ff4b2ef1de03 reaches status of: ['ERROR (4/4)', 'DONE']: timeout - 54000 seconds - expired] | node | start | end | duration | error | Node longevity-200gb-48h-verify-limited--db-node-22e086a5-4 [13.49.65.248 \| 10.0.0.221] (seed: False) | 2021-11-09 11:44:34 | 2021-11-10 02:45:00 | 54025 | RetryError[Wait for: Waiting until task: backup/ca00591c-3359-4dc2-bfb1-ff4b2ef1de03 reaches status of: ['ERROR (4/4)', 'DONE']: timeout - 54000 seconds - expired]
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --

Node longevity-200gb-48h-verify-limited--db-node-22e086a5-4 [13.49.65.248 \| 10.0.0.221] (seed: False) | 2021-11-09 11:44:34 | 2021-11-10 02:45:00 | 54025 | RetryError[Wait for: Waiting until task: backup/ca00591c-3359-4dc2-bfb1-ff4b2ef1de03 reaches status of: ['ERROR (4/4)', 'DONE']: timeout - 54000 seconds - expired]

is it a different issue?

Installation details Kernel version: 5.4.0-1035-aws Scylla version (or git commit hash): 4.6.dev-0.20211107.4950ce539 with build-id 8d7a7f85964575a8775f8358cef3ff74141e07fd Cluster size: 4 nodes (i3.4xlarge) Scylla running with shards number (live nodes): longevity-200gb-48h-verify-limited--db-node-22e086a5-1 (13.53.161.246 | 10.0.3.162): 14 shards longevity-200gb-48h-verify-limited--db-node-22e086a5-3 (13.53.123.18 | 10.0.0.75): 14 shards longevity-200gb-48h-verify-limited--db-node-22e086a5-4 (13.49.65.248 | 10.0.0.221): 14 shards longevity-200gb-48h-verify-limited--db-node-22e086a5-5 (16.170.159.255 | 10.0.0.232): 14 shards Scylla running with shards number (terminated nodes): longevity-200gb-48h-verify-limited--db-node-22e086a5-2 (13.53.122.101 | 10.0.1.223): 14 shards OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0ac0c6c9087bf4b09 (aws: eu-north-1)

Test: longevity-200gb-48h Test name: longevity_test.LongevityTest.test_custom_time Test config file(s):

longevity-200GB-48h-verifier-LimitedMonkey-tls.yaml

Issue description

====================================

PUT ISSUE DESCRIPTION HERE

====================================

Restore Monitor Stack command: $ hydra investigate show-monitor 22e086a5-c3b0-495f-b061-0626f4356154 Restore monitor on AWS instance using Jenkins job Show all stored logs command: $ hydra investigate show-logs 22e086a5-c3b0-495f-b061-0626f4356154

Test id: 22e086a5-c3b0-495f-b061-0626f4356154

Logs: db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/22e086a5-c3b0-495f-b061-0626f4356154/20211110_220939/db-cluster-22e086a5.tar.gz loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/22e086a5-c3b0-495f-b061-0626f4356154/20211110_220939/loader-set-22e086a5.tar.gz monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/22e086a5-c3b0-495f-b061-0626f4356154/20211110_220939/monitor-set-22e086a5.tar.gz sct-runner - https://cloudius-jenkins-test.s3.amazonaws.com/22e086a5-c3b0-495f-b061-0626f4356154/20211110_220939/sct-runner-22e086a5.tar.gz

Jenkins job URL

Originally posted by @yarongilor in https://github.com/scylladb/scylla-cluster-tests/issues/3902#issuecomment-969142502

juliayakovlev commented 2 years ago

Same problem here:

Installation details Kernel version: 5.4.0-1035-aws Scylla version (or git commit hash): 4.7.dev-0.20211118.4b1bb26d5 with build-id 85eacde12729cd0333f0835d801a5a6163a01276 Cluster size: 3 nodes (i3.2xlarge) Scylla running with shards number (live nodes): longevity-twcs-48h-master-db-node-74241d9e-1 (13.51.107.81 | 10.0.1.189): 8 shards longevity-twcs-48h-master-db-node-74241d9e-7 (13.51.13.220 | 10.0.0.162): 8 shards longevity-twcs-48h-master-db-node-74241d9e-9 (16.170.229.52 | 10.0.0.62): 8 shards Scylla running with shards number (terminated nodes): longevity-twcs-48h-master-db-node-74241d9e-2 (13.51.242.24 | 10.0.3.227): 8 shards longevity-twcs-48h-master-db-node-74241d9e-5 (13.48.26.59 | 10.0.1.129): 8 shards longevity-twcs-48h-master-db-node-74241d9e-4 (13.48.178.110 | 10.0.0.108): 8 shards longevity-twcs-48h-master-db-node-74241d9e-6 (13.53.126.163 | 10.0.1.188): 8 shards longevity-twcs-48h-master-db-node-74241d9e-8 (13.53.170.213 | 10.0.3.32): 8 shards longevity-twcs-48h-master-db-node-74241d9e-3 (13.51.6.232 | 10.0.1.204): 8 shards longevity-twcs-48h-master-db-node-74241d9e-10 (13.51.106.29 | 10.0.2.138): 8 shards OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0231564df82a982e2 (aws: eu-north-1)

Test: longevity-twcs-48h-test Test name: longevity_twcs_test.TWCSLongevityTest.test_custom_time Test config file(s):

longevity-twcs-48h.yaml

Issue description

====================================

PUT ISSUE DESCRIPTION HERE

====================================

Restore Monitor Stack command: $ hydra investigate show-monitor 74241d9e-92aa-4f02-b586-cee42631aba2 Restore monitor on AWS instance using Jenkins job Show all stored logs command: $ hydra investigate show-logs 74241d9e-92aa-4f02-b586-cee42631aba2

Test id: 74241d9e-92aa-4f02-b586-cee42631aba2

Jenkins job URL

fgelcer commented 2 years ago

@ShlomiBalalis , seeing this issue on 2021.1.10 job (1TB job)... we need your advice and/or a manager issue

yarongilor commented 1 year ago

Reproduced on 2022.2 MV-SI:


Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/wait.py", line 64, in wait_for
    res = retry(func, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 404, in __call__
    do = self.iter(retry_state=retry_state)
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 360, in iter
    raise retry_exc.reraise()
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 194, in reraise
    raise self
tenacity.RetryError: RetryError[<Future at 0x7fd2356d0760 state=finished returned bool>]

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3669, in wrapper
    result = method(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2322, in disrupt_mgmt_backup
    self._mgmt_backup(backup_specific_tables=False)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2363, in _mgmt_backup
    status = mgr_task.wait_and_get_final_status(timeout=54000, step=5, only_final=True)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/mgmt/cli.py", line 410, in wait_and_get_final_status
    res = self.wait_for_status(list_status=list_final_status, timeout=timeout, step=step)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/mgmt/cli.py", line 374, in wait_for_status
    is_status_reached = wait.wait_for(func=self.is_status_in_list, step=step, throw_exc=True,
  File "/home/ubuntu/scylla-cluster-tests/sdcm/wait.py", line 75, in wait_for
    raise tenacity.RetryError(err) from ex
tenacity.RetryError: RetryError[Wait for: Waiting until task: backup/b47a249d-699c-435e-99cc-7c2bb37ea109 reaches status of: ['ERROR (4/4)', 'DONE']: timeout - 54000 seconds - expired]

Installation details

Kernel Version: 5.15.0-1019-aws Scylla version (or git commit hash): 2022.2.0~rc2-20220919.75d087a2b75a with build-id 463f1a57b82041a6c6b6441f0cbc26c8ad93091e Relocatable Package: http://downloads.scylladb.com/downloads/scylla-enterprise/relocatable/scylladb-2022.2/scylla-enterprise-x86_64-package-2022.2.0-rc2.0.20220919.75d087a2b75a.tar.gz Cluster size: 5 nodes (i3.4xlarge)

Scylla Nodes used in this run:

longevity-mv-si-4d-2022-2-db-node-4a622274-9 (34.241.80.176 | 10.4.2.114) (shards: 14)
longevity-mv-si-4d-2022-2-db-node-4a622274-8 (34.244.72.179 | 10.4.3.19) (shards: 14)
longevity-mv-si-4d-2022-2-db-node-4a622274-7 (54.75.8.156 | 10.4.0.116) (shards: 14)
longevity-mv-si-4d-2022-2-db-node-4a622274-6 (52.50.165.228 | 10.4.3.76) (shards: 14)
longevity-mv-si-4d-2022-2-db-node-4a622274-5 (34.247.163.201 | 10.4.0.224) (shards: 14)
longevity-mv-si-4d-2022-2-db-node-4a622274-4 (3.250.153.245 | 10.4.1.64) (shards: 14)
longevity-mv-si-4d-2022-2-db-node-4a622274-3 (34.241.21.128 | 10.4.0.248) (shards: 14)
longevity-mv-si-4d-2022-2-db-node-4a622274-2 (63.32.45.197 | 10.4.3.83) (shards: 14)
longevity-mv-si-4d-2022-2-db-node-4a622274-14 (54.194.149.238 | 10.4.2.43) (shards: 14)
longevity-mv-si-4d-2022-2-db-node-4a622274-13 (54.75.69.155 | 10.4.3.195) (shards: 14)
longevity-mv-si-4d-2022-2-db-node-4a622274-12 (54.246.37.34 | 10.4.1.150) (shards: 14)
longevity-mv-si-4d-2022-2-db-node-4a622274-11 (54.154.54.123 | 10.4.1.231) (shards: 14)
longevity-mv-si-4d-2022-2-db-node-4a622274-10 (34.247.156.237 | 10.4.0.167) (shards: 14)
longevity-mv-si-4d-2022-2-db-node-4a622274-1 (3.250.23.185 | 10.4.0.254) (shards: 14)

OS / Image: ami-00bd31f22bcf5ae1a (aws: eu-west-1)

Test: ics-longevity-mv-si-4days-test Test id: 4a622274-af57-417f-a1ec-4cc4c89af60e Test name: enterprise-2022.2/SCT_Enterprise_Features/ICS/ics-longevity-mv-si-4days-test Test config file(s):

longevity-mv-si-4days.yaml

Issue description

>>>>>>> Your description here... <<<<<<<

Restore Monitor Stack command: $ hydra investigate show-monitor 4a622274-af57-417f-a1ec-4cc4c89af60e
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs 4a622274-af57-417f-a1ec-4cc4c89af60e

Logs:

db-cluster-4a622274.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/4a622274-af57-417f-a1ec-4cc4c89af60e/20221001_132903/db-cluster-4a622274.tar.gz
monitor-set-4a622274.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/4a622274-af57-417f-a1ec-4cc4c89af60e/20221001_132903/monitor-set-4a622274.tar.gz
loader-set-4a622274.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/4a622274-af57-417f-a1ec-4cc4c89af60e/20221001_132903/loader-set-4a622274.tar.gz
normal-4a622274.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/4a622274-af57-417f-a1ec-4cc4c89af60e/20221001_132903/normal-4a622274.log.tar.gz
summary-4a622274.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/4a622274-af57-417f-a1ec-4cc4c89af60e/20221001_132903/summary-4a622274.log.tar.gz
events-4a622274.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/4a622274-af57-417f-a1ec-4cc4c89af60e/20221001_132903/events-4a622274.log.tar.gz
output-4a622274.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/4a622274-af57-417f-a1ec-4cc4c89af60e/20221001_132903/output-4a622274.log.tar.gz
debug-4a622274.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/4a622274-af57-417f-a1ec-4cc4c89af60e/20221001_132903/debug-4a622274.log.tar.gz
sct-4a622274.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/4a622274-af57-417f-a1ec-4cc4c89af60e/20221001_132903/sct-4a622274.log.tar.gz
error-4a622274.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/4a622274-af57-417f-a1ec-4cc4c89af60e/20221001_132903/error-4a622274.log.tar.gz
critical-4a622274.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/4a622274-af57-417f-a1ec-4cc4c89af60e/20221001_132903/critical-4a622274.log.tar.gz
raw_events-4a622274.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/4a622274-af57-417f-a1ec-4cc4c89af60e/20221001_132903/raw_events-4a622274.log.tar.gz
warning-4a622274.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/4a622274-af57-417f-a1ec-4cc4c89af60e/20221001_132903/warning-4a622274.log.tar.gz
email_data-4a622274.json.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/4a622274-af57-417f-a1ec-4cc4c89af60e/20221001_132903/email_data-4a622274.json.tar.gz
argus-4a622274.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/4a622274-af57-417f-a1ec-4cc4c89af60e/20221001_132903/argus-4a622274.log.tar.gz
left_processes-4a622274.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/4a622274-af57-417f-a1ec-4cc4c89af60e/20221001_132903/left_processes-4a622274.log.tar.gz

Jenkins job URL

fruch commented 1 year ago

@yarongilor backup task timeout, it can happen for lots of reasons, can you dig a bit into the log, getting the actual failure reason of the task ?

the size that was backup, if any was backed up.

And the rate of the backup (maybe this use case is very big, and backup isn't quick enough.

Just stating it's timeout, won't get it solved.

yarongilor commented 1 year ago

This error is reproduced in 2022.1-rc2 and not reproduced in later runs, and all manage backup nemesis passed ok in Scylla version: 2022.1~rc7-20220602.7abea3aad see in: https://argus.scylladb.com/test/1e612110-b61b-4e45-839a-df1a3e430a6c/runs?additionalRuns[]=95a38c88-1afa-4bf5-9b3f-243358224522

so the issue can be closed.

fruch commented 1 year ago

This error is reproduced in 2022.1-rc2 and not reproduced in later runs, and all manage backup nemesis passed ok in Scylla version: 2022.1~rc7-20220602.7abea3aad see in: https://argus.scylladb.com/test/1e612110-b61b-4e45-839a-df1a3e430a6c/runs?additionalRuns[]=95a38c88-1afa-4bf5-9b3f-243358224522

so the issue can be closed.

based on what exactly ? did you ever looked into the logs, what was the failure ?

it happened on multiple test-cases: https://70f106c98484448dbc4705050eb3f7e9.us-east-1.aws.found.io:9243/goto/a31d9825872d8e2115bd0d947c0b3ad2

bugs are not getting solved on there own...

yarongilor commented 1 year ago

@fruch , the multiple test-cases does show it is not reproduced after rc2. i opened a new scylla issue anyway: https://github.com/scylladb/scylla-manager/issues/3389

fgelcer commented 1 year ago

@fruch , the multiple test-cases does show it is not reproduced after rc2. i opened a new scylla issue anyway: scylladb/scylla-manager#3389

@yarongilor , if the failure was manager backup, why is the issue on scylla.git repo and not scylla-manager.git?

fgelcer commented 1 year ago

@ShlomiBalalis @yarongilor ?

fgelcer commented 1 year ago

@ShlomiBalalis @yarongilor ?

yarongilor commented 1 year ago

@fgelcer , IIRC, it was related to a manager bug with snapshot that is fixed on manager side.

fgelcer commented 1 year ago

@fgelcer , IIRC, it was related to a manager bug with snapshot that is fixed on manager side.

https://github.com/scylladb/scylla-cluster-tests/issues/4141#issuecomment-1308651863

fruch commented 1 year ago

@fgelcer , IIRC, it was related to a manager bug with snapshot that is fixed on manager side.

https://github.com/scylladb/scylla-cluster-tests/issues/4141#issuecomment-1308651863

There's a discussion on that ticket, pointing out that it's a manger issue deferred to 3.1, ask on that ticket that it should be moved to manager repo.

Anyhow I'm closing this honey pot issue.

We should have closed it a long long time ago, and I'm almost sure the issue from 1y ago isn't necessarily the one raised by yaron, just a similar side effect.

scylladb / scylla-cluster-tests