Closed: roydahan closed this issue 1 year ago
Same problem here:
Installation details
Kernel version: 5.4.0-1035-aws
Scylla version (or git commit hash): 4.7.dev-0.20211118.4b1bb26d5 with build-id 85eacde12729cd0333f0835d801a5a6163a01276
Cluster size: 3 nodes (i3.2xlarge)
Scylla running with shards number (live nodes):
longevity-twcs-48h-master-db-node-74241d9e-1 (13.51.107.81 | 10.0.1.189): 8 shards
longevity-twcs-48h-master-db-node-74241d9e-7 (13.51.13.220 | 10.0.0.162): 8 shards
longevity-twcs-48h-master-db-node-74241d9e-9 (16.170.229.52 | 10.0.0.62): 8 shards
Scylla running with shards number (terminated nodes):
longevity-twcs-48h-master-db-node-74241d9e-2 (13.51.242.24 | 10.0.3.227): 8 shards
longevity-twcs-48h-master-db-node-74241d9e-5 (13.48.26.59 | 10.0.1.129): 8 shards
longevity-twcs-48h-master-db-node-74241d9e-4 (13.48.178.110 | 10.0.0.108): 8 shards
longevity-twcs-48h-master-db-node-74241d9e-6 (13.53.126.163 | 10.0.1.188): 8 shards
longevity-twcs-48h-master-db-node-74241d9e-8 (13.53.170.213 | 10.0.3.32): 8 shards
longevity-twcs-48h-master-db-node-74241d9e-3 (13.51.6.232 | 10.0.1.204): 8 shards
longevity-twcs-48h-master-db-node-74241d9e-10 (13.51.106.29 | 10.0.2.138): 8 shards
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0231564df82a982e2
(aws: eu-north-1)
Test: longevity-twcs-48h-test
Test name: longevity_twcs_test.TWCSLongevityTest.test_custom_time
Test config file(s):
Issue description
====================================
PUT ISSUE DESCRIPTION HERE
====================================
Restore Monitor Stack command: $ hydra investigate show-monitor 74241d9e-92aa-4f02-b586-cee42631aba2
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs 74241d9e-92aa-4f02-b586-cee42631aba2
Test id: 74241d9e-92aa-4f02-b586-cee42631aba2
Logs:
critical - https://cloudius-jenkins-test.s3.amazonaws.com/74241d9e-92aa-4f02-b586-cee42631aba2/20211120_174345/critical.log.tar.gz
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/74241d9e-92aa-4f02-b586-cee42631aba2/20211120_174345/db-cluster-74241d9e.tar.gz
debug - https://cloudius-jenkins-test.s3.amazonaws.com/74241d9e-92aa-4f02-b586-cee42631aba2/20211120_174345/debug.log.tar.gz
email_data - https://cloudius-jenkins-test.s3.amazonaws.com/74241d9e-92aa-4f02-b586-cee42631aba2/20211120_174345/email_data.json.tar.gz
error - https://cloudius-jenkins-test.s3.amazonaws.com/74241d9e-92aa-4f02-b586-cee42631aba2/20211120_174345/error.log.tar.gz
event - https://cloudius-jenkins-test.s3.amazonaws.com/74241d9e-92aa-4f02-b586-cee42631aba2/20211120_174345/events.log.tar.gz
left_processes - https://cloudius-jenkins-test.s3.amazonaws.com/74241d9e-92aa-4f02-b586-cee42631aba2/20211120_174345/left_processes.log.tar.gz
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/74241d9e-92aa-4f02-b586-cee42631aba2/20211120_174345/loader-set-74241d9e.tar.gz
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/74241d9e-92aa-4f02-b586-cee42631aba2/20211120_174345/monitor-set-74241d9e.tar.gz
normal - https://cloudius-jenkins-test.s3.amazonaws.com/74241d9e-92aa-4f02-b586-cee42631aba2/20211120_174345/normal.log.tar.gz
output - https://cloudius-jenkins-test.s3.amazonaws.com/74241d9e-92aa-4f02-b586-cee42631aba2/20211120_174345/output.log.tar.gz
event - https://cloudius-jenkins-test.s3.amazonaws.com/74241d9e-92aa-4f02-b586-cee42631aba2/20211120_174345/raw_events.log.tar.gz
summary - https://cloudius-jenkins-test.s3.amazonaws.com/74241d9e-92aa-4f02-b586-cee42631aba2/20211120_174345/summary.log.tar.gz
warning - https://cloudius-jenkins-test.s3.amazonaws.com/74241d9e-92aa-4f02-b586-cee42631aba2/20211120_174345/warning.log.tar.gz
@ShlomiBalalis, seeing this issue on a 2021.1.10 job (1TB job)... we need your advice and/or a manager issue
Reproduced on 2022.2 MV-SI:
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/wait.py", line 64, in wait_for
res = retry(func, **kwargs)
File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 404, in __call__
do = self.iter(retry_state=retry_state)
File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 360, in iter
raise retry_exc.reraise()
File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 194, in reraise
raise self
tenacity.RetryError: RetryError[<Future at 0x7fd2356d0760 state=finished returned bool>]
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3669, in wrapper
result = method(*args, **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2322, in disrupt_mgmt_backup
self._mgmt_backup(backup_specific_tables=False)
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2363, in _mgmt_backup
status = mgr_task.wait_and_get_final_status(timeout=54000, step=5, only_final=True)
File "/home/ubuntu/scylla-cluster-tests/sdcm/mgmt/cli.py", line 410, in wait_and_get_final_status
res = self.wait_for_status(list_status=list_final_status, timeout=timeout, step=step)
File "/home/ubuntu/scylla-cluster-tests/sdcm/mgmt/cli.py", line 374, in wait_for_status
is_status_reached = wait.wait_for(func=self.is_status_in_list, step=step, throw_exc=True,
File "/home/ubuntu/scylla-cluster-tests/sdcm/wait.py", line 75, in wait_for
raise tenacity.RetryError(err) from ex
tenacity.RetryError: RetryError[Wait for: Waiting until task: backup/b47a249d-699c-435e-99cc-7c2bb37ea109 reaches status of: ['ERROR (4/4)', 'DONE']: timeout - 54000 seconds - expired]
Kernel Version: 5.15.0-1019-aws
Scylla version (or git commit hash): 2022.2.0~rc2-20220919.75d087a2b75a
with build-id 463f1a57b82041a6c6b6441f0cbc26c8ad93091e
Relocatable Package: http://downloads.scylladb.com/downloads/scylla-enterprise/relocatable/scylladb-2022.2/scylla-enterprise-x86_64-package-2022.2.0-rc2.0.20220919.75d087a2b75a.tar.gz
Cluster size: 5 nodes (i3.4xlarge)
Scylla Nodes used in this run:
OS / Image: ami-00bd31f22bcf5ae1a
(aws: eu-west-1)
Test: ics-longevity-mv-si-4days-test
Test id: 4a622274-af57-417f-a1ec-4cc4c89af60e
Test name: enterprise-2022.2/SCT_Enterprise_Features/ICS/ics-longevity-mv-si-4days-test
Test config file(s):
>>>>>>> Your description here... <<<<<<<
$ hydra investigate show-monitor 4a622274-af57-417f-a1ec-4cc4c89af60e
$ hydra investigate show-logs 4a622274-af57-417f-a1ec-4cc4c89af60e
@yarongilor, a backup task timeout can happen for lots of reasons. Can you dig a bit into the log and get the actual failure reason of the task?
Also the size that was backed up, if anything was backed up,
and the rate of the backup (maybe this use case is very big, and the backup isn't quick enough).
Just stating that it timed out won't get it solved.
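As a rough sanity check on the rate question, here is a back-of-the-envelope calculation (the 54000 s timeout comes from the traceback above; the 1 TB dataset size is an assumed figure for illustration, not a measurement from this run):

```python
# Back-of-the-envelope: what average backup rate does the task timeout imply?
timeout_s = 54_000            # wait_and_get_final_status timeout from the traceback
dataset_bytes = 1 * 10**12    # ASSUMED 1 TB dataset, for illustration only

required_rate_mb_s = dataset_bytes / timeout_s / 10**6
print(f"required average backup rate: {required_rate_mb_s:.1f} MB/s")  # ~18.5 MB/s
```

If the measured upload rate in the manager logs is well below that, the timeout is simply a throughput problem rather than a stuck task.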
This error was reproduced in 2022.1-rc2 and not reproduced in later runs, and all manager backup nemeses passed OK in Scylla version 2022.1~rc7-20220602.7abea3aad; see: https://argus.scylladb.com/test/1e612110-b61b-4e45-839a-df1a3e430a6c/runs?additionalRuns[]=95a38c88-1afa-4bf5-9b3f-243358224522
So the issue can be closed.
Based on what exactly? Did you ever look into the logs? What was the failure?
it happened on multiple test-cases: https://70f106c98484448dbc4705050eb3f7e9.us-east-1.aws.found.io:9243/goto/a31d9825872d8e2115bd0d947c0b3ad2
Bugs are not getting solved on their own...
@fruch, the multiple test cases do show it is not reproduced after rc2. I opened a new scylla-manager issue anyway: https://github.com/scylladb/scylla-manager/issues/3389
@yarongilor , if the failure was manager backup, why is the issue on scylla.git repo and not scylla-manager.git?
@ShlomiBalalis @yarongilor ?
@fgelcer, IIRC, it was related to a manager bug with snapshots that is fixed on the manager side.
https://github.com/scylladb/scylla-cluster-tests/issues/4141#issuecomment-1308651863
There's a discussion on that ticket pointing out that it's a manager issue deferred to 3.1; ask on that ticket that it be moved to the manager repo.
Anyhow, I'm closing this honeypot issue.
We should have closed it a long time ago, and I'm almost sure the issue from a year ago isn't necessarily the one raised by Yaron, just a similar side effect.
@ShlomiBalalis, can you please advise? I see this error now in the master 200 GB longevity:
is it a different issue?
Installation details
Kernel version: 5.4.0-1035-aws
Scylla version (or git commit hash): 4.6.dev-0.20211107.4950ce539 with build-id 8d7a7f85964575a8775f8358cef3ff74141e07fd
Cluster size: 4 nodes (i3.4xlarge)
Scylla running with shards number (live nodes):
longevity-200gb-48h-verify-limited--db-node-22e086a5-1 (13.53.161.246 | 10.0.3.162): 14 shards
longevity-200gb-48h-verify-limited--db-node-22e086a5-3 (13.53.123.18 | 10.0.0.75): 14 shards
longevity-200gb-48h-verify-limited--db-node-22e086a5-4 (13.49.65.248 | 10.0.0.221): 14 shards
longevity-200gb-48h-verify-limited--db-node-22e086a5-5 (16.170.159.255 | 10.0.0.232): 14 shards
Scylla running with shards number (terminated nodes):
longevity-200gb-48h-verify-limited--db-node-22e086a5-2 (13.53.122.101 | 10.0.1.223): 14 shards
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0ac0c6c9087bf4b09
(aws: eu-north-1)
Test: longevity-200gb-48h
Test name: longevity_test.LongevityTest.test_custom_time
Test config file(s):
Issue description
====================================
PUT ISSUE DESCRIPTION HERE
====================================
Restore Monitor Stack command: $ hydra investigate show-monitor 22e086a5-c3b0-495f-b061-0626f4356154
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs 22e086a5-c3b0-495f-b061-0626f4356154
Test id: 22e086a5-c3b0-495f-b061-0626f4356154
Logs:
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/22e086a5-c3b0-495f-b061-0626f4356154/20211110_220939/db-cluster-22e086a5.tar.gz
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/22e086a5-c3b0-495f-b061-0626f4356154/20211110_220939/loader-set-22e086a5.tar.gz
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/22e086a5-c3b0-495f-b061-0626f4356154/20211110_220939/monitor-set-22e086a5.tar.gz
sct-runner - https://cloudius-jenkins-test.s3.amazonaws.com/22e086a5-c3b0-495f-b061-0626f4356154/20211110_220939/sct-runner-22e086a5.tar.gz
Jenkins job URL
Originally posted by @yarongilor in https://github.com/scylladb/scylla-cluster-tests/issues/3902#issuecomment-969142502