scylladb / scylla-manager

The Scylla Manager
https://manager.docs.scylladb.com/stable/

sctool restore failed with error: "failed to open source object: object not found" #3689

juliayakovlev opened 8 months ago

juliayakovlev commented 8 months ago

Issue description

MgmtRestore nemesis failed with error:

```
< t:2024-01-13 00:13:30,049 f:remote_base.py  l:521  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command
"sudo sctool restore -c 526e048f-bb21-4f5c-a8d5-037023bf7467 --restore-schema --location s3:manager-backup-tests-permanent-snapshots-us-east-1  --snapshot-tag sm_20230702235739UTC"...
```

```
Jan 13 02:44:47 longevity-twcs-48h-2023-1-monitor-node-54645511-1 scylla-manager[13653]: {"L":"INFO","T":"2024-01-13T02:44:47.897Z","N":"scheduler.526e048f","M":"Run ended with ERROR","task":"restore/5674514c-c882-4537-b66e-afc451552bde","status":"ERROR","cause":"not restored bundles [138]: restore batch: wait for job: job error (1705094012): failed to open source object: object not found","duration":"11m24.728567216s","_trace_id":"5slN6cw0Reaodl99ZUoP3A"}
```

Client version: 3.2.5-0.20231206.8b378dea
Server version: 3.2.5-0.20231206.8b378dea

Impact

sctool restore failed. No other impact observed.

How frequently does it reproduce?

Found https://github.com/scylladb/scylladb/issues/16321. Not sure if it is the same or a similar issue.

Installation details

Kernel Version: 5.15.0-1051-aws
Scylla version (or git commit hash): 2023.1.4-20240112.12c616e7f0cf with build-id e7263a4aa92cf866b98cf680bd68d7198c9690c0

Cluster size: 4 nodes (i3en.2xlarge)

Scylla Nodes used in this run:

OS / Image: ami-08b5f8ff1565ab9f0 (aws: undefined_region)

Test: longevity-twcs-48h-test
Test id: 54645511-775e-4d02-8fd8-35a38a4a2df8
Test name: enterprise-2023.1/longevity/longevity-twcs-48h-test
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 54645511-775e-4d02-8fd8-35a38a4a2df8`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=54645511-775e-4d02-8fd8-35a38a4a2df8)
- Show all stored logs command: `$ hydra investigate show-logs 54645511-775e-4d02-8fd8-35a38a4a2df8`

Logs:

- **db-cluster-54645511.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/54645511-775e-4d02-8fd8-35a38a4a2df8/20240113_101418/db-cluster-54645511.tar.gz
- **sct-runner-events-54645511.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/54645511-775e-4d02-8fd8-35a38a4a2df8/20240113_101418/sct-runner-events-54645511.tar.gz
- **sct-54645511.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/54645511-775e-4d02-8fd8-35a38a4a2df8/20240113_101418/sct-54645511.log.tar.gz
- **loader-set-54645511.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/54645511-775e-4d02-8fd8-35a38a4a2df8/20240113_101418/loader-set-54645511.tar.gz
- **monitor-set-54645511.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/54645511-775e-4d02-8fd8-35a38a4a2df8/20240113_101418/monitor-set-54645511.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/enterprise-2023.1/job/longevity/job/longevity-twcs-48h-test/7/)
[Argus](https://argus.scylladb.com/test/3c965e5e-a758-4f96-9a5d-2ad5a58921bb/runs?additionalRuns[]=54645511-775e-4d02-8fd8-35a38a4a2df8)
Michal-Leszczynski commented 8 months ago

> Found https://github.com/scylladb/scylladb/issues/16321. Not sure if it is the same or a similar issue.

I don't think it's a similar issue. The mentioned issue was about restoring the schema multiple times on the same cluster, which is not supported, and I haven't seen that happening here.

It looks like the file me-138-big-Index.db is present in the SM manifest but missing from the backup location, and that's what causes the restore to fail.
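For reference, a minimal sketch of how this can be cross-checked from the command line; the `backup/meta/` prefix and the gzipped-JSON manifest format are assumptions about SM's backup layout, not paths taken from the logs:

```bash
BUCKET=manager-backup-tests-permanent-snapshots-us-east-1

# Find the manifest(s) for the snapshot tag used by the restore
# (assumes SM keeps gzipped JSON manifests under backup/meta/):
aws s3 ls --recursive "s3://${BUCKET}/backup/meta/" | grep sm_20230702235739UTC

# Stream a manifest to stdout and check whether bundle 138 is listed;
# <manifest-key> is a placeholder for a key from the listing above:
aws s3 cp "s3://${BUCKET}/<manifest-key>" - | gunzip | grep -c me-138
```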

From the SM logs it looks like the test scenario goes like this:

But the strange thing is that the backup generated snapshot tag sm_20240112221504UTC, while both restores used snapshot tag sm_20230702235739UTC. Is this expected? Where does the snapshot tag used for the restore come from, and is there a chance that this backup is broken (missing s3:manager-backup-tests-permanent-snapshots-us-east-1/backup/sst/cluster/0f0f556f-eb17-4012-b39c-f99a35828c04/dc/us-east/node/15430605-a376-4758-9205-014ab34ad5d5/keyspace/100gb_sizetiered_2022_2/table/standard1/07206f60192311eea6af23bef1a3e064/me-138-big-Index.db)?
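As for checking which snapshot tags actually exist in that location, `sctool backup list` should help; a sketch, assuming the flags behave in 3.2.5 as in the current docs:

```bash
# List backups stored in this location, including those taken of other
# clusters, to see whether sm_20230702235739UTC is among them:
sudo sctool backup list -c 526e048f-bb21-4f5c-a8d5-037023bf7467 \
    --all-clusters \
    -L s3:manager-backup-tests-permanent-snapshots-us-east-1
```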

Michal-Leszczynski commented 8 months ago

I validated that this file is indeed missing from the s3 dir, so it's either a problem with the test (using a predefined backup instead of a fresh one) or a problem with the predefined backup itself, which is not part of the test. @juliayakovlev can we close this issue?
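For the record, a check like the following is enough to reproduce the lookup failure (a sketch using the stock AWS CLI; the key is the one from the error analysis above):

```bash
# head-object returns a 404 (Not Found) error when the key does not exist:
aws s3api head-object \
    --bucket manager-backup-tests-permanent-snapshots-us-east-1 \
    --key "backup/sst/cluster/0f0f556f-eb17-4012-b39c-f99a35828c04/dc/us-east/node/15430605-a376-4758-9205-014ab34ad5d5/keyspace/100gb_sizetiered_2022_2/table/standard1/07206f60192311eea6af23bef1a3e064/me-138-big-Index.db"
```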

juliayakovlev commented 8 months ago

@ShlomiBalalis can you take a look at this, please?

mykaul commented 8 months ago

@juliayakovlev , @ShlomiBalalis - any updates?

juliayakovlev commented 8 months ago

> @juliayakovlev , @ShlomiBalalis - any updates?

@ShlomiBalalis can you advise, please?

ShlomiBalalis commented 8 months ago

Hi! Sorry for the long silence. Yes, the file is missing, but I can't say for certain whether it was missing from the start, ever since we created the backup, or went missing somewhere down the road. There is no lifecycle rule that would cause this file to be deleted, so if it was properly created in the first place, I don't know how it went missing. I'll try to find the logs of the original run to see if they are of any help.
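For completeness, here is a sketch of two AWS CLI checks that could narrow down how the object went missing (assuming stock S3 tooling; whether versioning is enabled on this bucket is an open question):

```bash
BUCKET=manager-backup-tests-permanent-snapshots-us-east-1
KEY="backup/sst/cluster/0f0f556f-eb17-4012-b39c-f99a35828c04/dc/us-east/node/15430605-a376-4758-9205-014ab34ad5d5/keyspace/100gb_sizetiered_2022_2/table/standard1/07206f60192311eea6af23bef1a3e064/me-138-big-Index.db"

# 1. Confirm there is no lifecycle configuration that could expire backup
#    files (errors with NoSuchLifecycleConfiguration when none is set):
aws s3api get-bucket-lifecycle-configuration --bucket "${BUCKET}"

# 2. If versioning is enabled, a delete marker in the output would show
#    whether (and when) the object was deleted after being uploaded:
aws s3api list-object-versions --bucket "${BUCKET}" --prefix "${KEY}"
```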

> I validated that this file is indeed missing from the s3 dir, so it's either a problem with the test (using a predefined backup instead of a fresh one) or a problem with the predefined backup itself, which is not part of the test. @juliayakovlev can we close this issue?

The file was created over six months ago as part of another test run. Would that be a problem?

Michal-Leszczynski commented 8 months ago

> The file was created over six months ago as part of another test run. Would that be a problem?

SM should have no problem with restoring old backups.

juliayakovlev commented 7 months ago

@ShlomiBalalis any news? It continues to fail.

Michal-Leszczynski commented 7 months ago

@ShlomiBalalis ping

Michal-Leszczynski commented 6 months ago

@mikliapko is this something that you could take care of? I mean validating whether this is a problem with some incomplete, cached backup, or an actual issue.