scylladb / scylla-manager


Restoring a 3.1 backup with 3.2 fails, with files from the bucket appearing to be missing #3548

Open ShlomiBalalis opened 1 year ago

ShlomiBalalis commented 1 year ago

Issue description

At first, while using Manager 3.1.2, we created a backup, which was successful:

< t:2023-08-16 19:54:15,367 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo sctool backup -c 4105638a-e5a7-4c68-b8c2-614904a59131 --keyspace keyspace1  --location s3:manager-backup-tests-us-east-1  --cron '56 19 * * *' " finished with status 0
...
< t:2023-08-16 19:56:24,300 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo sctool  -c 4105638a-e5a7-4c68-b8c2-614904a59131 progress backup/6d118b0f-2652-4258-a797-9a191b0b26f2"...
< t:2023-08-16 19:56:24,451 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Run:            ec1624fd-3c6e-11ee-bd30-02b041669e61
< t:2023-08-16 19:56:24,451 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Status:         DONE
< t:2023-08-16 19:56:24,451 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Start time:     16 Aug 23 19:56:00 UTC
< t:2023-08-16 19:56:24,451 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > End time:       16 Aug 23 19:56:12 UTC
< t:2023-08-16 19:56:24,451 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Duration:       12s
< t:2023-08-16 19:56:24,451 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Progress:       100%
< t:2023-08-16 19:56:24,451 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Snapshot Tag:   sm_20230816195600UTC
< t:2023-08-16 19:56:24,451 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Datacenters:    
< t:2023-08-16 19:56:24,451 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >   - us-east
< t:2023-08-16 19:56:24,451 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > 
< t:2023-08-16 19:56:24,451 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > ╭─────────────┬──────────┬──────────┬──────────┬──────────────┬────────╮
< t:2023-08-16 19:56:24,451 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > │ Host        │ Progress │     Size │  Success │ Deduplicated │ Failed │
< t:2023-08-16 19:56:24,452 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > ├─────────────┼──────────┼──────────┼──────────┼──────────────┼────────┤
< t:2023-08-16 19:56:24,452 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > │ 10.12.0.108 │     100% │ 283.379M │ 283.379M │            0 │      0 │
< t:2023-08-16 19:56:24,452 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > │ 10.12.1.151 │     100% │ 283.379M │ 283.379M │            0 │      0 │
< t:2023-08-16 19:56:24,452 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > │ 10.12.1.249 │     100% │ 283.377M │ 283.377M │            0 │      0 │
< t:2023-08-16 19:56:24,452 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > ╰─────────────┴──────────┴──────────┴──────────┴──────────────┴────────╯

Afterwards, we upgraded the manager to 3.2.0:

< t:2023-08-16 20:01:12,692 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo sctool version"...
< t:2023-08-16 20:01:12,770 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Client version: 3.2.0-0.20230815.9db309c5
< t:2023-08-16 20:01:12,772 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Server version: 3.2.0-0.20230815.9db309c5

We then attempted to restore the aforementioned backup:

< t:2023-08-16 20:05:57,707 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo sctool restore -c 4105638a-e5a7-4c68-b8c2-614904a59131 --restore-tables --location s3:manager-backup-tests-us-east-1  --snapshot-tag sm_20230816195600UTC" finished with status 0
< t:2023-08-16 20:05:57,708 f:cli.py          l:1126 c:sdcm.mgmt.cli        p:DEBUG > sctool output: restore/03db246a-0533-4a91-9079-9ecdf41884f2

But the restore failed, reporting that files are missing from the bucket:

< t:2023-08-16 20:06:35,842 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo sctool  -c 4105638a-e5a7-4c68-b8c2-614904a59131 progress restore/03db246a-0533-4a91-9079-9ecdf41884f2" finished with status 0
< t:2023-08-16 20:06:35,843 f:cli.py          l:1126 c:sdcm.mgmt.cli        p:DEBUG > sctool output: Restore progress
< t:2023-08-16 20:06:35,843 f:cli.py          l:1126 c:sdcm.mgmt.cli        p:DEBUG > Run:              500d5a78-3c70-11ee-8668-02b041669e61
< t:2023-08-16 20:06:35,843 f:cli.py          l:1126 c:sdcm.mgmt.cli        p:DEBUG > Status:           ERROR (restoring backed-up data)
< t:2023-08-16 20:06:35,843 f:cli.py          l:1126 c:sdcm.mgmt.cli        p:DEBUG > Cause:            not restored bundles [3 2]: wait on rclone job, id: 1692216016, host
< t:2023-08-16 20:06:35,843 f:cli.py          l:1126 c:sdcm.mgmt.cli        p:DEBUG >  10.12.0.108: job error (1692216016): failed to open source object: object not found; backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-2-big-Data.db failed to open source object: object not found; backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-2-big-Statistics.db failed to open source object: object not found; backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-2-big-Summary.db failed to open source object: object not found; backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-2-big-TOC.txt failed to open source object: object not found; backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-CRC.db failed to open source object: object not found; backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-Data.db failed to open source object: object not found; backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-Digest.crc32 failed to open source object: object not found; backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-Filter.db failed to open source object: object not found; backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-Index.db failed to open source object: object not found; backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-Scylla.db failed to open source object: object not found; backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-Statistics.db failed to open source object: object not found; backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-Summary.db failed to open source object: object not found; backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-TOC.txt failed to open source object: object not found
< t:2023-08-16 20:06:35,843 f:cli.py          l:1126 c:sdcm.mgmt.cli        p:DEBUG > Start time:       16 Aug 23 20:05:57 UTC
< t:2023-08-16 20:06:35,843 f:cli.py          l:1126 c:sdcm.mgmt.cli        p:DEBUG > End time: 16 Aug 23 20:06:05 UTC
< t:2023-08-16 20:06:35,843 f:cli.py          l:1126 c:sdcm.mgmt.cli        p:DEBUG > Duration: 8s
< t:2023-08-16 20:06:35,843 f:cli.py          l:1126 c:sdcm.mgmt.cli        p:DEBUG > Progress: 0% | 0%
< t:2023-08-16 20:06:35,843 f:cli.py          l:1126 c:sdcm.mgmt.cli        p:DEBUG > Snapshot Tag:     sm_20230816195600UTC
< t:2023-08-16 20:06:35,843 f:cli.py          l:1126 c:sdcm.mgmt.cli        p:DEBUG > 
< t:2023-08-16 20:06:35,843 f:cli.py          l:1126 c:sdcm.mgmt.cli        p:DEBUG > ╭───────────┬──────────┬──────────┬─────────┬────────────┬────────╮
< t:2023-08-16 20:06:35,843 f:cli.py          l:1126 c:sdcm.mgmt.cli        p:DEBUG > │ Keyspace  │ Progress │     Size │ Success │ Downloaded │ Failed │
< t:2023-08-16 20:06:35,843 f:cli.py          l:1126 c:sdcm.mgmt.cli        p:DEBUG > ├───────────┼──────────┼──────────┼─────────┼────────────┼────────┤
< t:2023-08-16 20:06:35,843 f:cli.py          l:1126 c:sdcm.mgmt.cli        p:DEBUG > │ keyspace2 │  0% | 0% │ 849.762M │       0 │          0 │      0 │
< t:2023-08-16 20:06:35,843 f:cli.py          l:1126 c:sdcm.mgmt.cli        p:DEBUG > │ keyspace1 │  0% | 0% │ 849.861M │       0 │          0 │      0 │
< t:2023-08-16 20:06:35,843 f:cli.py          l:1126 c:sdcm.mgmt.cli        p:DEBUG > ╰───────────┴──────────┴──────────┴─────────┴────────────┴────────╯

To summarize, the files that the restore task reports as missing are:

backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-2-big-Data.db
backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-2-big-Statistics.db
backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-2-big-Summary.db
backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-2-big-TOC.txt
backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-CRC.db
backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-Data.db
backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-Digest.crc32
backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-Filter.db
backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-Index.db
backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-Scylla.db
backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-Statistics.db
backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-Summary.db
backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-TOC.txt

These files do indeed appear to be missing from the actual bucket: https://s3.console.aws.amazon.com/s3/buckets/manager-backup-tests-us-east-1?region=us-east-1&prefix=backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/&showversions=false (see the attached screenshot from 2023-08-30 05-34-21).
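
For completeness, the same check can be done from the command line rather than the S3 console. This is just a sketch, assuming an AWS CLI profile with read access to the bucket; the bucket name and prefix are taken verbatim from the error message above:

```bash
# List whatever objects actually exist under the SSTable directory that the
# restore task reports as missing (per the error above, none are expected).
aws s3 ls --recursive \
  "s3://manager-backup-tests-us-east-1/backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/"
```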

Impact

Restoring older backups would not be possible.

Installation details

Kernel Version: 5.15.0-1034-aws
Scylla version (or git commit hash): 5.0.13-20230423.a0ca8abe4 with build-id 1196f46d1be44927c95ac67703df3ab2ef98a332

Initial manager version: 3.1.2-0.20230704.bd349aa4
Manager version after upgrade: 3.2.0-0.20230815.9db309c5

Cluster size: 3 nodes (i4i.large)

Scylla Nodes used in this run:

OS / Image: ami-0659419f07e694eaf (aws: undefined_region)

Test: ubuntu20-upgrade-test
Test id: c60a1216-9e68-4134-a42a-a34ce6a71a45
Test name: manager-3.2/ubuntu20-upgrade-test
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor c60a1216-9e68-4134-a42a-a34ce6a71a45`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=c60a1216-9e68-4134-a42a-a34ce6a71a45)
- Show all stored logs command: `$ hydra investigate show-logs c60a1216-9e68-4134-a42a-a34ce6a71a45`

Logs:

- **db-cluster-c60a1216.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/c60a1216-9e68-4134-a42a-a34ce6a71a45/20230816_200800/db-cluster-c60a1216.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/c60a1216-9e68-4134-a42a-a34ce6a71a45/20230816_200800/db-cluster-c60a1216.tar.gz)
- **sct-runner-events-c60a1216.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/c60a1216-9e68-4134-a42a-a34ce6a71a45/20230816_200800/sct-runner-events-c60a1216.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/c60a1216-9e68-4134-a42a-a34ce6a71a45/20230816_200800/sct-runner-events-c60a1216.tar.gz)
- **sct-c60a1216.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/c60a1216-9e68-4134-a42a-a34ce6a71a45/20230816_200800/sct-c60a1216.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/c60a1216-9e68-4134-a42a-a34ce6a71a45/20230816_200800/sct-c60a1216.log.tar.gz)
- **loader-set-c60a1216.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/c60a1216-9e68-4134-a42a-a34ce6a71a45/20230816_200800/loader-set-c60a1216.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/c60a1216-9e68-4134-a42a-a34ce6a71a45/20230816_200800/loader-set-c60a1216.tar.gz)
- **monitor-set-c60a1216.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/c60a1216-9e68-4134-a42a-a34ce6a71a45/20230816_200800/monitor-set-c60a1216.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/c60a1216-9e68-4134-a42a-a34ce6a71a45/20230816_200800/monitor-set-c60a1216.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/manager-3.2/job/ubuntu20-upgrade-test/1/)
[Argus](https://argus.scylladb.com/test/f7c7d71e-6dc1-475b-bb95-6589d4490f37/runs?additionalRuns[]=c60a1216-9e68-4134-a42a-a34ce6a71a45)
mykaul commented 1 year ago
  1. Scylla 5.0.13 is not supported anymore.
  2. So if the files are not in the bucket, where are they? Who deleted them, or were they not backed up properly in the first place?
Michal-Leszczynski commented 1 year ago
< t:2023-08-16 19:54:15,367 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo sctool backup -c 4105638a-e5a7-4c68-b8c2-614904a59131 --keyspace keyspace1  --location s3:manager-backup-tests-us-east-1  --cron '56 19 * * *' " finished with status 0

Well, only keyspace1 is being backed up, so it's not surprising that files from keyspace2 are missing. I'm investigating why SM would try to restore keyspace2 if it wasn't part of the backup.

Michal-Leszczynski commented 1 year ago

What is interesting here is that the backup bucket contains two different clusters, 4105638a-e5a7-4c68-b8c2-614904a59131 and 0fa19c7a-d21a-4bff-9f19-80c33dd2ae15, and both of them have a backup with the same snapshot tag sm_20230816195600UTC. This is possible only if two different clusters are being backed up into the same remote location at the very same time (up to a 1-second difference in backup start time).
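
(For context, such a collision is possible because the snapshot tag has only one-second resolution: as the logs above show, it is just the backup start time in UTC prefixed with `sm_`. A minimal illustration of the tag format, not SM code:)

```bash
# The snapshot tag visible in the logs (sm_20230816195600UTC) encodes the backup
# start time in UTC with one-second resolution, so two backups started within the
# same second produce identical tags.
date -u +sm_%Y%m%d%H%M%SUTC
```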

In this test scenario, the provided SM logs show that SM backed up cluster 4105638a-e5a7-4c68-b8c2-614904a59131 with snapshot tag sm_20230816195600UTC, but the following restore tried to restore the backup of cluster 0fa19c7a-d21a-4bff-9f19-80c33dd2ae15 with the same snapshot tag.

But even so, I would expect the restore to end successfully, restoring the data from this second cluster, so it's still puzzling why files from that cluster are missing.

@ShlomiBalalis could you verify whether what I'm saying makes sense? Moreover, please describe what tests were conducted on this second cluster (0fa19c7a-d21a-4bff-9f19-80c33dd2ae15), because perhaps those test scenarios are actually responsible for deleting those files.

Michal-Leszczynski commented 1 year ago

@ShlomiBalalis ping

fgelcer commented 1 year ago

@ShlomiBalalis

karol-kokoszka commented 1 year ago

@ShlomiBalalis could you please reply to the comment https://github.com/scylladb/scylla-manager/issues/3548#issuecomment-1699124738? We need to understand whether there is really a bug in the manager and whether a fix should be included in 3.2.2.

ShlomiBalalis commented 1 year ago

> What is interesting here is that the backup bucket contains two different clusters, 4105638a-e5a7-4c68-b8c2-614904a59131 and 0fa19c7a-d21a-4bff-9f19-80c33dd2ae15, and both of them have a backup with the same snapshot tag sm_20230816195600UTC. This is possible only if ~~there are 2 instances of SM running backup~~ two different clusters are being backed up into the same remote location at the very same time (up to a 1-second difference in backup start time).
>
> In this test scenario, the provided SM logs show that SM backed up cluster 4105638a-e5a7-4c68-b8c2-614904a59131 with snapshot tag sm_20230816195600UTC, but the following restore tried to restore the backup of cluster 0fa19c7a-d21a-4bff-9f19-80c33dd2ae15 with the same snapshot tag.
>
> But even so, I would expect the restore to end successfully, restoring the data from this second cluster, so it's still puzzling why files from that cluster are missing.
>
> @ShlomiBalalis could you verify whether what I'm saying makes sense? Moreover, please describe what tests were conducted on this second cluster (0fa19c7a-d21a-4bff-9f19-80c33dd2ae15), because perhaps those test scenarios are actually responsible for deleting those files.

I have no idea where the second cluster came from. We only executed one `cluster add` in this test, which gave us 4105638a-e5a7-4c68-b8c2-614904a59131:

< t:2023-08-16 19:50:38,371 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo sctool cluster add --host=10.12.1.249  --name=cluster_under_test --auth-token c60a1216-9e68-4134-a42a-a34ce6a71a45" finished with status 0
< t:2023-08-16 19:50:38,371 f:cli.py          l:1126 c:sdcm.mgmt.cli        p:DEBUG > sctool output: 4105638a-e5a7-4c68-b8c2-614904a59131

I see that the second cluster is only mentioned with regard to the restore/03db246a-0533-4a91-9079-9ecdf41884f2 task, but that task itself was created against the 4105638a-e5a7-4c68-b8c2-614904a59131 cluster:

< t:2023-08-16 20:05:57,707 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo sctool restore -c 4105638a-e5a7-4c68-b8c2-614904a59131 --restore-tables --location s3:manager-backup-tests-us-east-1  --snapshot-tag sm_20230816195600UTC" finished with status 0
< t:2023-08-16 20:05:57,708 f:cli.py          l:1126 c:sdcm.mgmt.cli        p:DEBUG > sctool output: restore/03db246a-0533-4a91-9079-9ecdf41884f2

The snapshot tag that the restore task used, sm_20230816195600UTC, was created by the backup/6d118b0f-2652-4258-a797-9a191b0b26f2 task, which was also created against 4105638a-e5a7-4c68-b8c2-614904a59131:

< t:2023-08-16 19:54:15,367 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo sctool backup -c 4105638a-e5a7-4c68-b8c2-614904a59131 --keyspace keyspace1  --location s3:manager-backup-tests-us-east-1  --cron '56 19 * * *' " finished with status 0
< t:2023-08-16 19:54:15,367 f:cli.py          l:1126 c:sdcm.mgmt.cli        p:DEBUG > sctool output: backup/6d118b0f-2652-4258-a797-9a191b0b26f2
...
< t:2023-08-16 19:56:24,300 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo sctool  -c 4105638a-e5a7-4c68-b8c2-614904a59131 progress backup/6d118b0f-2652-4258-a797-9a191b0b26f2"...
< t:2023-08-16 19:56:24,451 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Run:            ec1624fd-3c6e-11ee-bd30-02b041669e61
< t:2023-08-16 19:56:24,451 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Status:         DONE
< t:2023-08-16 19:56:24,451 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Start time:     16 Aug 23 19:56:00 UTC
< t:2023-08-16 19:56:24,451 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > End time:       16 Aug 23 19:56:12 UTC
< t:2023-08-16 19:56:24,451 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Duration:       12s
< t:2023-08-16 19:56:24,451 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Progress:       100%
< t:2023-08-16 19:56:24,451 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Snapshot Tag:   sm_20230816195600UTC
< t:2023-08-16 19:56:24,451 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Datacenters:    
< t:2023-08-16 19:56:24,451 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >   - us-east
< t:2023-08-16 19:56:24,451 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > 
< t:2023-08-16 19:56:24,451 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > ╭─────────────┬──────────┬──────────┬──────────┬──────────────┬────────╮
< t:2023-08-16 19:56:24,451 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > │ Host        │ Progress │     Size │  Success │ Deduplicated │ Failed │
< t:2023-08-16 19:56:24,452 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > ├─────────────┼──────────┼──────────┼──────────┼──────────────┼────────┤
< t:2023-08-16 19:56:24,452 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > │ 10.12.0.108 │     100% │ 283.379M │ 283.379M │            0 │      0 │
< t:2023-08-16 19:56:24,452 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > │ 10.12.1.151 │     100% │ 283.379M │ 283.379M │            0 │      0 │
< t:2023-08-16 19:56:24,452 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > │ 10.12.1.249 │     100% │ 283.377M │ 283.377M │            0 │      0 │
< t:2023-08-16 19:56:24,452 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > ╰─────────────┴──────────┴──────────┴──────────┴──────────────┴────────╯

So I really don't know how 0fa19c7a-d21a-4bff-9f19-80c33dd2ae15 came into play.
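
One way to narrow this down (a sketch, assuming read access to the backup location via the AWS CLI; only the `backup/sst` prefix is confirmed by the error above, while `backup/meta` is where manifests normally live) would be to enumerate which cluster IDs actually have data in the bucket, and then grep the SM logs for those IDs:

```bash
# Enumerate the cluster IDs that have SSTable data in the backup location
# (prefix layout taken from the error message above).
aws s3 ls "s3://manager-backup-tests-us-east-1/backup/sst/cluster/"

# If the standard SM layout is used, manifests should also appear under backup/meta.
aws s3 ls "s3://manager-backup-tests-us-east-1/backup/meta/cluster/"
```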

Michal-Leszczynski commented 11 months ago

So from my perspective, this looks like there was a concurrent test execution or something similar. @ShlomiBalalis was this issue ever reproduced? If not, could we rerun it and see whether it still happens?

Michal-Leszczynski commented 7 months ago

@ShlomiBalalis ping

Michal-Leszczynski commented 6 months ago

@mikliapko is this something that you could take care of? I mean investigating why two clusters ended up in the same backup location.