Restoring a file system backup to a different cluster failed due to Kopia snapshot not found

RaniaMidaoui commented 1 month ago

What steps did you take and what happened:

I am creating a file system backup of a particular namespace in a K8s cluster and restoring it to another cluster, but the Restore is stuck in "In Progress" and fails after a timeout. (I am also backing up and restoring the Pod to which the volume is mounted, along with some Secrets and ConfigMaps.)

The backup is stored in an S3 bucket and I made sure that the same bucket is linked to the new cluster.

After investigating, I can see that the PodVolumeRestore failed with the error: `data path restore failed: Failed to run kopia restore: Unable to load snapshot 2e97d1c5b03468f979e3143149d46239: snapshot not found`

What did you expect to happen: Restore to complete without an issue.

The following information will help us better understand what's going on:

The error logs from the Velero pod and the node agents are the following:


velero-64d44bf455-zcq96 velero  time="2024-07-15T09:05:09Z" level=info msg="Found 95 backups in the backup location that do not exist in the cluster and need to be synced" backupLocation=velero/default controller=backup-sync logSource="pkg/controller/backup_sync_controller.go:136"

...

velero-64d44bf455-zcq96 velero time="2024-07-15T09:05:09Z" level=info msg="Attempting to sync backup into cluster" backup=school-0000-backup-20240711220015 backupLocation=velero/default controller=backup-sync logSource="pkg/controller/backup_sync_controller.go:144"

....

velero-64d44bf455-zcq96 velero time="2024-07-15T09:07:09Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:126"

time="2024-07-15T09:07:11Z" level=info msg="starting restore" logSource="pkg/controller/restore_controller.go:535" restore=velero/school-0000-restore-r6ktt

....

velero-64d44bf455-zcq96 velero time="2024-07-15T09:07:11Z" level=info msg="No repository found, creating one" backupLocation=default logSource="pkg/repository/ensurer.go:89" repositoryType=kopia volumeNamespace=school-0000

...

velero-64d44bf455-zcq96 velero time="2024-07-15T09:07:11Z" level=info msg="Initializing backup repository" backupRepo=velero/school-0000-default-kopia-8s97q logSource="pkg/controller/backup_repository_controller.go:216"

velero-64d44bf455-zcq96 velero time="2024-07-15T09:07:11Z" level=info msg="Set matainenance according to repository suggestion" frequency=1h0m0s logSource="pkg/controller/backup_repository_controller.go:263"

velero-64d44bf455-zcq96 velero time="2024-07-15T09:07:11Z" level=info msg="the managed fields for school-0000/ldap-main-0 is patched" logSource="pkg/restore/restore.go:1714" restore=velero/school-0000-restore-r6ktt

....

velero-64d44bf455-zcq96 velero time="2024-07-15T09:07:29Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="pod volume restore failed: data path restore failed: Failed to run kopia restore: Unable to load snapshot 2e97d1c5b03468f979e3143149d46239: snapshot not found" logSource="pkg/restore/restore.go:1891" restore=velero/school-0000-restore-r6ktt

- The BackupStorageLocation is Available

**Anything else you would like to add:**
Restoring the backup to the same cluster it was taken from works with no issues; the failure only happens when I restore to a different cluster.

**Environment:**

- Velero version (use `velero version`):

Client:
    Version: v1.13.2
    Git commit: -
Server:
    Version: v1.13.0


- Velero features (use `velero client config get features`): 
`features: <NOT SET>`

- Kubernetes version (use `kubectl version`):

Client Version: v1.29.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.8

Lyndon-Li commented 1 month ago

> Unable to load snapshot 2e97d1c5b03468f979e3143149d46239: snapshot not found

This means that the Kopia uploader could not find the snapshot in the object store location specified in the BSL. Please double-check the objects in the object store where the Kopia repository data is stored, as indicated by the BSL, and make sure the BSLs in the source cluster and the destination cluster point to the same object store location.
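The comparison suggested above can be sketched in a few lines. This is only an illustration, not part of Velero: it assumes each BackupStorageLocation has been exported as JSON (for example with `kubectl get backupstoragelocation default -n velero -o json` on each cluster) and loaded into a dict. The helper names `storage_identity` and `diff_locations` are hypothetical, and the `region` and `s3Url` config keys are assumed to be the relevant ones for an S3-backed location.

```python
def storage_identity(bsl: dict) -> dict:
    """Pick the BackupStorageLocation fields that decide where data lives."""
    spec = bsl.get("spec") or {}
    storage = spec.get("objectStorage") or {}
    config = spec.get("config") or {}
    return {
        "provider": spec.get("provider"),
        "bucket": storage.get("bucket"),
        # Normalise the prefix so "backups" and "backups/" compare equal.
        "prefix": (storage.get("prefix") or "").strip("/"),
        "region": config.get("region"),
        "s3Url": config.get("s3Url"),
    }


def diff_locations(source_bsl: dict, target_bsl: dict) -> dict:
    """Return {field: (source_value, target_value)} for every field that differs."""
    src, dst = storage_identity(source_bsl), storage_identity(target_bsl)
    return {key: (src[key], dst[key]) for key in src if src[key] != dst[key]}


# Hypothetical values for illustration only; load the real JSON in practice.
source = {"spec": {"provider": "aws",
                   "objectStorage": {"bucket": "example-bucket", "prefix": "backups/"},
                   "config": {"region": "eu-central-1"}}}
target = {"spec": {"provider": "aws",
                   "objectStorage": {"bucket": "example-bucket", "prefix": "other"},
                   "config": {"region": "eu-central-1"}}}

print(diff_locations(source, target))
```

An empty result means the compared fields agree. It does not prove that both clusters reach the same physical bucket, since credentials and endpoints outside these fields can still differ.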

Lyndon-Li commented 1 month ago

> Restore is stuck in "In Progress" and it fails after timeout

If the error is `Unable to load snapshot`, the restore should fail immediately. Please share the entire debug bundle by running `velero debug` so that we can troubleshoot further.

RaniaMidaoui commented 1 month ago

@Lyndon-Li Thank you for your response, here is the bundle you requested: bundle-2024-07-17-10-46-43.tar.gz

Another update: we checked with the Kopia CLI and we can't find the snapshot either, but the cluster is connected to the right backup bucket and the BackupStorageLocation is listed as Available.

Lyndon-Li commented 1 month ago

> we checked with Kopia CLI and we can't find the snapshot either

Since you have connected to the kopia repo, could you run `kopia repo status`, `kopia snapshot list --all`, and `kopia content stats`, and share the outputs?

RaniaMidaoui commented 1 month ago

@Lyndon-Li sure.

rania.midaoui@MBP-Rania-Midaoui.local:~ $ kopia snapshot list --all

rania.midaoui@MBP-Rania-Midaoui.local:~ $ kopia repo status
Config file:         /Users/rania.midaoui/Library/Application Support/kopia/repository.config

Description:         Repository in S3: <our_url>
Hostname:            mbp-rania-midaoui
Username:            rania.midaoui
Read-only:           false
Format blob cache:   15m0s

Storage type:        s3
Storage capacity:    unbounded
Storage config:      {
                       "bucket": "de-instncs-0001-backup",
                       "prefix": "kopia/school-0031/",
                       "endpoint": "<endpoint>",
                       "accessKeyID": "<our_access_id>",
                       "secretAccessKey": "****************************************",
                       "sessionToken": ""
                     }

Unique ID:           <UID>
Hash:                <HASH>
Encryption:          AES256-GCM-HMAC-SHA256
Splitter:            DYNAMIC-4M-BUZHASH
Format version:      3
Content compression: true
Password changes:    true
Max pack length:     21 MB
Index Format:        v2

Epoch Manager:       enabled
Current Epoch: 0

Epoch refresh frequency: 20m0s
Epoch advance on:        20 blobs or 10.5 MB, minimum 24h0m0s
Epoch cleanup margin:    4h0m0s
Epoch checkpoint every:  7 epochs

rania.midaoui@MBP-Rania-Midaoui.local:~ $ kopia content stats
Count: 1
Total Bytes: 276 B
Average: 276 B
Histogram:

        0 between 0 B and 10 B (total 0 B)
        0 between 10 B and 100 B (total 0 B)
        1 between 100 B and 1 KB (total 304 B)
        0 between 1 KB and 10 KB (total 0 B)
        0 between 10 KB and 100 KB (total 0 B)
        0 between 100 KB and 1 MB (total 0 B)
        0 between 1 MB and 10 MB (total 0 B)
        0 between 10 MB and 100 MB (total 0 B)
rania.midaoui@MBP-Rania-Midaoui.local:~ $

Lyndon-Li commented 1 month ago

From the above output, the repo is empty. If the restore in the source cluster works well, the repo data must be there, so most probably the target cluster is referring to the wrong location.
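One way to check which location a connected repository refers to is to compare the `prefix` reported by `kopia repo status` with the prefix expected for the namespace being restored. The sketch below assumes that Velero places each Kopia repository under `<BSL prefix>/kopia/<volume namespace>/`; that layout is an assumption here, although it is consistent with the `kopia/school-0031/` prefix in the output above. The function names are hypothetical. Note that the output above shows `school-0031`, while the failing restore logged `volumeNamespace=school-0000`; the thread does not say whether the two refer to the same test.

```python
def expected_kopia_prefix(volume_namespace: str, bsl_prefix: str = "") -> str:
    """Build the assumed repository prefix: <BSL prefix>/kopia/<namespace>/."""
    parts = [p for p in (bsl_prefix.strip("/"), "kopia", volume_namespace) if p]
    return "/".join(parts) + "/"


def prefix_matches(reported_prefix: str, volume_namespace: str, bsl_prefix: str = "") -> bool:
    """Compare the prefix from `kopia repo status` with the expected one."""
    expected = expected_kopia_prefix(volume_namespace, bsl_prefix)
    return reported_prefix.strip("/") == expected.strip("/")


# Values taken from the thread: the restore log and the `kopia repo status` output.
print(expected_kopia_prefix("school-0000"))
print(prefix_matches("kopia/school-0031/", "school-0000"))
```

If the two prefixes differ, an empty listing only shows that this particular repository holds no snapshots, not that the backup data is missing from the bucket.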

RaniaMidaoui commented 1 month ago

@Lyndon-Li I retried with a new backup and made sure to connect the right bucket to the cluster where I restore. I verified the BackupStorageLocation: it's the same as in the other cluster and it is reported as available. Even when I run `velero backup get`, I get the right backups. The error is still the same.

Another thing: when I connect to the bucket and list Kopia snapshots, I still don't find anything; it's empty.

Lyndon-Li commented 1 month ago

> when I connect to the bucket and list Kopia snapshots, I still don't find anything, it's empty.

What do you see in this bucket? Do you see a kopia prefix? If so, what do you see under the kopia prefix?

RaniaMidaoui commented 1 month ago

@Lyndon-Li Deleting all the contents of the backup bucket solved the issue, but that is not a good solution, just a temporary fix to keep implementing. We cannot do this in a production environment. I don't know what exactly changed when we deleted the bucket contents, we didn't change anything else. Any ideas why this happened?

There is another error complaining about sync, similar to the one in https://github.com/kopia/kopia/issues/1938. I don't know if it is related.

Any suggestions as to what might have happened, or how to actually fix the issue from your side?

Lyndon-Li commented 1 month ago

I don't think it is related to kopia issue 1938, because there is no error in the log you shared. From the log you shared, the connection just succeeded but there is no data in the repo as if the repo was newly created in the target.

Therefore, I need some more info to understand what was happening, e.g., answers to the questions I asked in https://github.com/vmware-tanzu/velero/issues/8019#issuecomment-2242096169

RaniaMidaoui commented 1 month ago

@Lyndon-Li sure, I can see a kopia folder inside the bucket. Inside it there were only some files whose names start with `_log_`; they seem to be log files.

wkloucek commented 1 month ago

We encountered the snapshot not found error again last week, but when debugging in the beginning of this week, we couldn't reproduce it.

Maybe a short word about what we're doing: we're frequently switching between clusters while developing our backup / restore process. That means we have a source cluster where Velero is running and creating backups, and a target cluster where we do restores via Velero. The S3 bucket is only accessible by one Velero installation at a time (we can guarantee this because we use `aws s3api put-bucket-policy` with only one unique Principal). Maybe we trigger some weird caching effects during this switching back and forth.

We'll keep an eye out in case it occurs again.

What we've learnt during the debugging:

The snapshot ID referenced by Velero is actually the manifest ID of the Kopia snapshot, which can only be seen in the listing when you include the `--manifest-id` flag: `kopia snapshot list --manifest-id --all`
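Building on that finding, the ID can be pulled out of the Velero error message and searched for in the text printed by `kopia snapshot list --manifest-id --all`. This is a minimal sketch: the helper names are hypothetical, the regular expression is based only on the error text quoted in this issue, and the listing is treated as opaque text so that no particular Kopia output format is assumed.

```python
import re

# Pattern derived from the error message quoted in this issue.
SNAPSHOT_ERROR = re.compile(r"Unable to load snapshot ([0-9a-f]+): snapshot not found")


def snapshot_id_from_error(message: str):
    """Return the snapshot (manifest) ID named in the error, or None if absent."""
    match = SNAPSHOT_ERROR.search(message)
    return match.group(1) if match else None


def id_in_listing(snapshot_id: str, listing_output: str) -> bool:
    """Check whether the ID appears anywhere in the captured listing text."""
    return snapshot_id in listing_output


error = ("pod volume restore failed: data path restore failed: "
         "Failed to run kopia restore: Unable to load snapshot "
         "2e97d1c5b03468f979e3143149d46239: snapshot not found")

print(snapshot_id_from_error(error))
```

The captured output of the listing command would be passed as `listing_output`; a plain substring check is used deliberately, so it also works if the output layout changes between Kopia versions.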

Lyndon-Li commented 1 month ago

Answer for all the related problems:

- If you can see kopia repo data (e.g., `kopia snapshot list --all` does show snapshots, or `kopia content stats` shows that the repo is not empty), it may be a switchover problem. However, we don't expect this to happen and there is no known issue about it. Please collect the log bundle by running `velero debug` on both sites when the problem happens so that we can troubleshoot further.
- If you cannot see any data in the kopia repo, it may indicate that the source and target sites are not referring to the same object store location, or that your object store location has been destroyed. You need to double-check your environment and see what happened.
