Open RaniaMidaoui opened 1 month ago
Unable to load snapshot 2e97d1c5b03468f979e3143149d46239: snapshot not found
This means that the Kopia uploader could not find the snapshot in the object store location specified in the BSL. Please double-check the objects in the object store where the Kopia repository data is stored, as indicated by the BSL, and make sure the BSLs in the source cluster and the destination cluster point to the same object store location.
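A minimal sketch of that comparison, assuming both clusters are reachable as kubectl contexts and Velero is installed in the `velero` namespace with a BSL named `default` (as in the logs in this issue). The function names and the context names in the usage line are hypothetical.

```shell
#!/usr/bin/env bash
# Print "<provider> <bucket>/<prefix>" for one BSL in one cluster.
# Arguments: kube context, BSL name (default "default"), Velero namespace (default "velero").
bsl_location() {
  local ctx="$1" name="${2:-default}" ns="${3:-velero}"
  kubectl --context "$ctx" -n "$ns" get backupstoragelocation "$name" \
    -o jsonpath='{.spec.provider}{" "}{.spec.objectStorage.bucket}{"/"}{.spec.objectStorage.prefix}'
}

# Compare the BSL of two clusters; returns non-zero when they differ
# (2 when one of the lookups itself failed).
compare_bsl() {
  local src dst
  src="$(bsl_location "$1")" || return 2
  dst="$(bsl_location "$2")" || return 2
  echo "source: $src"
  echo "target: $dst"
  [ "$src" = "$dst" ]
}
```

Usage would be something like `compare_bsl source-cluster target-cluster`. This only compares provider, bucket, and prefix; values under `.spec.config` (region, S3 URL) can also differ and are worth checking by hand.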
Restore is stuck in "In Progress" and it fails after timeout
If the error is Unable to load snapshot, it should fail immediately. So please share the entire debug bundle by running velero debug, and we will further troubleshoot.
@Lyndon-Li Thank you for your response, here is the bundle you requested: bundle-2024-07-17-10-46-43.tar.gz
Another update: we checked with Kopia CLI and we can't find the snapshot either, but the cluster is connected to the right backup bucket and the BackupStorageLocation is listed as Available.
we checked with Kopia CLI and we can't find the snapshot either
Since you have connected to the kopia repo, could you run kopia repo status, kopia snapshot list --all, and kopia content stats, and share the outputs?
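For convenience, the three commands can be captured into one file. This is only a sketch: it assumes the `kopia` CLI is already connected to the repository, and the function names and default file name are made up for illustration.

```shell
#!/usr/bin/env bash
# Run one kopia subcommand and print its output under a header line.
run_section() {
  echo "### kopia $*"
  kopia "$@" 2>&1 || echo "(command exited with status $?)"
  echo
}

# Write the output of the three requested commands to a single file.
# Argument: output file (default "kopia-diagnostics.txt").
collect_kopia_diagnostics() {
  local out="${1:-kopia-diagnostics.txt}"
  {
    run_section repo status
    run_section snapshot list --all
    run_section content stats
  } > "$out"
  echo "wrote $out"
}
```

Before sharing the file, it is worth redacting the endpoint and access key ID from the `repo status` section.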
@Lyndon-Li sure.
rania.midaoui@MBP-Rania-Midaoui.local:~ $ kopia snapshot list --all
rania.midaoui@MBP-Rania-Midaoui.local:~ $ kopia repo status
Config file: /Users/rania.midaoui/Library/Application Support/kopia/repository.config
Description: Repository in S3: <our_url>
Hostname: mbp-rania-midaoui
Username: rania.midaoui
Read-only: false
Format blob cache: 15m0s
Storage type: s3
Storage capacity: unbounded
Storage config: {
"bucket": "de-instncs-0001-backup",
"prefix": "kopia/school-0031/",
"endpoint": "<endpoint>",
"accessKeyID": "<our_access_id>",
"secretAccessKey": "****************************************",
"sessionToken": ""
}
Unique ID: <UID>
Hash: <HASH>
Encryption: AES256-GCM-HMAC-SHA256
Splitter: DYNAMIC-4M-BUZHASH
Format version: 3
Content compression: true
Password changes: true
Max pack length: 21 MB
Index Format: v2
Epoch Manager: enabled
Current Epoch: 0
Epoch refresh frequency: 20m0s
Epoch advance on: 20 blobs or 10.5 MB, minimum 24h0m0s
Epoch cleanup margin: 4h0m0s
Epoch checkpoint every: 7 epochs
rania.midaoui@MBP-Rania-Midaoui.local:~ $ kopia content stats
Count: 1
Total Bytes: 276 B
Average: 276 B
Histogram:
0 between 0 B and 10 B (total 0 B)
0 between 10 B and 100 B (total 0 B)
1 between 100 B and 1 KB (total 304 B)
0 between 1 KB and 10 KB (total 0 B)
0 between 10 KB and 100 KB (total 0 B)
0 between 100 KB and 1 MB (total 0 B)
0 between 1 MB and 10 MB (total 0 B)
0 between 10 MB and 100 MB (total 0 B)
rania.midaoui@MBP-Rania-Midaoui.local:~ $
From the above output, the repo is empty. If the restore in the source cluster works well, the repo data is there, so most probably you are referring to the wrong location in the target cluster.
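The same check can be scripted roughly as follows. It assumes the `kopia` CLI is connected to the repository in question; the "at most one content" threshold is an assumption taken from the output above (Count: 1, no snapshots), not a documented Kopia rule, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Extract the "Count:" value from `kopia content stats` output read on stdin.
content_count() {
  awk '$1 == "Count:" { print $2; exit }'
}

# Succeeds when the snapshot list is empty and at most one content exists.
# The threshold of 1 is an assumption based on the output in this thread.
repo_looks_empty() {
  local snapshots count
  snapshots="$(kopia snapshot list --all 2>/dev/null)"
  count="$(kopia content stats 2>/dev/null | content_count)"
  [ -z "$snapshots" ] && [ "${count:-0}" -le 1 ]
}
```

Running `repo_looks_empty && echo "repository looks empty"` against the repository used by the source cluster and the one used by the target cluster should show whether they are really the same repository.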
@Lyndon-Li I retried with a new backup and made sure to connect the right bucket to the cluster where I restore. I verified the BackupStorageLocation: it's the same as in the other cluster and it is reported as available. Even when I run velero backup get, I get the right backups.
The error is still the same.
And another thing, when I connect to the bucket and list Kopia snapshots, I still don't find anything, it's empty.
when I connect to the bucket and list Kopia snapshots, I still don't find anything, it's empty.
What do you see in this bucket? Do you see a kopia prefix? If so, what do you see under the kopia prefix?
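One way to look under that prefix with the AWS CLI is sketched below. The layout `<BSL prefix>/kopia/<namespace>/` is an assumption that matches the prefix shown in the kopia repo status output above; `S3_ENDPOINT` and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Build the S3 URI where the Kopia repository for one namespace is assumed to live.
# Arguments: bucket, BSL prefix (may be empty), volume namespace.
kopia_repo_uri() {
  local bucket="$1" bsl_prefix="$2" ns="$3"
  bsl_prefix="${bsl_prefix#/}"   # drop one leading slash
  bsl_prefix="${bsl_prefix%/}"   # drop one trailing slash
  if [ -n "$bsl_prefix" ]; then
    echo "s3://$bucket/$bsl_prefix/kopia/$ns/"
  else
    echo "s3://$bucket/kopia/$ns/"
  fi
}

# List the objects under that prefix. Set S3_ENDPOINT (hypothetical variable)
# when the bucket is served from a custom S3-compatible endpoint.
list_kopia_objects() {
  aws ${S3_ENDPOINT:+--endpoint-url "$S3_ENDPOINT"} s3 ls "$(kopia_repo_uri "$@")"
}
```

For the restore in this issue that would be, for example, `list_kopia_objects de-instncs-0001-backup "" school-0000`, since the logs report volumeNamespace=school-0000.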
@Lyndon-Li Deleting all the contents of the backup bucket solved the issue, but that is not a good solution, just a temporary fix to keep implementing. We cannot do this in a production environment. I don't know what exactly changed when we deleted the bucket contents, we didn't change anything else. Any ideas why this happened?
There is another error complaining about sync, similar to this one in this issue: https://github.com/kopia/kopia/issues/1938 I don't know if it is related.
Any suggestions as to what might have happened or how to actually fix the issue from your side?
I don't think it is related to kopia issue 1938, because there is no error in the log you shared. From the log you shared, the connection just succeeded but there is no data in the repo as if the repo was newly created in the target.
Therefore, I do need some more info to understand what was happening, e.g., answers to the questions I asked in https://github.com/vmware-tanzu/velero/issues/8019#issuecomment-2242096169
@Lyndon-Li sure, I can see a Kopia folder inside the bucket; inside it there were only some files whose names start with _log_*, which seem to be log files.
We encountered the snapshot not found error again last week, but when debugging at the beginning of this week, we couldn't reproduce it.
Maybe a short word about what we're doing: we're frequently switching between clusters while developing our backup/restore process. That means we have a source cluster where velero is running and creating backups, and a target cluster where we do restores via velero. The S3 bucket is only accessible by one velero installation at a time (we can guarantee this because we use aws s3api put-bucket-policy with only one unique Principal). Maybe we trigger some weird caching effects during this switching back and forth.
We'll still be attentive in case it occurs another time.
What we've learnt during the debugging:
kopia snapshot list --manifest-id --all
Answer for all the related problems:
If, when running kopia snapshot list --all, you do see snapshots, or when running kopia content stats you see the repo is not empty, it may be a switchover problem. However, we don't expect this to happen and there is no known issue about it. Please collect the log bundle by running velero debug on both sites when the problem happens so that we can further troubleshoot.
What steps did you take and what happened:
I am creating a file system backup from a particular namespace in a K8s cluster and restoring it to another cluster. But the Restore is stuck in "In Progress" and it fails after timeout (I am also backing up and restoring the Pod to which the volume is mounted, along with some Secrets and ConfigMaps).
The backup is stored in an S3 bucket and I made sure that the same bucket is linked to the new cluster.
After investigating, I can see that for some reason, the PodVolumeRestore failed with the error:
data path restore failed: Failed to run Kopia restore: Unable to load snapshot 2e97d1c5b03468f979e3143149d46239: snapshot not found
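For anyone hitting the same error, here is a rough sketch of how the failed PodVolumeRestore objects and the snapshot ID could be pulled out. It assumes Velero runs in the `velero` namespace; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# List failed PodVolumeRestores as: name, snapshot ID, status message.
# Argument: Velero namespace (default "velero").
failed_pod_volume_restores() {
  local ns="${1:-velero}"
  kubectl -n "$ns" get podvolumerestores \
    -o jsonpath='{range .items[?(@.status.phase=="Failed")]}{.metadata.name}{"\t"}{.spec.snapshotID}{"\t"}{.status.message}{"\n"}{end}'
}

# Extract the snapshot ID from an "Unable to load snapshot <id>: ..." message on stdin.
snapshot_id_from_error() {
  sed -n 's/.*Unable to load snapshot \([0-9a-f][0-9a-f]*\):.*/\1/p'
}
```

The extracted ID is the value to look for in the output of kopia snapshot list --manifest-id --all on the repository the source cluster writes to.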
What did you expect to happen: Restore to complete without an issue.
The following information will help us better understand what's going on:
...
velero-64d44bf455-zcq96 velero time="2024-07-15T09:05:09Z" level=info msg="Attempting to sync backup into cluster" backup=school-0000-backup-20240711220015 backupLocation=velero/default controller=backup-sync logSource="pkg/controller/backup_sync_controller.go:144"
....
velero-64d44bf455-zcq96 velero time="2024-07-15T09:07:09Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:126"
time="2024-07-15T09:07:11Z" level=info msg="starting restore" logSource="pkg/controller/restore_controller.go:535" restore=velero/school-0000-restore-r6ktt
....
velero-64d44bf455-zcq96 velero time="2024-07-15T09:07:11Z" level=info msg="No repository found, creating one" backupLocation=default logSource="pkg/repository/ensurer.go:89" repositoryType=kopia volumeNamespace=school-0000
...
velero-64d44bf455-zcq96 velero time="2024-07-15T09:07:11Z" level=info msg="Initializing backup repository" backupRepo=velero/school-0000-default-kopia-8s97q logSource="pkg/controller/backup_repository_controller.go:216"
velero-64d44bf455-zcq96 velero time="2024-07-15T09:07:11Z" level=info msg="Set matainenance according to repository suggestion" frequency=1h0m0s logSource="pkg/controller/backup_repository_controller.go:263"
velero-64d44bf455-zcq96 velero time="2024-07-15T09:07:11Z" level=info msg="the managed fields for school-0000/ldap-main-0 is patched" logSource="pkg/restore/restore.go:1714" restore=velero/school-0000-restore-r6ktt
....
velero-64d44bf455-zcq96 velero time="2024-07-15T09:07:29Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="pod volume restore failed: data path restore failed: Failed to run kopia restore: Unable to load snapshot 2e97d1c5b03468f979e3143149d46239: snapshot not found" logSource="pkg/restore/restore.go:1891" restore=velero/school-0000-restore-r6ktt
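The excerpt above shows "No repository found, creating one" and "Initializing backup repository" at 09:07:11, shortly before the snapshot lookup fails. A small sketch for pulling these lines out of a saved Velero log follows; the function names are hypothetical, and the check does not verify the order of the messages.

```shell
#!/usr/bin/env bash
# Print log lines about repository creation/initialization and snapshot lookup failures.
# Reads the given files, or stdin when no file is given.
repo_events() {
  grep -E 'No repository found, creating one|Initializing backup repository|snapshot not found' "$@"
}

# Succeeds when the log contains both a repository creation message
# and a "snapshot not found" error (order is not checked).
repo_created_and_snapshot_missing() {
  local log
  log="$(cat "$@")"
  printf '%s\n' "$log" | grep -q 'No repository found, creating one' &&
    printf '%s\n' "$log" | grep -q 'snapshot not found'
}
```

For example: `kubectl -n velero logs deploy/velero > velero.log`, then `repo_events velero.log`. Listing the BackupRepository objects with `kubectl -n velero get backuprepositories` on both clusters may also help when comparing them.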
Client:
  Version: v1.13.2
  Git commit: -
Server:
  Version: v1.13.0
Client Version: v1.29.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.8