Open navilg opened 2 months ago
After a Velero pod restart it starts working fine, but after a few days it starts failing again.
Please share the Velero log bundle by running velero debug.
Thanks @Lyndon-Li Attached.
Could you try with v1.14.1? It seems related to the PodVolumeBackup's event handler not being removed after the PodVolumeBackup completes. That is fixed in release-1.14.
Thanks @blackpiglet. Upgrading Velero to 1.14.0 is in my plan. Is the same fix available in that version as well?
It should be, but v1.14.1 contains some other fixes. It's better to use the latest patch release.
After investigation, the 24-hour timeout issue was not caused by the PodVolumeBackup event handler de-registration issue.
time="2024-09-16T14:00:29Z" level=info msg="pod eim1/otawg-deployment-bf74b684f-2wbc8 has volumes to backup: [apps-volume]" backup=velero/sch-backup-eim1-prod-daily-20240916135436 logSource="pkg/podvolume/backupper.go:174" name=otawg-deployment-bf74b684f-2wbc8 namespace=eim1 resource=pods
time="2024-09-17T14:04:59Z" level=warning msg="volume apps-volume is declared in pod eim1/otawg-deployment-bf74b684f-2wbc8 but not mounted by any container, skipping" backup=velero/sch-backup-eim1-prod-daily-20240916135436 logSource="pkg/podvolume/backupper.go:284" name=otawg-deployment-bf74b684f-2wbc8 namespace=eim1 resource=pods
The 24-hour gap happens here.
The Velero server pod took 24 hours to determine that the volume apps-volume should not be handled by PodVolumeBackup.
I haven't found any logic that would consume that much time yet.
@navilg Could you help check what happened between 2024-09-16T14:00:29Z and 2024-09-17T14:04:59Z in the k8s environment?
Let me take a look at that volume.
apps-volume is an emptyDir volume. I will test after excluding this volume from the backup. Are there any known issues with backing up emptyDir volumes? Also, is there a way to exclude all emptyDir volumes from backup other than annotating each pod that uses such a volume?
PodVolumeBackup can work with the emptyDir type of volume.
Yes, there is a way to exclude volumes from backup by type: https://velero.io/docs/v1.12/resource-filtering/#resource-policies
@blackpiglet Thanks. Using resource policies I don't see a way to exclude emptyDir volumes.
I excluded it by adding a pod annotation, but even after excluding apps-volume I still see the same issue. Here is the latest bundle.
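For reference, the annotation-based exclusion mentioned above looks roughly like the sketch below. Only the annotation key backup.velero.io/backup-volumes-excludes and the volume name apps-volume come from this thread; the pod name, image, and mount path are illustrative placeholders.

apiVersion: v1
kind: Pod
metadata:
  name: example-pod                     # illustrative name, not from the issue
  namespace: eim1
  annotations:
    # Velero's file system backup skips the volumes listed in this annotation.
    backup.velero.io/backup-volumes-excludes: apps-volume
spec:
  containers:
    - name: app
      image: nginx                      # placeholder image
      volumeMounts:
        - name: apps-volume
          mountPath: /data              # illustrative mount path
  volumes:
    - name: apps-volume
      emptyDir: {}

For a Deployment, the annotation goes on the pod template (spec.template.metadata.annotations) so every pod created from it carries it.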
In the v1.14 doc you can see the following; you'd just configure emptyDir and nothing else (see the policy ConfigMap sketch after this snippet).
- conditions:
    volumeTypes:
      - emptyDir
      - downwardAPI
      - configmap
      - cinder
  action:
    type: skip
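Putting that together, a minimal sketch of how this could be wired up, assuming the v1.14 volume policy format from the doc linked above (the file name resource-policies.yaml and ConfigMap name skip-emptydir are illustrative):

# resource-policies.yaml -- skip every emptyDir volume during backup
version: v1
volumePolicies:
  - conditions:
      volumeTypes:
        - emptyDir
    action:
      type: skip

The policy file is stored in a ConfigMap in the Velero namespace (for example, kubectl create configmap skip-emptydir -n velero --from-file=resource-policies.yaml) and then referenced from the backup or schedule, e.g. via the --resource-policies-configmap flag of velero backup create. Note that volumeTypes-based policies are described in the v1.14 docs, so this would first need an upgrade from 1.12.3.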
Thanks, I got it. I had checked the 1.12 doc.
If resolved, please close the issue.
@kaovilai Even after excluding the apps-volume from the backup I see the same issue. The backup bundle is attached in my previous comment. Would you please check the bundle and let me know?
Did you try excluding using volumeTypes instead or not?
No. Does it make any difference to the backup? I am currently excluding with pod annotations.
After checking the newly uploaded debug bundle, I think this is a limitation of the current implementation of the file system backup.
This is the Velero server pod log where the long pause that caused the timeout happened.
time="2024-09-28T10:16:40Z" level=info msg="pod eim1/otac-0 has volumes to backup: [otac-dv otac-sd otac-config]" backup=velero/sch-backup-eim1-prod-daily-20240928101258 logSource="pkg/podvolume/backupper.go:174" name=otac-0 namespace=eim1 resource=pods
time="2024-09-30T10:10:17Z" level=info msg="1 errors encountered backup up item" backup=velero/sch-backup-eim1-prod-daily-20240928101258 logSource="pkg/backup/backup.go:444" name=otac-0
time="2024-09-30T10:10:17Z" level=error msg="Error backing up item" backup=velero/sch-backup-eim1-prod-daily-20240928101258 error="timed out waiting for all PodVolumeBackups to complete" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/podvolume/backupper.go:317" error.function="github.com/vmware-tanzu/velero/pkg/podvolume.(*backupper).BackupPodVolumes" logSource="pkg/backup/backup.go:448" name=otac-0
And the related BackupRepository failed to run maintenance.
{
    "apiVersion": "velero.io/v1",
    "kind": "BackupRepository",
    "metadata": {
        "creationTimestamp": "2024-09-03T07:13:01Z",
        "generateName": "eim1-default-kopia-",
        "generation": 311,
        "labels": {
            "velero.io/repository-type": "kopia",
            "velero.io/storage-location": "default",
            "velero.io/volume-namespace": "eim1"
        },
        ......
        "name": "eim1-default-kopia-zww4r",
        "namespace": "velero",
        "resourceVersion": "457675105",
        "uid": "260a78ad-483f-4b71-b0f0-4ca48e47630b"
    },
    "spec": {
        "backupStorageLocation": "default",
        "maintenanceFrequency": "1h0m0s",
        "repositoryType": "kopia",
        "resticIdentifier": "gs:velero-eim1-prod-2:/restic/eim1",
        "volumeNamespace": "eim1"
    },
    "status": {
        "lastMaintenanceTime": "2024-09-28T10:12:17Z",
        "message": "error to prune backup repo: error to maintain repo: error to run maintenance under mode auto: snapshot GC failure: error running snapshot gc: unable to find in-use content ID: error processing snapshot root: error reading directory: unable to open object: k6ae37f611abc4bf06add2ad6d7a43ae2: unexpected content error: error getting cached content: failed to get blob with ID qe4ce25b8af5076a2f80f8c0e84e1c584-sa8e194a2af49df9312c: invalid blob offset or length",
        "phase": "Ready"
    }
},
From the code, I think the long pause was caused by ensuring the BackupRepository. https://github.com/vmware-tanzu/velero/blob/684f71306e9c2fda204a16cb012dc209523cfae1/pkg/podvolume/backupper.go#L170-L190
In the repository-ensuring function, if the lock cannot be acquired, the code hangs until it times out or succeeds in getting the lock.
What steps did you take and what happened:
A scheduled backup is running. The backup gets stuck for 24 hours and fails with a timeout error. In the backup description I see some pod volume backups completed but many missing; they are not even listed in the Failed or New section of the description.
Logs:
What did you expect to happen:
Backup to work fine.
The following information will help us better understand what's going on:
If you are using velero v1.7.0+:
Please use
velero debug --backup <backupname> --restore <restorename>
to generate the support bundle and attach it to this issue. For more options, please refer to velero debug --help
If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)
kubectl logs deployment/velero -n velero
velero backup describe <backupname>
or kubectl get backup/<backupname> -n velero -o yaml
velero backup logs <backupname>
velero restore describe <restorename>
or kubectl get restore/<restorename> -n velero -o yaml
velero restore logs <restorename>
Anything else you would like to add:
Environment:
Velero version (use velero version): 1.12.3
Velero features (use velero client config get features): None
Kubernetes version (use kubectl version): 1.28
OS (e.g. from /etc/os-release): Ubuntu 22.04 with containerd
Vote on this issue!
This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.