vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Velero attempts pod volume backups on nodes where restic is not scheduled #4752

Closed rnarenpujari closed 2 years ago

rnarenpujari commented 2 years ago

What steps did you take and what happened: Installed Velero on a VMware TKGm-provisioned cluster with restic enabled, then took a full cluster backup using --default-volumes-to-restic. Note that there is a pod with an emptyDir volume on the master node. The backup remains in the InProgress phase for 4 hours, stuck at 16/1521 items, after which it bails out and ends up PartiallyFailed.
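Roughly the commands involved (the provider/bucket/credentials flags here are placeholders, not the exact values used on this cluster):

```sh
# Install Velero with restic enabled -- provider/bucket/credentials are placeholders
velero install \
  --provider <provider> \
  --bucket <bucket> \
  --secret-file ./credentials-velero \
  --use-restic

# Take a full-cluster backup, defaulting all pod volumes to restic
velero backup create full-cluster --default-volumes-to-restic

# The backup then sits in InProgress
velero backup describe full-cluster --details
```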

What did you expect to happen: The backup to complete successfully within a reasonable amount of time.

There is no restic pod scheduled on the master node (since this node is tainted on most k8s clusters), which prevents the backup of pod volumes on that node and in turn stalls the overall backup.

AFAICT velero makes no attempt to tolerate the default taint on the master node for the restic daemonset - so there is no guarantee that restic is on every node. At the same time, velero will attempt to use restic to back up pod volumes on any node in the cluster. So the --default-volumes-to-restic option seems a bit broken - it appears you would either need to a) guarantee restic is scheduled on every node or b) skip the pod volumes on nodes where a restic pod isn't scheduled. Of course, with the latter there is the question whether you can consider a backup complete if you've skipped volumes - although I imagine that's preferred over hanging and eventually failing.
And I don't feel that annotating pods on the master node(s) to exclude their volumes is a feasible workaround since they could be part of a larger deployment. Also, in my case a restic pod wasn't scheduled on the master node due to the taint, but perhaps in other cases restic pods may not end up scheduled on certain nodes for other reasons.
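For completeness, rough sketches of both workarounds. These assume the default velero namespace, a DaemonSet named restic, and the older node-role.kubernetes.io/master taint key (newer clusters may use node-role.kubernetes.io/control-plane), so adjust as needed:

```sh
# a) Give the restic DaemonSet a toleration for the master taint
#    (note: this merge patch replaces any tolerations already on the pod template)
kubectl -n velero patch daemonset restic --type merge -p '
{"spec":{"template":{"spec":{"tolerations":[
  {"key":"node-role.kubernetes.io/master","operator":"Exists","effect":"NoSchedule"}
]}}}}'

# b) Opt specific volumes of a pod out of restic backup via annotation
kubectl -n <namespace> annotate pod <pod-name> \
  backup.velero.io/backup-volumes-excludes=<volume-name>
```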

The following information will help us better understand what's going on:

Anything else you would like to add:
Nodes: nodes.txt
Velero ns pods: velero-ns-pods.txt
Pod with volume on master node: kapp-controller-7c4b6db9-d6kgh.txt (AFAIK pods tolerating the master node taint are not uncommon)
Pod Volume Backups: pod-volume-backups.txt
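(For anyone reproducing: the attached files come from roughly this kind of query; the commands below are a reconstruction, not necessarily what was run verbatim.)

```sh
kubectl get nodes -o wide                      > nodes.txt
kubectl -n velero get pods -o wide             > velero-ns-pods.txt
kubectl -n velero get podvolumebackups -o yaml > pod-volume-backups.txt
kubectl -n <namespace> get pod kapp-controller-7c4b6db9-d6kgh -o yaml \
  > kapp-controller-7c4b6db9-d6kgh.txt
```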

Environment:

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

qiuming-best commented 2 years ago

@rnarenpujari I think for restic, nodes where it cannot be scheduled should probably cause the backup to be PartiallyFailed; the user should then either fix the scheduling or avoid backing up those resources. But it should not be stuck for such a long time. I'll check the code to find out whether there is a way to fail quickly.

qiuming-best commented 2 years ago

@rnarenpujari From the backup-logs.txt you provided, I found that the backup is stuck for 4h waiting for the 16th item to complete its volume backup (`timed out waiting for all PodVolumeBackups to complete`). The 4h duration is the default timeout (--restic-timeout) for restic backup. You can check the doc here.
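If you need a different window, --restic-timeout is a velero server flag, so it is set on the server deployment, e.g. (a sketch, assuming the default velero namespace and deployment name):

```sh
# Check the current server args
kubectl -n velero get deployment velero \
  -o jsonpath='{.spec.template.spec.containers[0].args}'

# Then edit the deployment and add/adjust the flag under the velero container args, e.g.
#   - --restic-timeout=1h
kubectl -n velero edit deployment velero
```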

rnarenpujari commented 2 years ago

> @rnarenpujari I think for restic, nodes where it cannot be scheduled should probably cause the backup to be PartiallyFailed; the user should then either fix the scheduling or avoid backing up those resources. But it should not be stuck for such a long time. I'll check the code to find out whether there is a way to fail quickly.

Hi @qiuming-best. Yeah, I guess it's not OK to just skip pod volumes and mark the backup as complete, so perhaps PartiallyFailed is the correct handling here. But I still feel it's not that uncommon for pod volumes to exist on the master node; I see similar reports in #2967.

> @rnarenpujari From the backup-logs.txt you provided, I found that the backup is stuck for 4h waiting for the 16th item to complete its volume backup (`timed out waiting for all PodVolumeBackups to complete`). The 4h duration is the default timeout (--restic-timeout) for restic backup. You can check the doc here.

Oh, interesting. But in this case it's waiting 4 hours for an event that will never happen. Is it not possible to detect that the restic pod will never be scheduled and bail out immediately? Reducing the timeout may cause other backups that legitimately take longer to fail.
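For reference, this is roughly how I confirmed the mismatch (assuming the default velero namespace; the name=restic pod label may differ between Velero versions):

```sh
# All nodes in the cluster
kubectl get nodes -o name

# Nodes that actually have a restic pod scheduled
kubectl -n velero get pods -l name=restic \
  -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}'
```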

reasonerjt commented 2 years ago

Looks like a dup of #4874? @Lyndon-Li please double-check whether it can be solved by the same approach; if so, we can close this one.

Lyndon-Li commented 2 years ago

The fix for #4874 should also solve this problem. Briefly speaking, with that fix, Velero's pod volume processing will skip backing up/restoring the affected volumes, and the backup/restore status will be PartiallyFailed. For more information, please check #4874.