vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Velero Backup partially failed because of listing the backup timeouts #6961

Open MishraIshita opened 12 months ago

MishraIshita commented 12 months ago

What steps did you take and what happened:

Velero backup is partially failing because of the following error:

time="2023-10-16T09:13:36Z" level=error msg="Error listing resources" backup=velero-prod-active/velero-backup-minute-20231016090000 error="the server was unable to return a response in the time allotted, but may still be processing the request" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/backup/item_collector.go:312" error.function="github.com/vmware-tanzu/velero/pkg/backup.(*itemCollector).getResourceItems" group=velero.io/v1 logSource="pkg/backup/item_collector.go:312" namespace= resource=backups

NAME                                  STATUS           ERRORS  WARNINGS  CREATED                        EXPIRES  STORAGE LOCATION  SELECTOR
velero-backup-minute-20231016090000   PartiallyFailed  1       0         2023-10-16 09:05:46 +0000 UTC  1h       default           <none>

What did you expect to happen: The backup should complete successfully without any issues.

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue. For more options, please refer to velero debug --help.
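For the backup in this report, assuming the install namespace velero-prod-active seen in the log above, a minimal invocation would be something like the following (there is no restore involved here, so --restore is omitted):

velero debug --backup velero-backup-minute-20231016090000 -n velero-prod-active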

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

kubectl describe backup velero-backup-minute-20231016090000 -n velero-prod-active

Name:         velero-backup-minute-20231016090000
Namespace:    velero-prod-active
Labels:       velero.io/schedule-name=velero-backup-minute
              velero.io/storage-location=default
Annotations:  velero.io/source-cluster-k8s-gitversion: v1.24.14-gke.1200
              velero.io/source-cluster-k8s-major-version: 1
              velero.io/source-cluster-k8s-minor-version: 24
API Version:  velero.io/v1
Kind:         Backup
Metadata:
  Creation Timestamp:  2023-10-16T09:00:00Z
  Generation:          284
  Resource Version:    3578463564
  UID:                 32c18f99-6902-4905-981e-79701b397a5e
Spec:
  Default Volumes To Restic:  false
  Excluded Namespaces:
    kube-system
    kube-public
    kube-node-lease
    velero-prod-active
  Excluded Resources:
    customresourcedefinitions.apiextensions.k8s.io
  Hooks:
  Included Namespaces:
    *
  Snapshot Volumes:  false
  Storage Location:  default
  Ttl:               2h0m0s
Status:
  Completion Timestamp:  2023-10-16T09:18:16Z
  Errors:                1
  Expiration:            2023-10-16T11:05:46Z
  Format Version:        1.1.0
  Phase:                 PartiallyFailed
  Progress:
    Items Backed Up:  244408
    Total Items:      244408
  Start Timestamp:    2023-10-16T09:05:46Z
  Version:            1
Events:
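For context, the spec above comes from a scheduled backup (the velero.io/schedule-name label points at a Schedule named velero-backup-minute). A Schedule manifest producing such a Backup would look roughly like the sketch below; the cron expression is an assumption, since it is not visible in the output, while the filters, TTL and storage location are copied from the spec above:

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: velero-backup-minute          # from the velero.io/schedule-name label above
  namespace: velero-prod-active
spec:
  schedule: "0 * * * *"               # assumption; the actual cron expression is not shown in this issue
  template:
    includedNamespaces:
      - "*"
    excludedNamespaces:
      - kube-system
      - kube-public
      - kube-node-lease
      - velero-prod-active
    excludedResources:
      - customresourcedefinitions.apiextensions.k8s.io
    snapshotVolumes: false
    defaultVolumesToRestic: false
    storageLocation: default
    ttl: 2h0m0s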

Anything else you would like to add:

Environment:

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

reasonerjt commented 11 months ago

@ishitamishra13111999

This seems to be a timeout error. Could you check whether anything is wrong with your api-server when Velero's item collector lists resources?

Are you using Velero to back up the whole cluster or only selected namespaces? Could you try backing up a single namespace?

Additionally, the Velero version you are using is a bit old. Could you try a more recent release, such as v1.11, and see if the problem remains?
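For example, a one-off single-namespace test backup could be created with something like the command below; single-ns-test and test-ns are placeholder names, and -n points at the install namespace seen in this report:

velero backup create single-ns-test --include-namespaces test-ns --snapshot-volumes=false -n velero-prod-active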

sathishjeganathan commented 11 months ago

@reasonerjt we are trying to take a whole-cluster backup, and backing up a single namespace works fine without any issue. Is there a Velero setting that can help with these timeouts?

blackpiglet commented 11 months ago

@ishitamishra13111999 @sathishjeganathan First, could you try a newer version of Velero? v1.7.1 was released almost two years ago. v1.11.1 or v1.12.0 should be a better choice.
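As a rough sketch only (the authoritative steps are in the Velero upgrade documentation for the target release; the namespace and image tag below are assumptions based on this report), an in-place upgrade boils down to refreshing the CRDs with the new CLI and bumping the server image:

# Regenerate and apply the CRDs with the newer velero CLI (sketch; see the upgrade docs)
velero install --crds-only --dry-run -o yaml | kubectl apply -f -
# Point the server Deployment at the newer image (tag and namespace are assumptions)
kubectl set image deployment/velero velero=velero/velero:v1.12.0 -n velero-prod-active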

Second,

error="the server was unable to return a response in the time allotted, but may still be processing the request"

It looks like a K8s API-server-side timeout; I'm afraid it cannot be resolved on the Velero side. Please check your k8s cluster components' logs, especially etcd, for more hints.
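As a sketch of one way to check that, the same list Velero's item collector timed out on can be reproduced directly against the API server; the group and resource (velero.io/v1, backups) come from the error message above, everything else is illustrative:

# Time the cluster-wide list of Velero Backup objects that failed in the log above.
# -v=6 makes kubectl print each HTTP request and its latency.
time kubectl get backups.velero.io --all-namespaces -v=6
# If this also runs up against the API server's request timeout ("unable to return
# a response in the time allotted"), the bottleneck is on the API server / etcd side
# rather than in Velero itself.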