vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Pods in `Pending` Produce Errors (1.12.2-rc1) #7135

Closed pseymournutanix closed 10 months ago

pseymournutanix commented 10 months ago

What steps did you take and what happened: Pods in Pending state (as someone has put bad taints on them) produce errors in backups.

What did you expect to happen: Non-running pods should be ignored.

[Uploading bundle-2023-11-21-13-57-28.tar.gz…]()

rnarenpujari commented 10 months ago

@pseymournutanix fyi your bundle hasn't uploaded.

blackpiglet commented 10 months ago

@rnarenpujari

I'm posting the findings from the Slack conversation here so that other contributors can also give feedback.

time="2023-11-22T19:34:55Z" level=info msg="1 errors encountered backup up item" backup=velero/manual-bk-test logSource="pkg/backup/backup.go:444" name=plants-xq-qa-consul-consul-server-0

time="2023-11-22T19:34:55Z" level=error msg="Error backing up item" backup=velero/manual-bk-test error="node name is empty" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/nodeagent/node_agent.go:57" error.function=github.com/vmware-tanzu/velero/pkg/nodeagent.IsRunningInNode logSource="pkg/backup/backup.go:448" name=plants-xq-qa-consul-consul-server-0

In some cases the backup is marked as PartiallyFailed because pods whose volume data must be backed up by the filesystem uploader don't have a node assigned yet. The error is generated here: https://github.com/vmware-tanzu/velero/blob/7320bb76744bc7052d839644fcbe34eb746a0f20/pkg/podvolume/backupper.go#L173

The question is whether the Velero server needs to fail the backup at this point. A less strict check follows it: when the pod is not in the Running state, the function returns without an error. https://github.com/vmware-tanzu/velero/blob/7320bb76744bc7052d839644fcbe34eb746a0f20/pkg/podvolume/backupper.go#L223

I think the Velero server can log some information without returning an error when the pod doesn't have a node name too.

@reasonerjt Please take a look.

pseymournutanix commented 10 months ago

bundle-2023-11-21-13-57-28.tar.gz

blackpiglet commented 10 months ago

name: /canaveral-analytics-prometheus-565f54bbf8-8zwmj error: /pod volume backup failed: data path backup failed: Failed to run kopia backup: Unable to read dir in path /host_pods/f3f91a0a-4f7a-499f-918e-7681825e33e6/volumes/kubernetes.io~csi/pvc-dc7341b6-d500-4b41-a7d0-b80870d9dea8/mount: open /host_pods/f3f91a0a-4f7a-499f-918e-7681825e33e6/volumes/kubernetes.io~csi/pvc-dc7341b6-d500-4b41-a7d0-b80870d9dea8/mount: input/output error
name: /canaveral-analytics-pushgateway-8c884d5c-tkfv2 error: /pod volume backup failed: data path backup failed: Failed to run kopia backup: Unable to read dir in path /host_pods/69d66904-5573-4c0b-a79f-e7b2964a5cf8/volumes/kubernetes.io~csi/pvc-f4d5f669-1764-496c-bf01-0e655fc850b9/mount: open /host_pods/69d66904-5573-4c0b-a79f-e7b2964a5cf8/volumes/kubernetes.io~csi/pvc-f4d5f669-1764-496c-bf01-0e655fc850b9/mount: input/output error
name: /canaveral-config-store-867d947fbc-hb2vj error: /node name is empty
name: /canaveral-onboarding-new-77fb56c49f-5jtn7 error: /node name is empty

The backup has four errors. Two of them are the empty node name issue discussed above. The other two are failures to read the mount directory, with an input/output error. Which provider runs your k8s environment? It looks like a hardware issue. https://unix.stackexchange.com/questions/39905/input-output-error-when-accessing-a-directory

pseymournutanix commented 10 months ago

Thank you. Yes, I fixed the volume errors (it's Nutanix, BTW); for those I would expect a failure condition :)

yanggangtony commented 10 months ago

@blackpiglet

> I think the Velero server can log some information without returning an error when the pod doesn't have a node name too.

I see the log already reports `/node name is empty`. What do you mean by "log some information"?

reasonerjt commented 10 months ago

Per discussion with @Lyndon-Li, we move the chunk at https://github.com/vmware-tanzu/velero/blob/fea22bbbc9faf5a28c2570313df715ffb5721d11/pkg/podvolume/backupper.go#L223 to before the check of the node agent's status, to avoid passing an empty node name to that function.
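
For readers following along, the reordering can be sketched roughly as below. This is a minimal illustration, not the actual Velero code: `Pod`, `isRunningInNode`, and `backupPodVolumes` are hypothetical stand-ins, the error text for a missing agent is invented, and only the "node name is empty" message comes from the log above. The point is the order of the two checks: the pod phase is tested first, so a Pending pod is skipped before its (still empty) node name reaches the node-agent lookup.

```go
package main

import (
	"errors"
	"fmt"
)

// Pod is a hypothetical stand-in for the few pod fields used here.
type Pod struct {
	Name     string
	Phase    string // e.g. "Pending", "Running"
	NodeName string // empty until the scheduler assigns a node
}

// isRunningInNode is a hypothetical stand-in for the node-agent lookup.
// Like the check in the log above, it rejects an empty node name.
func isRunningInNode(nodeName string, agents map[string]bool) error {
	if nodeName == "" {
		return errors.New("node name is empty")
	}
	if !agents[nodeName] {
		return fmt.Errorf("node agent is not running on node %s", nodeName)
	}
	return nil
}

// backupPodVolumes returns (skipped, err). The phase check runs first, so a
// non-running pod is skipped with a message before its node name is used.
func backupPodVolumes(pod Pod, agents map[string]bool) (bool, error) {
	if pod.Phase != "Running" {
		fmt.Printf("skipping pod %s: phase is %s\n", pod.Name, pod.Phase)
		return true, nil
	}
	if err := isRunningInNode(pod.NodeName, agents); err != nil {
		return false, err
	}
	// The volume backup itself would happen here.
	return false, nil
}

func main() {
	agents := map[string]bool{"node-1": true}
	pending := Pod{Name: "consul-server-0", Phase: "Pending"}
	skipped, err := backupPodVolumes(pending, agents)
	fmt.Println(skipped, err)
}
```

With the checks in the opposite order, the same Pending pod would hit the empty-node-name error and the backup would end up PartiallyFailed, which is the behavior reported in this issue.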