vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

A few backups of pods with emptyDir volumes are stuck on an OpenShift cluster using Velero restic #5113

Closed Ankita5892 closed 2 years ago

Ankita5892 commented 2 years ago

What steps did you take and what happened: Followed the official article for Velero version v1.5.2. Below is the command used to install Velero on OpenShift:

velero install --provider aws --bucket --secret-file --backup-location-config region= --use-restic --use-volume-snapshots=false --plugins=velero/velero-plugin-for-aws:v1.1.0 --velero-pod-cpu-limit 1500m --velero-pod-cpu-request 700m --velero-pod-mem-limit 2Gi --velero-pod-mem-request 500Mi --restic-pod-cpu-limit 1500m --restic-pod-cpu-request 500m --restic-pod-mem-limit 2Gi --restic-pod-mem-request 500Mi --default-volumes-to-restic

Then applied a patch to the restic daemon set:

oc adm policy add-scc-to-user privileged -z velero -n velero
oc patch ds/restic --namespace velero --type json -p '[{"op":"add","path":"/spec/template/spec/containers/0/securityContext","value": { "privileged": true}}]'

followed https://velero.io/docs/main/restic/

OpenShift is running on AWS.

OpenShift version:
Client Version: 4.7.19
Server Version: 4.7.19

The backup is stuck InProgress: 11 items were backed up and then it stops making progress.

velero backup describe openshift-full-cluster-backup

Name:         openshift-full-cluster-backup
Namespace:    velero
Labels:       velero.io/storage-location=default
Annotations:  velero.io/source-cluster-k8s-gitversion=v1.20.0+87cc9a4
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=20

Phase:  InProgress

Errors:    0
Warnings:  0

Namespaces:
  Included:  *
  Excluded:  velero

Resources:
  Included:        *
  Excluded:
  Cluster-scoped:  auto

Label selector:

Storage Location:  default

Velero-Native Snapshot PVs:  auto

TTL:  720h0m0s

Hooks:

Backup Format Version:  1.1.0

Started:    2022-07-08 14:09:59 +0000 UTC
Completed:  <n/a>

Expiration:  2022-08-07 14:09:59 +0000 UTC

Estimated total items to be backed up:  5237
Items backed up so far:                 11

Velero-Native Snapshots:

Restic Backups (specify --details for more information):
  Completed:  1
  New:        1


Resource List:

Restic Backups:
  Completed:
    openshift-adp/openshift-adp-controller-manager-56dc9468b7-nqcgh: bound-sa-token
  New:
    openshift-cloud-credential-operator/pod-identity-webhook-7fdfd9b5d8-5j6qv: webhook-certs

In the OpenShift cluster I have 3 worker nodes:

NAME                      READY   STATUS    RESTARTS   AGE
restic-ddcbk              1/1     Running   0          7m17s
restic-kcxgt              1/1     Running   0          7m15s
restic-rmd7m              1/1     Running   0          7m14s
velero-57fbc78b8c-f5gbn   1/1     Running   0          7m44s

Logs from the velero pod:

time="2022-07-08T14:10:25Z" level=info msg="Processing item" backup=velero/openshift-full-cluster-backup logSource="pkg/backup/backup.go:378" name=pod-identity-webhook-7fdfd9b5d8-5j6qv namespace=openshift-cloud-credential-operator progress= resource=pods time="2022-07-08T14:10:25Z" level=info msg="Backing up item" backup=velero/openshift-full-cluster-backup logSource="pkg/backup/item_backupper.go:121" name=pod-identity-webhook-7fdfd9b5d8-5j6qv namespace=openshift-cloud-credential-operator resource=pods time="2022-07-08T14:10:25Z" level=info msg="Executing custom action" backup=velero/openshift-full-cluster-backup logSource="pkg/backup/item_backupper.go:327" name=pod-identity-webhook-7fdfd9b5d8-5j6qv namespace=openshift-cloud-credential-operator resource=pods time="2022-07-08T14:10:25Z" level=info msg="Executing podAction" backup=velero/openshift-full-cluster-backup cmd=/velero logSource="pkg/backup/pod_action.go:51" pluginName=velero time="2022-07-08T14:10:25Z" level=info msg="Done executing podAction" backup=velero/openshift-full-cluster-backup cmd=/velero logSource="pkg/backup/pod_action.go:77" pluginName=velero time="2022-07-08T14:10:25Z" level=info msg="Initializing restic repository" controller=restic-repository logSource="pkg/controller/restic_repository_controller.go:158" name=openshift-cloud-credential-operator-default-76lrp namespace=velero time="2022-07-08T14:10:29Z" level=info msg="Checking for existing backup locations ready to be verified; there needs to be at least 1 backup location available" controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:58" time="2022-07-08T14:10:29Z" level=info msg="Checking for existing backup locations ready to be verified; there needs to be at least 1 backup location available" controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:58" time="2022-07-08T14:10:29Z" level=info msg="No backup locations were ready to be verified" controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:120" time="2022-07-08T14:11:29Z" level=info msg="Checking for existing backup locations ready to be verified; there needs to be at least 1 backup location available" controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:58" time="2022-07-08T14:11:29Z" level=info msg="Checking for existing backup locations ready to be verified; there needs to be at least 1 backup location available" controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:58" time="2022-07-08T14:11:29Z" level=info msg="No backup locations were ready to be verified" controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:120" time="2022-07-08T14:12:29Z" level=info msg="Checking for existing backup locations ready to be verified; there needs to be at least 1 backup location available" controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:58" time="2022-07-08T14:12:29Z" level=info msg="Checking for existing backup locations ready to be verified; there needs to be at least 1 backup location available" controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:58" time="2022-07-08T14:12:29Z" level=info msg="No backup locations were ready to be verified" controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:120" I0708 14:13:28.752015 1 request.go:621] 
Throttling request took 1.044197433s, request: GET:https://145.32.0.5:443/apis/console.openshift.io/v1alpha1?timeout=32s time="2022-07-08T14:13:29Z" level=info msg="Checking for existing backup locations ready to be verified; there needs to be at least 1 backup location available" controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:58" time="2022-07-08T14:13:29Z" level=info msg="Checking for existing backup locations ready to be verified; there needs to be at least 1 backup location available" controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:58" time="2022-07-08T14:13:29Z" level=info msg="No backup locations were ready to be verified" controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:120" time="2022-07-08T14:14:29Z" level=info msg="Checking for existing backup locations ready to be verified; there needs to be at least 1 backup location available" controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:58"

What did you expect to happen: All backups should complete successfully.

The following information will help us better understand what's going on:

If you are using velero v1.7.0+: No
If you are using earlier versions: v1.5 (the velero backup describe output is shown above)


Anything else you would like to add: Backups of some namespaces completed, but backups of a few namespaces are stuck.

Environment:

Server: Version: v1.5.2

Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

Ankita5892 commented 2 years ago

restic #openshift

sseago commented 2 years ago

How long was the backup running? From the logs it looks like the backup was waiting for the BackupStorageLocation to be valid. Maybe an issue with your s3 bucket?

In any case, 1.5 is an old version of Velero. 1.9 was just released. It's probably better to try again with a newer version and see if you're still having problems.
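For reference, a rough sketch of the documented in-place upgrade flow; the image tag and container names are assumed from the Velero 1.9 upgrade docs, so adjust them to your install:

# Refresh the CRDs using a v1.9 velero CLI, then point the server and restic images at v1.9.0
velero install --crds-only --dry-run -o yaml | kubectl apply -f -
kubectl set image deployment/velero velero=velero/velero:v1.9.0 --namespace velero
kubectl set image daemonset/restic restic=velero/velero:v1.9.0 --namespace velero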

Ankita5892 commented 2 years ago

@sseago I have checked the OpenShift namespaces: most back up fine, and only the namespaces whose pods have an emptyDir volume get stuck. Those backups never finish; the status stays InProgress.

The S3 bucket is configured via --bucket and --backup-location-config region=eu-west-1.

velero get backup-location -n velero
NAME      PROVIDER   BUCKET/PREFIX   PHASE       LAST VALIDATED                  ACCESS MODE
default   aws                        Available   2022-07-08 14:22:30 +0000 UTC   ReadWrite

Below is the output showing that a backup of OpenShift namespaces completed successfully. (If the storage location were the issue, every backup would fail, but that is not the case here; only the namespaces whose pods have an emptyDir volume get stuck.)

Namespaces:
  Included:  openshift-sdn, openshift-service-ca, openshift-vsphere-infra, openshift-kube-apiserver, openshift-etcd
  Excluded:

Resources:
  Included:        *
  Excluded:
  Cluster-scoped:  auto

Label selector:

Storage Location:  default

Velero-Native Snapshot PVs:  auto

TTL:  720h0m0s

Hooks:

Backup Format Version:  1.1.0

Started:    2022-07-12 10:11:12 +0000 UTC
Completed:  2022-07-12 10:12:07 +0000 UTC

Expiration:  2022-08-11 10:11:12 +0000 UTC

Total items to be backed up:  425
Items backed up:              425

Velero-Native Snapshots:

Below is the backup that is stuck:

Estimated total items to be backed up:  5237
Items backed up so far:                 11

Resource List:

Velero-Native Snapshots:

Restic Backups:
  Completed:
    openshift-adp/openshift-adp-controller-manager-56dc9468b7-jkdhfjksdjd: bound-sa-token
  New:
    openshift-cloud-credential-operator/pod-identity-webhook-hjyfgttttddd: webhook-certs

This webhook-certs volume is of type emptyDir.

I have also upgraded to Velero 1.9 with AWS plugin 1.5 and the issue is the same. The velero pod showed no errors until the backup reached the same emptyDir volume; once it did, it started showing the error below:

level=error msg="Error updating download request" controller=download-request downloadRequest=velero/backup-1-6fd8a471-1235-494g-237f-6dd312267829 error="downloadrequests.velero.io \"backup-1-6fd8a471-1235-494g-237f-6dd312267829\" not found"

sseago commented 2 years ago

I'm not really sure what that downloadrequest is referring to. Is that the name of your backup that it seems like it can't find? Also, you mentioned that the backup was stuck, but above I'm seeing a start and completion timestamp on the backup, so I'm not really sure what's going on. Restic should support emptydir backup, though. In any case, it looks like there's one restic volume that completed, and one is still in a New state. Look at the restic pod logs for the restic pod that's on the same node that the pod mounting the volume is on -- it could be that the pod is unhealthy. Also look at the PodVolumeBackup for that pod and volume.
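For reference, a minimal way to inspect those PodVolumeBackup objects; the velero.io/backup-name label is assumed from Velero's conventions, and the backup name is the one from this issue:

# List the PodVolumeBackups created for this backup and check their Phase
kubectl -n velero get podvolumebackups -l velero.io/backup-name=openshift-full-cluster-backup

# Show the full status, including any error message, for one of them
kubectl -n velero describe podvolumebackup <podvolumebackup-name>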

Ankita5892 commented 2 years ago

@sseago Yes, as I mentioned, backups of a few namespaces completed successfully, but whichever pod has an emptyDir volume, the backup gets stuck there.

I have shared both: a backup that completed when only a few namespaces were included, and a full backup that gets stuck on the pod with the emptyDir volume.

And I can see the pod status is Running.

sseago commented 2 years ago

Were any of the pods that succeeded restic backups running on the same node as the failing pod? The PVB is in a "new" state still, which seems to indicate that Restic isn't even trying to back it up. If you look at the pod logs for the restic pod that should have processed the PVB, maybe there is some indication of what went wrong.
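For reference, a quick way to map the stuck pod to the restic pod on the same node; the pod name is the one from the logs in this issue and is illustrative:

# See which node the stuck pod is scheduled on
kubectl -n openshift-cloud-credential-operator get pod pod-identity-webhook-7fdfd9b5d8-5j6qv -o wide

# Find the restic pod on that node, then read its logs
kubectl -n velero get pods -o wide | grep restic
kubectl -n velero logs <restic-pod-on-that-node>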

Ankita5892 commented 2 years ago

Hi @sseago, yes, the pods for both the succeeded and the stuck backups are running on the same nodes.

The pod whose backup is stuck has clean logs; I don't see any error:

ubuntu@:~$ kubectl logs -f pod-identity-webhook-7fdfd9b5d8-t6s5q -n openshift-cloud-credential-operator
W0714 04:14:02.115927       1 client_config.go:551] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0714 04:14:02.129849       1 store.go:61] Fetched secret: openshift-cloud-credential-operator/pod-identity-webhook
I0714 04:14:02.130220       1 main.go:174] Creating server
I0714 04:14:02.130331       1 main.go:194] Listening on :9999 for metrics and healthz
I0714 04:14:02.130446       1 main.go:188] Listening on :6443

Ankita5892 commented 2 years ago

These are the logs from the 1.9 velero pod; once the backup gets stuck it starts showing the following:

time="2022-07-14T05:40:05Z" level=debug msg="waiting for stdio data" backup-storage-location=velero/default cmd=/plugins/velero-plugin-for-aws controller=backup-storage-location logSource="pkg/plugin/clientmgmt/logrus_adapter.go:75" pluginName=stdio time="2022-07-14T05:40:05Z" level=info msg="Validating BackupStorageLocation" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:130" time="2022-07-14T05:40:05Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:115" time="2022-07-14T05:40:05Z" level=debug msg="received EOF, stopping recv loop" backup-storage-location=velero/default cmd=/plugins/velero-plugin-for-aws controller=backup-storage-location err="rpc error: code = Unavailable desc = error reading from server: EOF" logSource="pkg/plugin/clientmgmt/logrus_adapter.go:75" pluginName=stdio time="2022-07-14T05:40:05Z" level=debug msg="plugin process exited" backup-storage-location=velero/default cmd=/plugins/velero-plugin-for-aws controller=backup-storage-location logSource="pkg/plugin/clientmgmt/logrus_adapter.go:75" path=/plugins/velero-plugin-for-aws pid=1018 time="2022-07-14T05:40:05Z" level=debug msg="plugin exited" backup-storage-location=velero/default cmd=/plugins/velero-plugin-for-aws controller=backup-storage-location logSource="pkg/plugin/clientmgmt/logrus_adapter.go:75"

Ankita5892 commented 2 years ago

The same setup works fine on Azure, AWS, and GCP clusters; I am facing this issue only with the OpenShift cluster.

sseago commented 2 years ago

@Ankita5892 Could you also supply the restic pod logs for the pod that failed backup? You need to find the restic pod that's running on the same node as the pod with failed pod volume backups. It's not clear from the above whether there were successful and failed volumes on the same node. It might be worth also looking at the restic pod logs for the restic pod on the same node as a successful volume backup, if it's not the same restic pod.

Ankita5892 commented 2 years ago

@sseago It is an OpenShift cluster, so we have 3 master and 3 worker nodes. Many pods run on both worker and master nodes, but there are only 3 restic pods and they all run on worker nodes.

And the pod whose backup is stuck (InProgress) is running on a master node.

sseago commented 2 years ago

OK. So yes, this is starting to make some more sense now. DaemonSets can't be scheduled on master nodes by default. In the OADP context, this is not normally a problem since the default openshift operators that run on master nodes are considered part of the control plane rather than user workloads, which is out-of-scope of the supported OADP use cases. You might be able to succeed in a restic backup of master-node volumes by modifying node taints or using a custom node selector to force the restic DaemonSet onto master nodes, but it's not a scenario that we've tested, and there's no guarantee you won't hit other problems when doing this.
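As an untested sketch of that idea (the taint key assumes a default OpenShift 4.x install and the patch assumes the restic DaemonSet has no tolerations set yet), one could tolerate the master-node taint so restic pods are also scheduled there:

# Untested: add a toleration for the master-node taint to the restic daemon set.
# This writes the whole tolerations list, so merge manually if tolerations already exist.
oc patch ds/restic --namespace velero --type json -p '[{"op":"add","path":"/spec/template/spec/tolerations","value":[{"key":"node-role.kubernetes.io/master","operator":"Exists","effect":"NoSchedule"}]}]'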

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 2 years ago

Closing the stale issue.