vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Fails to restore Ingress when restoring a namespace #5068

Open navilg opened 2 years ago

navilg commented 2 years ago

What steps did you take and what happened: [A clear and concise description of what the bug is, and what commands you ran.]

I accidentally deleted an entire namespace. When I tried to restore the namespace from a backup, all the resources were restored except the Ingresses. I am using Nginx Ingress v0.47.0.

Command run to restore:

velero create restore restore-v1.8.1 --from-backup v1.8.1 --exclude-namespaces=velero,kube-system --include-cluster-resources

Error logs:

time="2022-07-01T09:53:54Z" level=info msg="Getting client for extensions/v1beta1, Kind=Ingress" logSource="pkg/restore/restore.go:856" restore=velero/restore-v1.8.1
time="2022-07-01T09:53:54Z" level=info msg="Attempting to restore Ingress: gitlab" logSource="pkg/restore/restore.go:1217" restore=velero/restore-v1.8.1
time="2022-07-01T09:53:54Z" level=error msg="error restoring gitlab: Internal error occurred: failed calling webhook \"validate.nginx.ingress.kubernetes.io\": Post \"https://ingress-nginx-controller-admission.ethan.svc:443/networking/v1beta1/ingresses?timeout=10s\": service \"ingress-nginx-controller-admission\" not found" logSource="pkg/restore/restore.go:1287" restore=velero/restore-v1.8.1
time="2022-07-01T09:53:54Z" level=info msg="Restored 496 items out of an estimated total of 577 (estimate will change throughout the restore)" logSource="pkg/restore/restore.go:643" name=gitlab namespace=ethan progress= resource=ingresses.extensions restore=velero/restore-v1.8.1
time="2022-07-01T09:53:54Z" level=info msg="Attempting to restore Ingress: graphql" logSource="pkg/restore/restore.go:1217" restore=velero/restore-v1.8.1
time="2022-07-01T09:53:54Z" level=error msg="error restoring graphql: Internal error occurred: failed calling webhook \"validate.nginx.ingress.kubernetes.io\": Post \"https://ingress-nginx-controller-admission.ethan.svc:443/networking/v1beta1/ingresses?timeout=10s\": service \"ingress-nginx-controller-admission\" not found" logSource="pkg/restore/restore.go:1287" restore=velero/restore-v1.8.1
time="2022-07-01T09:53:54Z" level=info msg="Restored 497 items out of an estimated total of 577 (estimate will change throughout the restore)" logSource="pkg/restore/restore.go:643" name=graphql namespace=ethan progress= resource=ingresses.extensions restore=velero/restore-v1.8.1
time="2022-07-01T09:53:54Z" level=info msg="Attempting to restore Ingress: jenkins" logSource="pkg/restore/restore.go:1217" restore=velero/restore-v1.8.1
time="2022-07-01T09:53:54Z" level=error msg="error restoring jenkins: Internal error occurred: failed calling webhook \"validate.nginx.ingress.kubernetes.io\": Post \"https://ingress-nginx-controller-admission.ethan.svc:443/networking/v1beta1/ingresses?timeout=10s\": service \"ingress-nginx-controller-admission\" not found" logSource="pkg/restore/restore.go:1287" restore=velero/restore-v1.8.1
time="2022-07-01T09:53:54Z" level=info msg="Restored 499 items out of an estimated total of 577 (estimate will change throughout the restore)" logSource="pkg/restore/restore.go:643" name=jenkins namespace=ethan progress= resource=ingresses.extensions restore=velero/restore-v1.8.1
time="2022-07-01T09:53:54Z" level=info msg="Attempting to restore Ingress: keycloak" logSource="pkg/restore/restore.go:1217" restore=velero/restore-v1.8.1

What did you expect to happen:

Ingress should have been restored properly.

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle, and attach to this issue, more options please refer to velero debug --help

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

Anything else you would like to add: [Miscellaneous information that will assist in solving the issue.]

From the logs, I noticed that Ingresses are restored before Services. The Nginx Ingress admission controller's ValidatingWebhookConfiguration, ingress-nginx-admission, tries to validate each Ingress. To validate an Ingress, a POST call is made to https://ingress-nginx-controller-admission.ethan.svc:443, which is the Service's DNS name. Since the Service has not yet been restored when the Ingresses are being restored, the call cannot reach that Service DNS name and the restore fails.
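The failure described above can be reproduced in a toy model. This is only a sketch, not Velero or Kubernetes code: the `restore` function and the `(kind, name)` tuples are hypothetical, and it assumes the webhook is reachable as soon as its Service exists (in a real cluster the controller Pod behind the Service must also be running).

```python
# Toy model only: this is not Velero or Kubernetes code.
WEBHOOK_SERVICE = "ingress-nginx-controller-admission"


def restore(items):
    """Restore (kind, name) pairs in the given order.

    Returns (restored, errors). An Ingress is rejected while the
    webhook Service is missing, mirroring the error in the logs.
    """
    services = set()
    restored, errors = [], []
    for kind, name in items:
        if kind == "Ingress" and WEBHOOK_SERVICE not in services:
            # The validating webhook cannot be called, so the create fails.
            errors.append(
                f'error restoring {name}: service "{WEBHOOK_SERVICE}" not found'
            )
            continue
        if kind == "Service":
            services.add(name)
        restored.append((kind, name))
    return restored, errors


# Order seen in the logs: Ingress is attempted before the Service exists.
ingress_first = [("Ingress", "gitlab"), ("Service", WEBHOOK_SERVICE)]
# Same items with the Service restored first.
service_first = list(reversed(ingress_first))

print(restore(ingress_first)[1])
print(restore(service_first)[1])
```

With the Ingress first, the model records one error and restores only the Service; with the Service first, both items are restored.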

Environment:


Lyndon-Li commented 2 years ago

Currently, Velero restores resources in the following order:

Custom Resource Definitions
Namespaces
StorageClasses
VolumeSnapshotClass
VolumeSnapshotContents
VolumeSnapshots
PersistentVolumes
PersistentVolumeClaims
Secrets
ConfigMaps
ServiceAccounts
LimitRanges
Pods
ReplicaSets
Clusters
ClusterResourceSets

Any resources that are not in the list are restored in alphabetical order.

Therefore, in this case, the Service resources are restored after the Ingress resources, which ultimately caused the restore failure.
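The ordering rule above can be sketched as follows. This is an illustration rather than Velero's implementation: `restore_order` is a hypothetical helper, and the lowercase plural resource names are an assumption modelled on the `ingresses.extensions` name in the logs.

```python
# Illustrative priority list, abbreviated from the list above.
# The lowercase plural names are an assumption, not taken from Velero.
PRIORITIES = [
    "customresourcedefinitions",
    "namespaces",
    "storageclasses",
    "persistentvolumes",
    "persistentvolumeclaims",
    "secrets",
    "configmaps",
    "serviceaccounts",
    "limitranges",
    "pods",
    "replicasets",
]


def restore_order(resources, priorities=PRIORITIES):
    """Prioritized resources first, in list order; the rest alphabetically."""
    prioritized = [r for r in priorities if r in resources]
    rest = sorted(r for r in resources if r not in priorities)
    return prioritized + rest


order = restore_order({"services", "ingresses", "pods", "secrets", "deployments"})
print(order)
```

Because neither `ingresses` nor `services` is in the priority list, both fall into the alphabetical tail, where `ingresses` sorts before `services`.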

There is one way to change the order: you can specify the --restoreResourcePriorities option on the Velero server and customize the entire order. For more information, refer here. It is not guaranteed to succeed, because the Service itself may have dependencies. However, this looks like the only workaround with current Velero, so it is worth trying.
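Since the option replaces the entire order, the value would need to repeat the existing list before adding anything. Below is a minimal sketch of building such a value with Services ahead of Ingresses. The helper `priorities_value` is hypothetical, the resource names are illustrative, and it assumes that resources outside the default list are accepted and that the value is comma-separated, neither of which this thread confirms. The exact flag spelling and value format should be checked against `velero server --help`.

```python
# Abbreviated default order from the list above (names are illustrative).
DEFAULT_PRIORITIES = [
    "customresourcedefinitions",
    "namespaces",
    "storageclasses",
    "persistentvolumes",
    "persistentvolumeclaims",
    "secrets",
    "configmaps",
    "serviceaccounts",
    "limitranges",
    "pods",
    "replicasets",
]


def priorities_value(default, extra):
    """Join default and extra resources into one comma-separated value.

    Order is preserved and duplicates are dropped, so entries in
    `extra` land after the defaults in the order given.
    """
    seen = []
    for resource in list(default) + list(extra):
        if resource not in seen:
            seen.append(resource)
    return ",".join(seen)


# Assumption: listing services before ingresses restores them in that order.
value = priorities_value(DEFAULT_PRIORITIES, ["services", "ingresses"])
print(value)
```

Even with this order, the webhook may still be unreachable if the workloads behind the Service are not running yet, which is the dependency caveat noted above.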

The ultimate solution is to introduce dependency management into Velero restore. However, that is not easy: Velero doesn't know what the application controllers do, so it is hard for Velero to determine the dependencies. In any case, we will put some effort into investigating it.

navilg commented 2 years ago

Thanks @Lyndon-Li.

There is one way to change the order: you can specify the --restoreResourcePriorities option on the Velero server and customize the entire order. For more information, refer here. It is not guaranteed to succeed, because the Service itself may have dependencies. However, this looks like the only workaround with current Velero, so it is worth trying.

I will give this a try. Can we add other resources to --restoreResourcePriorities that are not in the list above? For example, can we add Service and Ingress to this argument even though they don't appear in the ordered list you mentioned?

The ultimate solution is to introduce dependency management into Velero restore. However, that is not easy: Velero doesn't know what the application controllers do, so it is hard for Velero to determine the dependencies. In any case, we will put some effort into investigating it.

I'd appreciate it if we can dig into this and come up with a permanent solution. Since Nginx Ingress is used in most clusters, this would affect the restoration of any namespace that contains an Ingress.

Lyndon-Li commented 2 years ago

There is an existing issue asking to document the limitations of Velero restore with admission webhooks, #4847, and also a proposal to solve this kind of problem, #4572.

Just adding the above information here for reference.