vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

velero in AKS 1.22.11 - Pods not coming up after restoration(From Velero Backup) #5246

Open zohebs341 opened 2 years ago

zohebs341 commented 2 years ago

Discussed in https://github.com/vmware-tanzu/velero/discussions/5245

Originally posted by **zohebs341** August 24, 2022

Environment: Azure AKS 1.22.11
Velero version: velero/velero:v1.8.1, velero/velero-plugin-for-microsoft-azure:v1.4.1

As part of our workload migration, I've deployed Velero in both AKS clusters (primary and secondary). With Velero, I am able to restore a backup into the other cluster.

Challenges:
1. The pod is not coming up. It looks like the PVC is not being attached to the pod.
2. The source (backed-up) PVC runs in a multi-AZ node pool.
3. The destination cluster also runs with multiple AZs, but I am still facing this issue.

I can see that the PVC, PV, storage class, and service were deployed along with the pod, but I can only use the workload once the pod is up. Am I missing something during restoration?

zohebs341 commented 2 years ago

In the pod events, I can see this message: 3 node(s) had volume node affinity conflict.

zohebs341 commented 2 years ago

The source PVC is in a multi-AZ node pool, and the destination cluster also uses multiple AZs. So why is the pod-to-volume mapping still failing?

blackpiglet commented 2 years ago

@zohebs341 Could you follow the guidance at this link and see whether it resolves your problem? https://velero.io/docs/v1.9/restore-reference/#changing-pvc-selected-node

zohebs341 commented 2 years ago

@blackpiglet Thanks for your response. I've attached two files (a working one and a failing one).

Kindly go through the attached Word documents: NotWorking-Velero-AKS-1.22.11.docx and Working-Velero-AKS-1.22.11.docx

The problem I noticed: when restoring the Velero backup in the destination cluster, the PV zones are not changed. For example, the backed-up PVC/PV was in WestEurope, and after restoration the PVC/PV comes up with the same zones.

That's why the pod cannot come up and reports volume conflicts.

But how did the same approach work for me once? In the working case, all configs were the same, but during restoration the PVC/PV came up in the NorthEurope region. The pod came up because the pod, PVC, and PV were all in the same region.

Why does it work only sometimes? Is this a bug on the Velero side, or an issue with the Azure AKS CSI driver?

blackpiglet commented 2 years ago

@zohebs341 AFAIK, the Velero Azure plugin doesn't support cross-region backup and restore. @ywk253100 Am I right?

zohebs341 commented 2 years ago

@blackpiglet I am storing backups in NorthEurope. It worked for me once, but when I tried again today, it did not work.

If my backups were in WestEurope, then, as you said, cross-region would not be supported. Could you please check my attachments?

In this use case:

Source: WestEurope Dest: NorthEurope

And backup storage location is NorthEurope.

zohebs341 commented 2 years ago

After restoration in the destination cluster, the PV location still points to the source cluster's region. But sometimes, after restoration, the PV location points to the destination cluster's region and the pod comes up. I've attached both Word documents in my previous comment.

kubectl describe pv pvc-762426ba-4b92-4af0-84f4-4ab76c627866

Name:              pvc-762426ba-4b92-4af0-84f4-4ab76c627866
Labels:            velero.io/backup-name=con-zrs
                   velero.io/restore-name=con-zrs-ds
Annotations:       pv.kubernetes.io/provisioned-by: disk.csi.azure.com
Finalizers:        [kubernetes.io/pv-protection external-attacher/disk-csi-azure-com]
StorageClass:      csi-zrs
Status:            Bound
Claim:             default/zrs1gb
Reclaim Policy:    Retain
Access Modes:      RWO
VolumeMode:        Filesystem
Capacity:          1Gi
Node Affinity:
  Required Terms:
    Term 0:        topology.disk.csi.azure.com/zone in [westeurope-1]
    Term 1:        topology.disk.csi.azure.com/zone in [westeurope-2]
    Term 2:        topology.disk.csi.azure.com/zone in [westeurope-3]
    Term 3:        topology.disk.csi.azure.com/zone in []
Message:
Source:
    Type:          CSI (a Container Storage Interface (CSI) volume source)
    Driver:        disk.csi.azure.com
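
For readers following along, here is a minimal Python sketch of why the scheduler reports the conflict for a PV with the node affinity printed above. It is a simplified model of the matching rule (required terms are ORed, and an `In` expression matches when the node's label value is in the list), not the scheduler's actual code. The node names and the `northeurope-*` zone labels are assumptions made up for illustration; only the label key and the `westeurope-*` values come from the output above.

```python
# Zone label key taken from the `kubectl describe pv` output above.
ZONE_KEY = "topology.disk.csi.azure.com/zone"

# Required terms as printed for the restored PV. Terms are ORed together.
pv_terms = [
    {ZONE_KEY: ["westeurope-1"]},
    {ZONE_KEY: ["westeurope-2"]},
    {ZONE_KEY: ["westeurope-3"]},
    {ZONE_KEY: []},  # printed with an empty value list; matches no node in this model
]

# ASSUMPTION: hypothetical node names and zone labels for the destination cluster.
nodes = {
    "node-a": {ZONE_KEY: "northeurope-1"},
    "node-b": {ZONE_KEY: "northeurope-2"},
    "node-c": {ZONE_KEY: "northeurope-3"},
}


def term_matches(term, labels):
    """A term matches when, for every key, the node's label value is in the allowed list."""
    return all(labels.get(key) in allowed for key, allowed in term.items())


def node_fits(terms, labels):
    """A node fits the PV when at least one required term matches its labels."""
    return any(term_matches(term, labels) for term in terms)


conflicts = [name for name, labels in nodes.items() if not node_fits(pv_terms, labels)]
print(f"{len(conflicts)} node(s) had volume node affinity conflict")
```

Under these assumed labels, no destination node satisfies any term, so every node is rejected, which is consistent with the event message reported earlier in the thread.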

sseago commented 2 years ago

If you're going cross-region, you need to use restic for backup rather than snapshots. Restic does support cross-region (since completely new PVs are provisioned in the restore cluster, with data copied from the BackupStorageLocation), but AWS/Azure snapshotter plugins do not support cross-region restores.
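
As a rough illustration of the restic route, the sketch below builds the `kubectl annotate` command that opts a pod's volumes into restic backup through the `backup.velero.io/backup-volumes` annotation described in the Velero restic documentation. It assumes Velero was installed with restic enabled. The pod name `zrs-app-0` and the volume name `data` are hypothetical placeholders, and the helper `annotate_command` is invented for this example; it only prints the command and does not run it.

```python
import shlex

# Annotation key used by Velero's opt-in restic backup approach.
ANNOTATION = "backup.velero.io/backup-volumes"


def annotate_command(namespace, pod, volumes):
    """Build the kubectl command that marks the given pod volumes for restic backup."""
    if not volumes:
        raise ValueError("at least one volume name is required")
    return [
        "kubectl", "-n", namespace, "annotate", f"pod/{pod}",
        f"{ANNOTATION}={','.join(volumes)}",
    ]


# ASSUMPTION: hypothetical pod and volume names; replace with your own.
cmd = annotate_command("default", "zrs-app-0", ["data"])
print(shlex.join(cmd))
```

The volume names are the names from the pod spec's `volumes` list, not the PVC names.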

zohebs341 commented 2 years ago

@sseago Thanks for your response. Got it.

zohebs341 commented 2 years ago

@sseago one last question.

What if the source cluster (Region A) runs with no AZs and the destination cluster runs with multiple AZs in the same region (Region A)?

In this case, will Velero backup/restore work without restic?

Both clusters are in the same region; the only difference is the AZs.

sseago commented 2 years ago

@zohebs341 I'm not 100% sure on this off the top of my head, but I think you're fine across AZs within the same region, but not across multiple regions.

blackpiglet commented 2 years ago

@zohebs341 I think there is a possibility that the Velero plugins don't work in this case. I'm sure the GCP plugin doesn't guarantee this. For example, GCP has 6 AZs in the us-central1 region. If your cluster is created as regional, GKE will choose 3 random AZs from the 6 in this region, so there is a good chance that the source cluster's AZs differ from the destination's, and the Velero GCP plugin cannot handle AZ matching for now.
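
To put a rough number on that chance, here is a small Python sketch using the figures from the comment above (3 AZs picked out of 6). It assumes the picks are uniform and independent for each cluster, which the thread does not confirm, so treat the results as back-of-the-envelope only.

```python
from math import comb


def same_zone_set_probability(total_zones, chosen):
    """Chance that two independent uniform picks of `chosen` zones are the identical set."""
    return 1 / comb(total_zones, chosen)


def zone_covered_probability(total_zones, chosen):
    """Chance that one specific zone (e.g. where a zonal disk lives) is among the picked zones."""
    return chosen / total_zones


# Figures from the comment: 6 AZs in the region, 3 chosen per regional cluster.
p_same = same_zone_set_probability(6, 3)
p_covered = zone_covered_probability(6, 3)
print(f"identical AZ sets: {p_same:.0%}, a given AZ is covered: {p_covered:.0%}")
```

Under these assumptions, the two clusters end up with exactly the same AZ set in 1 of 20 cases, and the destination covers the AZ of any one particular disk half of the time.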

zohebs341 commented 2 years ago

@sseago @blackpiglet I just deployed a basic StatefulSet with a PVC on a no-AZ node pool (cluster in Region A). Once the pod was up and the PVC was attached to it, I added a node selector to that StatefulSet so it would run on a multi-AZ node pool of the same cluster.

Same error: node(s) had volume node affinity conflict.

A PVC that belongs to a no-AZ cluster/node pool cannot be used on a multi-AZ node pool of the same cluster. I guess restoring such no-AZ PVCs will fail even if both clusters are in the same region.

But after converting that LRS PVC (from the no-AZ node pool) to a ZRS PVC, it worked, since ZRS supports multiple zones.
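
For context, a zone-redundant storage class for the Azure Disk CSI driver could look roughly like the manifest generated below. This is a sketch under assumptions, not the configuration used in this thread: the class name `csi-zrs`, the provisioner, and the `Retain` reclaim policy match the PV output posted earlier, while the `StandardSSD_ZRS` SKU and the `WaitForFirstConsumer` binding mode are guesses. It only covers newly provisioned volumes and does not convert an existing LRS disk.

```python
import json


def zrs_storage_class(name, sku="StandardSSD_ZRS"):
    """Return a StorageClass manifest (as a dict) for zone-redundant Azure disks."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": "disk.csi.azure.com",  # driver named in the PV output
        "parameters": {"skuName": sku},  # ASSUMPTION: SKU not stated in the thread
        "reclaimPolicy": "Retain",
        "volumeBindingMode": "WaitForFirstConsumer",  # ASSUMPTION
    }


manifest = zrs_storage_class("csi-zrs")
# kubectl accepts JSON manifests, e.g. by piping this output to `kubectl apply -f -`.
print(json.dumps(manifest, indent=2))
```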

blackpiglet commented 2 years ago

@zohebs341 Sounds like this is the expected behavior. Since I'm not familiar with the Azure cloud provider, @ywk253100, could you please take a look to confirm?