vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Velero cannot use two VolumeSnapshotClass #5737

Closed slawekww closed 1 year ago

slawekww commented 1 year ago

What steps did you take and what happened: Using Velero Helm chart 3.0.0 (app version 1.10.0) on AKS 1.24.6, I deployed two Velero instances in two namespaces, test-1 and test-2, using two different storage accounts in two different Azure resource groups.

Define VolumeSnapshotClass:

apiVersion: snapshot.storage.k8s.io/v1
deletionPolicy: Retain
driver: disk.csi.azure.com
kind: VolumeSnapshotClass
metadata:
  name: vsc-rg1
parameters:
  # If it is not set, it points to Resource group where AKS is deployed
  resourcegroup: my-rg-1
---
apiVersion: snapshot.storage.k8s.io/v1
deletionPolicy: Retain
driver: disk.csi.azure.com
kind: VolumeSnapshotClass
metadata:
  name: vsc-rg2
parameters:
  # If it is not set, it points to Resource group where AKS is deployed
  resourcegroup: my-rg-2

Define VolumeSnapshotLocation:

apiVersion: velero.io/v1
kind: VolumeSnapshotLocation
metadata:
  name: vsl1
spec:
  config:
    apiTimeout: 5m
    incremental: "true"
    resourceGroup: my-rg-1
    subscriptionId: my-subscription-id
  provider: azure
---
apiVersion: velero.io/v1
kind: VolumeSnapshotLocation
metadata:
  name: vsl2
spec:
  config:
    apiTimeout: 5m
    incremental: "true"
    resourceGroup: my-rg-2
    subscriptionId: my-subscription-id
  provider: azure

What did you expect to happen: I expect each Velero instance to store its Azure volume snapshots in a different Azure resource group. Currently only VolumeSnapshotClass vsc-rg1 is used and, regardless of the settings in the VolumeSnapshotLocation, the volume snapshot is always stored in my-rg-1. If it were possible to define several VolumeSnapshotLocations and use a single Velero instance, I would be glad to do that; however, the Helm chart allows only one to be defined.

The following information will help us better understand what's going on:

Debug logs are collected and attached; the backup reports no error, but the snapshots are stored in the wrong location. bundle-2023-01-04-11-04-47.tar.gz


Velero should allow snapshots from one cluster to be stored in two different resource groups.

slawekww commented 1 year ago

I'm not sure, but I found that the VolumeSnapshot is created with a nodeAffinity that points to the exact availability zone used by the volume. Even if I want to restore it on a different AKS cluster, that cluster must be in the same Azure region and the node must be in the same availability zone. The volume nodeAffinity on the VolumeSnapshot may prevent storing the snapshot in a different Azure region; I do not know for sure, I just suspect it.

reasonerjt commented 1 year ago

@slawekww Do you have to use the CSI snapshotter? If you rely on the Azure plugin for snapshots in v1.10, it is possible to take snapshots into different VSLs.

However, this is a limitation of the CSI plugin. @blackpiglet could you open an issue to address this requirement specifically in the scope of the CSI plugin?

slawekww commented 1 year ago

@slawekww Do you have to use the CSI snapshotter? If you rely on the Azure plugin for snapshots in v1.10, it is possible to take snapshots into different VSLs.

I rely on Velero plugins:

AKS 1.24.6 installs the CSI storage classes and drivers automatically, at version 1.24.0.2: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.24.0.2

reasonerjt commented 1 year ago

@slawekww

Let me clarify: there are two code paths for taking snapshots on Azure. You may choose NOT to rely on velero-plugin-for-csi, because in that case velero-plugin-for-microsoft-azure calls the Azure API to snapshot the underlying disk. To do that, try not turning on the CSI feature flag when you install Velero.

There is a limitation in the CSI plugin: it cannot take snapshots via the CSI snapshot API for different VolumeSnapshotClasses. #5750 has been opened to track this work, but it won't be implemented in v1.11.

slawekww commented 1 year ago

@reasonerjt Thank you for the guidance! I will test this scenario without velero-plugin-for-csi and let you know the results.

slawekww commented 1 year ago

I ran a test with velero-plugin-for-csi disabled, and the backup failed. Regardless of whether the VolumeSnapshotClass CR has parameters.resourcegroup filled in, Velero could not find the PV disk, because it always looks in the backup RSG instead of the original RSG where AKS is deployed.

Log with error:

time="2023-01-09T13:01:17Z" level=error msg="Error backing up item" backup=velero/velero-test-20230109130055 error="error getting volume info: rpc error: code = Unknown desc = compute.DisksClient#Get: Failure responding to request: StatusCode=404 -- Original Error: autorest/azure: Service returned an error. Status=404 Code=\"ResourceNotFound\" Message=\"The Resource 'Microsoft.Compute/disks/pvc-68a0781c-1ce8-4657-94f1-d0c019a2386d' under resource group BACKUP_RSG' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix\"" logSource="pkg/backup/backup.go:425" name=test-85b657899-2th4r

Once I updated the velero-credentials secret so that AZURE_RESOURCE_GROUP is the AKS resource group, the backup succeeded; however, the snapshots are stored in the AKS resource group instead of Backup_RSG as defined by the VolumeSnapshotClass. Note: the Velero backupStorageLocation always points to the storage account in Backup_RSG.

ywk253100 commented 1 year ago

The VolumeSnapshotLocation, rather than the VolumeSnapshotClass, is used if you disable the CSI plugin.

For your case, you should

slawekww commented 1 year ago

Thanks! With velero-plugin-for-csi disabled, I was able to create Azure snapshots in two different Azure resource groups. However, the snapshot location (region) is always the same as that of the AKS cluster. Is there any option to create snapshots in two different Azure locations?

ywk253100 commented 1 year ago

I don't think you can create the snapshots in a region different from that of the disk/AKS cluster.

slawekww commented 1 year ago

Let's assume the Azure snapshots are copied manually into a different Azure resource group (location) using the Azure SDK API or an az CLI command.

Could you advise what should be changed in the backup files stored by Velero so that the copied Azure snapshots are used? Is it even possible to reuse copied Azure snapshots?

ywk253100 commented 1 year ago

I'm not sure whether it is possible or not; we haven't tested this use case.

Maybe you can go through the logic here to do more investigation and testing

slawekww commented 1 year ago

Let's close this issue, as it is basically possible to create snapshots in two different Azure resource groups. Snapshots are always in the same Azure location (region) as the AKS cluster, regardless of the location of the Azure resource group; this is an Azure limitation. I may do more testing: update the Velero backup files velero-\-\-volumesnapshots.json

  {
    "spec": {
      "backupName": "velero-default-20230111000057",
      "backupUID": "457259a5-619f-4147-a246-9fd652a17370",
      "location": "default",
      "PersistentVolumeName": "pvc-id",
      "providerVolumeID": "pvc-volumeid",
      "volumeType": "StandardSSD_LRS",
      "volumeAZ": "regionorig-2" # change to regiontarget-availabilityzoneid 
    },
    "status": {
      "providerSnapshotID": "/subscriptions/subId/resourceGroups/Resoure_Group_CopiedSnapshot/providers/Microsoft.Compute/snapshots/pvc-id-79038e1d-172f-4611-91ed-b9fa91738dd6",
      "phase": "Completed"
    }
  }

and try to run a Velero restore using those files, but it is really a hack and it may not work.

adrianmarcu18 commented 1 year ago

Let's assume the Azure snapshots are copied manually into a different Azure resource group (location) using the Azure SDK API or an az CLI command.

Could you advise what should be changed in the backup files stored by Velero so that the copied Azure snapshots are used? Is it even possible to reuse copied Azure snapshots?

Even though this has been closed, maybe this will help. I have tested this scenario and it definitely works pretty well. I have been doing it in order to do cross-region cluster restores. What needs to be changed is volumesnapshots.json.gz, where you need to patch "providerSnapshotID" and change the resource group to the new one (the location of your copied snapshots).
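The providerSnapshotID patch could be scripted roughly as follows. This is only a sketch under assumptions, not a tested Velero procedure: it assumes the file is a gzip-compressed JSON array of records shaped like the example earlier in this thread (`spec.volumeAZ`, `status.providerSnapshotID`), and the function names and the optional `new_az` argument are hypothetical.

```python
import gzip
import json
import re

# Matches the resource group segment of an Azure resource ID.
_RG_SEGMENT = re.compile(r"(/resourceGroups/)[^/]+", re.IGNORECASE)


def patch_snapshot_id(snapshot_id, new_resource_group):
    """Return snapshot_id with its resource group segment replaced."""
    return _RG_SEGMENT.sub(
        lambda m: m.group(1) + new_resource_group, snapshot_id, count=1
    )


def patch_records(records, new_resource_group, new_az=None):
    """Patch providerSnapshotID (and optionally volumeAZ) in each record, in place."""
    for record in records:
        status = record.get("status", {})
        if "providerSnapshotID" in status:
            status["providerSnapshotID"] = patch_snapshot_id(
                status["providerSnapshotID"], new_resource_group
            )
        if new_az is not None and "spec" in record:
            record["spec"]["volumeAZ"] = new_az
    return records


def patch_file(src_path, dst_path, new_resource_group, new_az=None):
    """Read a gzipped JSON array (assumed layout), patch it, write a new gzipped file."""
    with gzip.open(src_path, "rt", encoding="utf-8") as f:
        records = json.load(f)
    patch_records(records, new_resource_group, new_az)
    with gzip.open(dst_path, "wt", encoding="utf-8") as f:
        json.dump(records, f)
```

Writing to a separate destination path keeps the original file untouched, which seems prudent given that this edits backup metadata by hand.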

Additionally, you need to patch the nodeAffinity of the actual PV resource in the backup tar file (replace the old region name with the new region name).

The pain here is the actual snapshot copy process and the manual metadata patching.