vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Data mover not working with velero 1.15.0 and Azure Workload Identity #8433

Open fredgate opened 6 days ago

fredgate commented 6 days ago

Follow-up to issue https://github.com/vmware-tanzu/helm-charts/issues/627

What steps did you take and what happened:

On AKS, we back up persistent volumes backed by Azure Disk via CSI snapshots and data movement. Authentication against the object storage (Azure Blob) used to upload backup metadata and CSI snapshot data is performed via Azure Workload Identity.

Starting with Velero 1.15.0 (Helm chart 8.0.0), the data upload actions were moved out of the node agent into microservice pods, each dedicated to one DataUpload.
These pods, however, do not inherit the labels set on the velero and node-agent pods via the Helm value podLabels, even though they do correctly use the velero-server service account.

Azure Workload Identity requires the label azure.workload.identity/use: "true" on the pod so that it can source the client ID from the service account. Because this label is missing, authentication against Azure Blob fails and the data upload cannot complete.
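
A quick way to confirm this on a live cluster is the following client-go sketch; the namespace and label selector match the pod metadata below, everything else (file layout, error handling) is just illustrative:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// Data-mover pods carry the velero.io/exposer-pod-group label
	// (see the pod metadata below).
	pods, err := cs.CoreV1().Pods("velero").List(context.TODO(), metav1.ListOptions{
		LabelSelector: "velero.io/exposer-pod-group=snapshot-exposer",
	})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		_, ok := p.Labels["azure.workload.identity/use"]
		fmt.Printf("%s: azure.workload.identity/use present: %v\n", p.Name, ok)
	}
}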

Here is the metadata of such a pod:

Name:             test-cf5q7
Namespace:        velero
Priority:         0
Service Account:  velero-server
Node:             node1
Start Time:       Tue, 19 Nov 2024 16:42:54 +0100
Labels:           velero.io/data-upload=test-cf5q7
                  velero.io/exposer-pod-group=snapshot-exposer
Annotations:      <none>
Status:           Failed
Controlled By:    DataUpload/test-cf5q7
Containers:
  dce57391-c34d-4b84-9ec6-9b04f1dd4d78:
    Image:         registry.contoso.com/velero/velero:v1.15.0
    Command:
      /velero
      data-mover
      backup

What did you expect to happen:

The CSI snapshot is restored into a temporary PVC and uploaded to Azure Blob.

Anything else you would like to add:

The micro-service pod for data movement is created by the csiSnapshotExposer.Expose method, which provides labels taken from csiExposeParam.HostingPodLabels.
These labels are a map initialized with the data-upload label and then completed with the exposer-pod-group label.
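
To make that concrete, here is a simplified sketch of the label-building behavior just described; the label keys are the ones visible on the pod above, while the function name and structure are illustrative, not the actual Velero source:

// Builds the labels for the data-mover micro-service pod.
func buildHostingPodLabels(dataUploadName string, hostingPodLabels map[string]string) map[string]string {
	// Map initialized with the data-upload label.
	labels := map[string]string{
		"velero.io/data-upload": dataUploadName,
	}
	// Labels passed in through csiExposeParam.HostingPodLabels.
	for k, v := range hostingPodLabels {
		labels[k] = v
	}
	// Completed with the exposer-pod-group label.
	labels["velero.io/exposer-pod-group"] = "snapshot-exposer"
	return labels
}

Nothing in this path carries over the pod labels set via the Helm value podLabels, which is why azure.workload.identity/use is lost.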

Environment:

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

reasonerjt commented 21 hours ago

I recall there was a decision that the microservice DM pod should not copy all the labels and annotations of the velero server and node-agent. We may add this label to the "white-list" and consider making the label configurable in the future, if needed.
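
For illustration, a minimal sketch of that white-list idea, assuming a hypothetical allow-list consulted when the data-mover pod labels are assembled (names are illustrative, not Velero's actual code):

// Hypothetical allow-list; only these keys would be copied from the
// node-agent pod onto the data-mover pod.
var allowedLabels = []string{
	"azure.workload.identity/use",
}

func pickAllowedLabels(src map[string]string) map[string]string {
	dst := map[string]string{}
	for _, k := range allowedLabels {
		if v, ok := src[k]; ok {
			dst[k] = v
		}
	}
	return dst
}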

fredgate commented 16 hours ago

The labels of the velero server and node-agent pods are:

labels:
  app.kubernetes.io/instance=velero
  app.kubernetes.io/managed-by=Helm
  app.kubernetes.io/name=velero
  app.kubernetes.io/version=1.15.0
  azure.workload.identity/use=true
  helm.sh/chart=velero-8.0.0
  name=velero
  pod-template-hash=54d8684d9

labels:
  app.kubernetes.io/instance=velero
  app.kubernetes.io/managed-by=Helm
  app.kubernetes.io/name=velero
  azure.workload.identity/use=true
  controller-revision-hash=556d89b4c6
  helm.sh/chart=velero-8.0.0
  name=node-agent
  pod-template-generation=5

Instead of a green-list of labels to copy to the micro-service pods, it would be better to use a red-list of known labels, so that if someone uses custom labels they are still copied without Velero needing to know them. These labels could be put in the red-list and excluded from the copy (see the sketch after this list):

  app.kubernetes.io/instance
  app.kubernetes.io/managed-by
  app.kubernetes.io/name
  app.kubernetes.io/version
  helm.sh/chart
  name
  controller-revision-hash
  pod-template-hash
  pod-template-generation
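
A hedged sketch of that red-list approach, where the excluded set mirrors the list above and the function and variable names are hypothetical:

// Known system labels that should not be copied to the data-mover pod.
var excludedLabels = map[string]struct{}{
	"app.kubernetes.io/instance":   {},
	"app.kubernetes.io/managed-by": {},
	"app.kubernetes.io/name":       {},
	"app.kubernetes.io/version":    {},
	"helm.sh/chart":                {},
	"name":                         {},
	"controller-revision-hash":     {},
	"pod-template-hash":            {},
	"pod-template-generation":      {},
}

// Copies every label from the source pod except the known system labels.
func copyPodLabels(src map[string]string) map[string]string {
	dst := make(map[string]string, len(src))
	for k, v := range src {
		if _, skip := excludedLabels[k]; skip {
			continue
		}
		dst[k] = v // custom labels such as azure.workload.identity/use survive
	}
	return dst
}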