vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Azure Workload Identity authentication with AD fails #8324

Closed idanme-tr closed 4 weeks ago

idanme-tr commented 1 month ago

What steps did you take and what happened:

I am trying to deploy Velero Helm charts to AKS using Workload Identity. I've followed the Azure plugin guide with workload identity configurations.

For some reason, Velero cannot retrieve the storage account's properties. I've provided the managed identity with more permissions than needed to make sure I do not miss anything.

I understand that this might be a misconfiguration rather than a bug, but I can't find what it is. When I use a storage account key instead of Workload Identity, it works fine.

What did you expect to happen: I expected Velero to be able to authenticate using the workload identity and to be able to backup and restore as it should.

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle, and attach to this issue, more options please refer to velero debug --help

bundle-2024-10-20-11-47-04.tar.gz

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

Anything else you would like to add:

I am adding my Helm configurations. Lines that were commented out were different attempts but were also unsuccessful.

velero:
  configuration:
    backupStorageLocation:
      - name: default
        provider: velero.io/azure
        bucket: int-aks-we02
        config:
          storageAccount: intaksvelerobackups
          resourceGroup: int-aks-velero-backups-rg
          # subscriptionId: ***********
          # storageAccountURI: https://intaksvelerobackups.blob.core.windows.net
          # activeDirectoryAuthorityURI: https://login.microsoftonline.com/
          useAAD: "true"
    volumeSnapshotLocation:
      - name: default
        provider: velero.io/azure
        config:
          resourceGroup: MC_int-aks-we02-rg_int-aks-we02_westeurope
          # subscriptionId: ***********
          # incremental: true
          # activeDirectoryAuthorityURI: https://login.microsoftonline.com/

  credentials:
    secretContents:
      cloud: |
        AZURE_SUBSCRIPTION_ID=***********
        AZURE_RESOURCE_GROUP=MC_int-aks-we02-rg_int-aks-we02_westeurope
        AZURE_CLOUD_NAME=AzurePublicCloud

  nodeAgent:
    enabled: true

  rbac:
    create: true
    clusterAdministrator: true
    clusterAdministratorName: cluster-admin

  serviceAccount:
    server:
      create: true
      name: "int-aks-we02-velero-sa"
      annotations:
        azure.workload.identity/client-id: ***********

  initContainers:
    - name: velero-plugin-for-microsoft-azure
      image: velero/velero-plugin-for-microsoft-azure:v1.10.1
      volumeMounts:
        - mountPath: /target
          name: plugins

  podLabels:
    azure.workload.identity/use: "true"

  schedules:
    daily:
      schedule: "0 2 * * *"
      template:
        ttl: 1h0m0s
        includedNamespaces: []
        excludedNamespaces: []
        storageLocation: default

Environment:

anshulahuja98 commented 1 month ago

@idanme-tr, can you share the permissions you have applied? And also your BSL configuration: does it have the storageAccountURI set?

anshulahuja98 commented 1 month ago

Also, I would recommend checking whether you have any leftovers of Pod Identity in your cluster. That often leads to issues.

idanme-tr commented 1 month ago

Hey, thanks for the reply. We did have Pod Identity installed. I deleted it but still got a few different errors, so I reconfigured everything on a new cluster that never had Pod Identity deployed to it.

The new identity has a few roles assigned to it at different levels:

Storage account level - Reader

Resource group level - Reader, Contributor, Storage Blob Data Contributor, and the Velero custom role based on the documentation.

Current configurations -

velero:
  backupsEnabled: true
  snapshotsEnabled: false

  configuration:
    backupStorageLocation:
      - name: default
        provider: velero.io/azure
        bucket: int-omrizi-upgrades-01-we
        config:
          storageAccount: intaksvelerobackups
          resourceGroup: int-aks-velero-backups-rg
          activeDirectoryAuthorityURI: https://login.microsoftonline.com/
          useAAD: "true"
  credentials:
    secretContents:
      cloud: |
        AZURE_SUBSCRIPTION_ID=c5e7a9f2-8220-4dbd-8a43-545c473a8fda
        AZURE_RESOURCE_GROUP=MC_int-omrizi-upgrades-01-we-rg_int-omrizi-upgrades-01-we_westeurope
        AZURE_CLOUD_NAME=AzurePublicCloud

  serviceAccount:
    server:
      create: true
      name: "int-omrizi-upgrades-01-we-velero-sa"
      annotations:
        azure.workload.identity/client-id: ************

  podLabels:
    azure.workload.identity/use: "true"

  schedules:
    daily:
      schedule: "0 2 * * *"
      template:
        ttl: 336h0m0s # Set TTL for backups (14 days)
        includedNamespaces: []
        excludedNamespaces: 
        - kube-system
        - monitoring
        - twistlock
        - cloudhiro
        - keda
        storageLocation: default

  initContainers:
    - name: velero-plugin-for-microsoft-azure
      image: velero/velero-plugin-for-microsoft-azure:v1.10.1
      volumeMounts:
        - mountPath: /target
          name: plugins

  rbac:
    create: true
    clusterAdministrator: true
    clusterAdministratorName: cluster-admin

  resources:
    requests:
      cpu: 500m
      memory: 128Mi
    limits:
      cpu: 1000m
      memory: 512Mi

  upgradeJobResources:
    requests:
      cpu: 50m
      memory: 128Mi
    limits:
      cpu: 100m
      memory: 256Mi

  deployNodeAgent: true

  nodeAgent:
    priorityClassName: "system-node-critical"
    resources:
      requests:
        cpu: 500m
        memory: 512Mi
      limits:
        cpu: 1000m
        memory: 1024Mi

Logs -

velero time="2024-10-21T15:19:08Z" level=info msg="failed to retrieve the storage account properties: ManagedIdentityCredential: ManagedIdentityCredential: Get \"http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=&resource=https%3A%2F%2Fmanagement.core.windows.net%2F\": context deadline exceeded, fallback to use the default URI \"https://intaksvelerobackups.blob.core.windows.net\"" backup-storage-location=velero/default cmd=/plugins/velero-plugin-for-microsoft-azure controller=backup-storage-location logSource="/go/pkg/mod/github.com/vmware-tanzu/velero@v1.14.1/pkg/util/azure/storage.go:208" pluginName=velero-plugin-for-microsoft-azure
velero time="2024-10-21T15:19:08Z" level=info msg="auth with Azure AD" backup-storage-location=velero/default cmd=/plugins/velero-plugin-for-microsoft-azure controller=backup-storage-location logSource="/go/pkg/mod/github.com/vmware-tanzu/velero@v1.14.1/pkg/util/azure/storage.go:114" pluginName=velero-plugin-for-microsoft-azure
velero time="2024-10-21T15:19:08Z" level=info msg="Validating BackupStorageLocation" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:141"
velero time="2024-10-21T15:19:15Z" level=info msg="failed to retrieve the storage account properties: ManagedIdentityCredential: ManagedIdentityCredential: Get \"http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=&resource=https%3A%2F%2Fmanagement.core.windows.net%2F\": context deadline exceeded, fallback to use the default URI \"https://intaksvelerobackups.blob.core.windows.net\"" backupLocation=velero/default cmd=/plugins/velero-plugin-for-microsoft-azure controller=backup-sync logSource="/go/pkg/mod/github.com/vmware-tanzu/velero@v1.14.1/pkg/util/azure/storage.go:208" pluginName=velero-plugin-for-microsoft-azure
velero time="2024-10-21T15:19:15Z" level=info msg="auth with Azure AD" backupLocation=velero/default cmd=/plugins/velero-plugin-for-microsoft-azure controller=backup-sync logSource="/go/pkg/mod/github.com/vmware-tanzu/velero@v1.14.1/pkg/util/azure/storage.go:114" pluginName=velero-plugin-for-microsoft-azure

Logs when adding the URI directly to the BSL:

velero time="2024-10-21T16:07:46Z" level=info msg="the storage account URI \"https://intaksvelerobackups.blob.core.windows.net\" is specified in the BSL, use it directly" backup-storage-location=velero/default cmd=/plugins/velero-plugin-for-microsoft-azure controller=backup-storage-location logSource="/go/pkg/mod/github.com/vmware-tanzu/velero@v1.14.1/pkg/util/azure/storage.go:171" pluginName=velero-plugin-for-microsoft-azure                                                                                                                                                                        
velero time="2024-10-21T16:07:46Z" level=info msg="auth with Azure AD" backup-storage-location=velero/default cmd=/plugins/velero-plugin-for-microsoft-azure controller=backup-storage-location logSource="/go/pkg/mod/github.com/vmware-tanzu/velero@v1.14.1/pkg/util/azure/storage.go:114" pluginName=velero-plugin-for-microsoft-azure                                                                     
velero time="2024-10-21T16:07:46Z" level=info msg="Validating BackupStorageLocation" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:141"                                                                                                                                                                           
velero time="2024-10-21T16:08:00Z" level=error msg="Error listing backups in backup store" backupLocation=velero/default controller=backup-sync error="rpc error: code = Unknown desc = ManagedIdentityCredential: ManagedIdentityCredential: Get \"http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=&resource=https%3A%2F%2Fstorage.azure.com\": context deadline exceeded" logSource="pkg/controller/backup_sync_controller.go:109"                                                                                                                                          
velero time="2024-10-21T16:08:00Z" level=info msg="plugin process exited" backupLocation=velero/default cmd=/plugins/velero-plugin-for-microsoft-azure controller=backup-sync id=206 logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:80" plugin=/plugins/velero-plugin-for-microsoft-azure                                                                                                         
velero time="2024-10-21T16:08:00Z" level=info msg="the storage account URI \"https://intaksvelerobackups.blob.core.windows.net\" is specified in the BSL, use it directly" backupLocation=velero/default cmd=/plugins/velero-plugin-for-microsoft-azure controller=backup-sync logSource="/go/pkg/mod/github.com/vmware-tanzu/velero@v1.14.1/pkg/util/azure/storage.go:171" pluginName=velero-plugin-for-microsoft-azure                                                                                                                                                                                             
velero time="2024-10-21T16:08:00Z" level=info msg="auth with Azure AD" backupLocation=velero/default cmd=/plugins/velero-plugin-for-microsoft-azure controller=backup-sync logSource="/go/pkg/mod/github.com/vmware-tanzu/velero@v1.14.1/pkg/util/azure/storage.go:114" pluginName=velero-plugin-for-microsoft-azure                                                                                          
velero time="2024-10-21T16:15:36Z" level=error msg="fail to validate backup store" backup-storage-location=velero/default controller=backup-storage-location error="rpc error: code = Unknown desc = ManagedIdentityCredential: ManagedIdentityCredential: Get \"http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=&resource=https%3A%2F%2Fstorage.azure.com\": context deadline exceeded" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/persistence/object_store.go:206" error.function="github.com/vmware-tanzu/velero/pkg/persistence.(*objectBackupStore).IsValid" logSource="pkg/controller/backup_storage_location_controller.go:144"                                                                                                                                   
velero time="2024-10-21T16:15:36Z" level=info msg="BackupStorageLocation is invalid, marking as unavailable" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:120"                                                                                                                                                   

This is the BSL configuration:

apiVersion: v1
items:
- apiVersion: velero.io/v1
  kind: BackupStorageLocation
  metadata:
    annotations:
      meta.helm.sh/release-name: velero
      meta.helm.sh/release-namespace: velero
    creationTimestamp: "2024-10-21T15:28:26Z"
    generation: 8
    labels:
      app.kubernetes.io/instance: velero
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: velero
      helm.sh/chart: velero-7.2.1
    name: default
    namespace: velero
    resourceVersion: "55720075"
    uid: 926c3b55-abc7-4362-a977-21fec5791cc8
  spec:
    accessMode: ReadWrite
    config:
      activeDirectoryAuthorityURI: https://login.microsoftonline.com/
      resourceGroup: int-aks-velero-backups-rg
      storageAccount: intaksvelerobackups
      storageAccountURI: https://intaksvelerobackups.blob.core.windows.net
      useAAD: "true"
    default: true
    objectStorage:
      bucket: int-omrizi-upgrades-01-we
    provider: velero.io/azure
  status:
    lastValidationTime: "2024-10-21T16:15:36Z"
    message: 'BackupStorageLocation "default" is unavailable: rpc error: code = Unknown
      desc = ManagedIdentityCredential: ManagedIdentityCredential: Get "http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=&resource=https%3A%2F%2Fstorage.azure.com":
      context deadline exceeded'
    phase: Unavailable
kind: List
metadata:
  resourceVersion: ""

I am attaching another debug file. bundle-2024-10-21-19-24-33.tar.gz

Thanks again

anshulahuja98 commented 1 month ago

  1. Have you by any chance restricted access to the IMDS endpoint? https://learn.microsoft.com/en-us/azure/aks/operator-best-practices-cluster-security?tabs=azure-cli#restrict-access-to-instance-metadata-api

  2. Can you try running curl against 169.254.169.254 from any pod?

idanme-tr commented 1 month ago

Hey, we restrict access to the API server, but not to the instance metadata API you referred to. The cluster's subnet is whitelisted to reach the API server.

We have a few applications that work with Workload identities on a different cluster, so I don't believe it's blocked.

curl 169.254.169.254
<?xml version="1.0" encoding="utf-8"?>
<Error xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <Code>MissingRequiredQueryParameter</Code>
    <Message>A required query parameter was not specified for this request.</Message>
    <Details>'comp' is a required query string variable.</Details>
curl http://169.254.169.254/metadata/identity/oauth2/token
{"error":"invalid_request","error_description":"Required metadata header not specified"}
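The second response suggests that IMDS is reachable and that the request was only missing the `Metadata: true` header. A minimal sketch of a complete token request follows, reusing the URL that appears in the Velero logs above. The helper name `imds_token_url` is hypothetical, and this only exercises the managed identity path; Workload Identity obtains its token from a projected service account token rather than from IMDS.

```shell
#!/usr/bin/env bash
# Build the IMDS token URL seen in the Velero logs for a given resource.
# imds_token_url is a hypothetical helper name, not part of any tool.
imds_token_url() {
  # $1: URL-encoded resource, e.g. https%3A%2F%2Fstorage.azure.com
  printf 'http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=%s\n' "$1"
}

# Usage from inside a pod (IMDS rejects requests without the Metadata header):
#   curl -s -H "Metadata: true" "$(imds_token_url 'https%3A%2F%2Fstorage.azure.com')"
```
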

anshulahuja98 commented 4 weeks ago

Okay, thanks for this info. I started digging in a different direction: https://github.com/vmware-tanzu/velero/blob/8afe3cea8b7058f7baaf447b9fb407312c40d2da/pkg/util/azure/credential.go#L49

Basically, from your logs I can see that the code is falling through to ManagedIdentityCredential instead of using NewWorkloadIdentityCredential.

Can you check whether the environment has AZURE_FEDERATED_TOKEN_FILE injected (for the pod, I guess)?

My current hunch is that workload identity for the Velero pod is not set up correctly: the token is not being projected into the pod for the service account, and hence workload identity auth is not kicking in.
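One way to run that check without exec'ing into the pod is to read the environment variable names from the pod spec, since the Azure Workload Identity webhook injects them at admission time. The sketch below assumes the `velero` namespace and the `app.kubernetes.io/name=velero` label (taken from the Helm labels shown earlier in this thread; the pod labels may differ), and `missing_wi_env` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Variables the Azure Workload Identity webhook injects into labelled pods.
WI_VARS="AZURE_CLIENT_ID AZURE_TENANT_ID AZURE_FEDERATED_TOKEN_FILE AZURE_AUTHORITY_HOST"

# missing_wi_env (hypothetical helper): reads env var names on stdin, one per
# line, and prints each expected workload identity variable that is absent.
missing_wi_env() {
  local names var
  names="$(cat)"
  for var in $WI_VARS; do
    printf '%s\n' "$names" | grep -qxF "$var" || printf '%s\n' "$var"
  done
}

# Usage (namespace and label selector are assumptions, see above):
#   kubectl -n velero get pod -l app.kubernetes.io/name=velero \
#     -o jsonpath='{range .items[*].spec.containers[*].env[*]}{.name}{"\n"}{end}' \
#     | missing_wi_env
```

If the function prints all four names, the webhook has most likely not mutated the pod, which would be consistent with the fallback to ManagedIdentityCredential seen in the logs.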

idanme-tr commented 4 weeks ago

I found the issue. It was silly of me to copy-paste the "az aks update" from the documentation without noticing that it does not activate the workload identity add-on.

https://learn.microsoft.com/en-us/azure/aks/use-oidc-issuer#update-an-aks-cluster-with-oidc-issuer
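For reference, a minimal sketch of an update that turns on both the OIDC issuer and the workload identity add-on. The resource group and cluster names are placeholders, and `wi_enable_cmd` is a hypothetical helper that only prints the command so it can be reviewed before running.

```shell
#!/usr/bin/env bash
# wi_enable_cmd (hypothetical helper): print the az command that enables both
# the OIDC issuer and the workload identity add-on on an existing cluster.
wi_enable_cmd() {
  # $1: resource group, $2: cluster name (both placeholders in the usage below)
  printf 'az aks update --resource-group %s --name %s --enable-oidc-issuer --enable-workload-identity\n' "$1" "$2"
}

# Usage (placeholder names; review the printed command before running it):
#   wi_enable_cmd my-rg my-cluster
# Afterwards, both flags can be checked on the cluster:
#   az aks show --resource-group my-rg --name my-cluster \
#     --query '{oidc: oidcIssuerProfile.enabled, wi: securityProfile.workloadIdentity.enabled}'
```

Existing pods are only mutated when they are created, so the Velero pod would presumably need a restart after the add-on is enabled.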

I think the Azure plugin documentation needs to be refreshed a bit. Might be able to assist with that a bit later.

Thanks for the help!

anshulahuja98 commented 2 weeks ago

Would it be possible for you to raise a PR for this small fix, or create an issue with the exact gaps you found? @idanme-tr