vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Velero 1.14.1 & aws plugin 1.10.1: using wrong role with IRSA #8240

Open lrstanley opened 2 months ago

lrstanley commented 2 months ago

What steps did you take and what happened:

It looks as though the latest version of velero-plugin-for-aws is using IRSA incorrectly: it appears to use the role attached to the node rather than the role attached to the service account.

What did you expect to happen:

If an IRSA role is attached to the service account velero is using, I would expect it to use that role.

The following information will help us better understand what's going on:

Unable to provide a support bundle due to the sensitivity of this cluster. With that being said, hopefully this is enough information.

Errors:

time="2024-09-23T20:45:30Z" level=error msg="Current BackupStorageLocations available/unavailable/unknown: 0/1/0, BackupStorageLocation \"default\" is unavailable: rpc error: code = Unknown desc = operation error S3: ListObjectsV2, https response error StatusCode: 403, RequestID: TRUNCATED, HostID: TRUNCATED, api error AccessDenied: User: arn:aws:sts::TRUNCATED:assumed-role/TRUNCATED-worker/TRUNCATED is not authorized to perform: s3:ListBucket on resource: \"arn:aws:s3:::TRUNCATED-prod-velero\" because no identity-based policy allows the s3:ListBucket action)" controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:178"
time="2024-09-23T20:45:30Z" level=info msg="plugin process exited" backup-storage-location=velero/default cmd=/plugins/velero-plugin-for-aws controller=backup-storage-location id=200 logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:80" plugin=/plugins/velero-plugin-for-aws
time="2024-09-23T20:46:20Z" level=error msg="Error listing backups in backup store" backupLocation=velero/default controller=backup-sync error="rpc error: code = Unknown desc = operation error S3: ListObjectsV2, https response error StatusCode: 403, RequestID: TRUNCATED, HostID: TRUNCATED, api error AccessDenied: User: arn:aws:sts::TRUNCATED:assumed-role/TRUNCATED-worker/TRUNCATED is not authorized to perform: s3:ListBucket on resource: \"arn:aws:s3:::TRUNCATED-prod-velero\" because no identity-based policy allows the s3:ListBucket action" error.file="/go/src/velero-plugin-for-aws/velero-plugin-for-aws/object_store.go:351" error.function="main.(*ObjectStore).ListCommonPrefixes" logSource="pkg/controller/backup_sync_controller.go:109"
time="2024-09-23T20:46:20Z" level=info msg="plugin process exited" backupLocation=velero/default cmd=/plugins/velero-plugin-for-aws controller=backup-sync id=213 logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:80" plugin=/plugins/velero-plugin-for-aws

Helm chart configuration:

resources:
  requests:
    cpu: 250m
    memory: 1Gi
  limits:
    cpu: 2000m
    memory: 1.5Gi
initContainers:
  - name: velero-plugin-for-aws
    image: velero/velero-plugin-for-aws:v1.10.1
    imagePullPolicy: IfNotPresent
    volumeMounts:
      - mountPath: /target
        name: plugins
configuration:
  extraEnvVars:
    GOMEMLIMIT: 1024MiB
  fsBackupTimeout: 480m
  backupStorageLocation:
    - name: default
      provider: aws
      bucket: TRUNCATED-${attribute_aws_account_env}-velero
      default: true
      config:
        region: us-east-1
  volumeSnapshotLocation:
    - name: default
      provider: aws
      config:
        region: us-east-1
  # Set to true to back up all pod volumes with file system backup without having to annotate each pod. Default: false.
  defaultVolumesToFsBackup: false
backupsEnabled: true
snapshotsEnabled: true
deployNodeAgent: true
# credentials:
#   useSecret: true
#   existingSecret: velero-s3
serviceAccount:
  server:
    create: true
    annotations:
      eks.amazonaws.com/role-arn: "${attribute_role_arn_velero}"
metrics:
  serviceMonitor:
    enabled: true
  prometheusRule:
    enabled: true
    spec:
      - alert: VeleroBackupPartialFailures
        annotations:
          message: Velero backup {{ $labels.schedule }} has {{ $value | humanizePercentage }} partially failed backups.
        expr: |-
          velero_backup_partial_failure_total{schedule!=""} / velero_backup_attempt_total{schedule!=""} > 0.25
        for: 15m
        labels:
          severity: warning
      - alert: VeleroBackupFailures
        annotations:
          message: Velero backup {{ $labels.schedule }} has {{ $value | humanizePercentage }} failed backups.
        expr: |-
          velero_backup_failure_total{schedule!=""} / velero_backup_attempt_total{schedule!=""} > 0.25
        for: 15m
        labels:
          severity: warning
nodeAgent:
  extraEnvVars:
    GOMEMLIMIT: 2048MiB
  resources:
    requests:
      cpu: 250m
      memory: 2Gi
    limits:
      cpu: 2000m
      memory: 4Gi
schedules:
  default-weekly:
    schedule: "0 3 * * 6"
    template:
      includedNamespaces:
        - "*"
      ttl: 1440h0m0s
      storageLocation: default
    useOwnerReferencesInBackup: false
  default-daily:
    schedule: "0 5 * * *"
    template:
      includedNamespaces:
        - "*"
      ttl: 168h0m0s
      storageLocation: default
    useOwnerReferencesInBackup: false

Service account YAML, taken directly from the cluster, showing the expected annotation:

apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::TRUNCATED:role/TRUNCATED-velero
    meta.helm.sh/release-name: velero
    meta.helm.sh/release-namespace: velero
  labels:
    app.kubernetes.io/instance: velero
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: velero
    helm.sh/chart: velero-7.2.1
  name: velero-server
  namespace: velero
  # [...]

Given the above output, it looks like Velero is using the default role from IMDS (the EC2 worker node role), not the IRSA role. Worth noting that prior to this version we were on 1.10.x, and IRSA was working without issue. It looks like the switch to sdk-v2 has caused some issues.
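The comparison behind that conclusion can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the regular expressions assume the ARN shapes seen in the log and annotation above, and the account ID is matched loosely because it is truncated in this report.

```python
import re

# ARN shapes assumed from the error log and the service account annotation above.
ASSUMED_ROLE_RE = re.compile(r"arn:aws:sts::[^:]*:assumed-role/([^/\s]+)/")
IAM_ROLE_RE = re.compile(r"arn:aws:iam::[^:]*:role/(\S+)$")


def assumed_role_name(error_message):
    """Return the role name from an 'assumed-role' ARN in an AccessDenied message."""
    match = ASSUMED_ROLE_RE.search(error_message)
    return match.group(1) if match else None


def annotated_role_name(role_arn):
    """Return the role name from an IAM role ARN, dropping any role path."""
    match = IAM_ROLE_RE.match(role_arn.strip())
    if not match:
        return None
    return match.group(1).rsplit("/", 1)[-1]


def uses_annotated_role(error_message, role_arn):
    """True only if the failing identity is the role named in the annotation."""
    assumed = assumed_role_name(error_message)
    return assumed is not None and assumed == annotated_role_name(role_arn)


if __name__ == "__main__":
    error = (
        "api error AccessDenied: User: "
        "arn:aws:sts::TRUNCATED:assumed-role/TRUNCATED-worker/TRUNCATED "
        "is not authorized to perform: s3:ListBucket"
    )
    annotation = "arn:aws:iam::TRUNCATED:role/TRUNCATED-velero"
    print("role in error:     ", assumed_role_name(error))
    print("role in annotation:", annotated_role_name(annotation))
    print("match:             ", uses_annotated_role(error, annotation))
```

With the values from this report, the role in the error is the worker role and the role in the annotation is the Velero role, so the two do not match.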

witalisoft commented 1 month ago

Try setting the following in the Helm values:

credentials:
  useSecret: false
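To check whether a change like this has the intended effect, it may help to look at the credential-related environment variables in the Velero pod spec. The sketch below is a hypothetical helper, not part of Velero. It relies on AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE being the variables the EKS pod identity webhook injects for IRSA. The list of "other" variables, including the assumption that the chart sets AWS_SHARED_CREDENTIALS_FILE when useSecret is true, is a guess to verify against your own deployment.

```python
import json
import sys

# Variables injected by the EKS pod identity webhook when IRSA is in effect.
IRSA_VARS = ("AWS_ROLE_ARN", "AWS_WEB_IDENTITY_TOKEN_FILE")
# Assumption: variables that may point the SDK at a different credential source.
OTHER_VARS = ("AWS_SHARED_CREDENTIALS_FILE", "AWS_ACCESS_KEY_ID", "AWS_PROFILE")


def summarize_credential_env(container_env):
    """Inspect a container's `env` list as it appears in a pod spec.

    Returns (missing IRSA variables, other credential variables present).
    """
    names = {entry["name"] for entry in container_env}
    missing = [name for name in IRSA_VARS if name not in names]
    competing = [name for name in OTHER_VARS if name in names]
    return missing, competing


if __name__ == "__main__":
    # Expects pod JSON on stdin, e.g. the output of `kubectl get pod ... -o json`.
    pod = json.load(sys.stdin)
    for container in pod["spec"]["containers"]:
        missing, competing = summarize_credential_env(container.get("env", []))
        print(container["name"])
        print("  missing IRSA variables:", missing or "none")
        print("  other credential variables:", competing or "none")
```

If the IRSA variables are present but another credential variable is also set, that is one place to look; the precedence the plugin actually applies is not something this sketch determines.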