vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Backup requires cluster scope to succeed, even when installed without clusterAdministrator #5156

Closed fryz closed 11 months ago

fryz commented 2 years ago

What steps did you take and what happened:

Due to restrictions on the k8s environment I am running in, I cannot use ClusterRoleBindings because they grant cluster-scoped permissions. To address this, I installed Velero (using the helm chart; the values.yaml file is provided below) with the following configuration:

  1. Disabled rbac.clusterAdministrator, so the ClusterRoleBinding is neither generated nor applied
  2. Created a ServiceAccount for velero that uses IRSA annotations
  3. Created a Role definition which grants permissions for all apiGroups/resources/verbs that I'd need to back up the state in my cluster
  4. Created a RoleBinding to bind the Role from (3) to the ServiceAccount in (2)
  5. Velero is installed into the namespace which hosts our application (named app), and we expect to back up only resources within that namespace (e.g. not CRDs, PVs, etc.)

With the aforementioned configuration, I try to take a backup:

velero backup create backup-test-$(date -Idate) -n app

And I get the following in the backup-controller logs:

time="2022-07-27T15:55:41Z" level=info msg="Setting up backup log" backup=app/backup-test-2022-07-27 controller=backup logSource="pkg/controller/backup_controller.go:557"
time="2022-07-27T15:55:41Z" level=info msg="Setting up backup temp file" backup=app/backup-test-2022-07-27 logSource="pkg/controller/backup_controller.go:579"
time="2022-07-27T15:55:41Z" level=info msg="Setting up plugin manager" backup=app/backup-test-2022-07-27 logSource="pkg/controller/backup_controller.go:586"
time="2022-07-27T15:55:41Z" level=info msg="Getting backup item actions" backup=app/backup-test-2022-07-27 logSource="pkg/controller/backup_controller.go:590"
time="2022-07-27T15:55:41Z" level=info msg="Setting up backup store to check for backup existence" backup=app/backup-test-2022-07-27 logSource="pkg/controller/backup_controller.go:600"
time="2022-07-27T15:55:41Z" level=info msg="Writing backup version file" backup=app/backup-test-2022-07-27 logSource="pkg/backup/backup.go:192"
time="2022-07-27T15:55:41Z" level=info msg="Including namespaces: *" backup=app/backup-test-2022-07-27 logSource="pkg/backup/backup.go:198"
time="2022-07-27T15:55:41Z" level=info msg="Excluding namespaces: <none>" backup=app/backup-test-2022-07-27 logSource="pkg/backup/backup.go:199"
time="2022-07-27T15:55:41Z" level=info msg="Including resources: *" backup=app/backup-test-2022-07-27 logSource="pkg/backup/backup.go:202"
time="2022-07-27T15:55:41Z" level=info msg="Excluding resources: clusterrolebindings.rbac.authorization.k8s.io" backup=app/backup-test-2022-07-27 logSource="pkg/backup/backup.go:203"
time="2022-07-27T15:55:41Z" level=info msg="Backing up all pod volumes using Restic: true" backup=app/backup-test-2022-07-27 logSource="pkg/backup/backup.go:204"
time="2022-07-27T15:55:41Z" level=info msg="Setting up backup store to persist the backup" backup=app/backup-test-2022-07-27 logSource="pkg/controller/backup_controller.go:720"
time="2022-07-27T15:55:41Z" level=info msg="Backup completed" backup=app/backup-test-2022-07-27 controller=backup logSource="pkg/controller/backup_controller.go:730"
time="2022-07-27T15:55:41Z" level=error msg="backup failed" controller=backup error="rpc error: code = Unknown desc = clusterrolebindings.rbac.authorization.k8s.io is forbidden: User \"system:serviceaccount:app:velero-server\" cannot list resource \"clusterrolebindings\" in API group \"rbac.authorization.k8s.io\" at the cluster scope" key=app/backup-test-2022-07-27 logSource="pkg/controller/backup_controller.go:298"

With the specific failure being:

time="2022-07-27T15:55:41Z" level=error msg="backup failed" controller=backup error="rpc error: code = Unknown desc = clusterrolebindings.rbac.authorization.k8s.io is forbidden: User \"system:serviceaccount:app:velero-server\" cannot list resource \"clusterrolebindings\" in API group \"rbac.authorization.k8s.io\" at the cluster scope" key=app/backup-test-2022-07-27 logSource="pkg/controller/backup_controller.go:298"

Even when I attempt to exclude CRBs as a resource from the backup, I get the same error:

velero backup create backup-test-$(date -Idate) -n app --exclude-resources clusterrolebindings.rbac.authorization.k8s.io

What did you expect to happen:

Backups should not require cluster scope to back up resources when rbac.clusterAdministrator is disabled, or I should be able to exclude resources that require cluster scope from the backup so that the backup succeeds.

The following information will help us better understand what's going on:

I'm not going to attach the debug output because it contains the logs and state of all pods/resources in the cluster. Let me know if you need specific information and I can provide it.

Anything else you would like to add:

Helm Values File:


initContainers:
  - name: velero-plugin-for-aws
    image: REDACTED  # Note: we are using v1.5.0 of the plugin, but hosted in an airgapped repository
    volumeMounts:
      - name: plugins
        mountPath: /target

podSecurityContext:
  runAsNonRoot: true
  runAsUser: 1000
  runAsGroup: 1000

extraObjects:
  - apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: velero-server
      namespace: app
      labels:
        app.kubernetes.io/component: server
        app.kubernetes.io/name: velero
    rules:
    - apiGroups:
        - crd.k8s.amazonaws.com
        - rbac.authorization.k8s.io
        - certificates.k8s.io
        - metrics.k8s.io
        - admissionregistration.k8s.io
        - authorization.k8s.io
        - node.k8s.io
        - events.k8s.io
        - clickhouse.altinity.com
        - apps
        - extensions
        - batch
        - v1
        - argoproj.io
        - velero.io
        - scheduling.k8s.io
        - apiextensions.k8s.io
        - authentication.k8s.io
        - flowcontrol.apiserver.k8s.io
        - policy
        - autoscaling
        - networking.k8s.io
        - apiregistration.k8s.io
        - discovery.k8s.io
        - storage.k8s.io
        - monitoring.coreos.com
        - vpcresources.k8s.aws
        - coordination.k8s.io
        - ""
      resources:
        - rolebindings
        - podtemplates
        - apiservices
        - resticrepositories
        - horizontalpodautoscalers
        - backupstoragelocations
        - serverstatusrequests
        - resourcequotas
        - alertmanagers
        - componentstatuses
        - leases
        - volumeattachments
        - persistentvolumes
        - deployments
        - workflows
        - podvolumerestores
        - statefulsets
        - workflowtaskresults
        - ingresses
        - workfloweventbindings
        - networkpolicies
        - podvolumebackups
        - eniconfigs
        - bindings
        - volumesnapshotlocations
        - replicasets
        - workflowtasksets
        - localsubjectaccessreviews
        - deletebackuprequests
        - prometheuses
        - nodes
        - jobs
        - certificatesigningrequests
        - secrets
        - customresourcedefinitions
        - replicationcontrollers
        - cronjobs
        - ingressclasses
        - podsecuritypolicies
        - controllerrevisions
        - selfsubjectaccessreviews
        - csinodes
        - storageclasses
        - cronworkflows
        - namespaces
        - clusterroles
        - roles
        - priorityclasses
        - services
        - clickhouseinstallations
        - limitranges
        - clusterworkflowtemplates
        - serviceaccounts
        - clusterrolebindings
        - prioritylevelconfigurations
        - csidrivers
        - daemonsets
        - restores
        - endpointslices
        - alertmanagerconfigs
        - flowschemas
        - subjectaccessreviews
        - clickhouseinstallationtemplates
        - backups
        - downloadrequests
        - selfsubjectrulesreviews
        - mutatingwebhookconfigurations
        - configmaps
        - probes
        - podmonitors
        - thanosrulers
        - clickhouseoperatorconfigurations
        - csistoragecapacities
        - persistentvolumeclaims
        - workflowtemplates
        - servicemonitors
        - schedules
        - tokenreviews
        - securitygrouppolicies
        - poddisruptionbudgets
        - runtimeclasses
        - pods
        - events
        - prometheusrules
        - validatingwebhookconfigurations
        - endpoints
      verbs:
        - watch
        - update
        - deletecollection
        - create
        - delete
        - patch
        - get
        - list
  - apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: velero-server
      namespace: app
      labels:
        app.kubernetes.io/component: server
        app.kubernetes.io/name: velero
    subjects:
      - kind: ServiceAccount
        namespace: app
        name: velero-server
    roleRef:
      kind: Role
      name: velero-server
      apiGroup: rbac.authorization.k8s.io

upgradeCRDs: false

configuration:
  provider: aws

  backupStorageLocation:
    name: default
    bucket: REDACTED
    config:
      region: us-east-2

  volumeSnapshotLocation:
    name: REDACTED
    config:
      region: us-east-2

  defaultVolumesToRestic: true

rbac:
  create: false
  clusterAdministrator: false

serviceAccount:
  server:
    create: true
    name: velero-server
    annotations:
      eks.amazonaws.com/sts-regional-endpoints: "true"
      eks.amazonaws.com/role-arn: arn:aws:iam::REDACTED

credentials:
  useSecret: false

deployRestic: true

restic:
  podSecurityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 1000

Environment:

fryz commented 2 years ago

@reasonerjt - I'm wondering if you have any thoughts/ideas on this issue.

I've done some more digging, and I haven't been able to make Velero work without granting cluster-scope list permissions on the resources I wish to back up.

It appears that Velero requires the service account to have permissions to list ALL resources at cluster scope, even if it's configured to back up resources in only a single namespace.

Is there some way around this?

I first tried to see if I could control the backup and prevent it from accessing resources that might be at cluster scope, or resources which it might not have permissions to list/get. In both of the cases below, I continued to get the permission error listed in the original description (i.e. the error listing ClusterRoleBindings at cluster scope):

  1. Using --include-cluster-resources=false to stop the backup from accessing cluster-scoped resources
  2. Using --include-namespaces to include only the single namespace I want to back up (this is the same namespace where velero is installed)

Then, on a dev environment (where I have permissions to create ClusterRoleBindings), I tried to see if I could proceed with the backup by creating a ClusterRoleBinding that only allowed listing CRBs. The backup got further, but then hit new failures that looked similar:

time="2022-08-08T20:07:28Z" level=error msg="Error listing resources" backup=app/backup-test1-2022-08-08 error="pods is forbidden: User \"system:serviceaccount:app:velero-server\" cannot list resource \"pods\" in API group \"\" at the cluster scope" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/backup/item_collector.go:476" error.function="github.com/vmware-tanzu/velero/pkg/backup.(*itemCollector).

Any thoughts?

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 2 years ago

Closing the stale issue.

wozneyr commented 1 year ago

@reasonerjt Has there been any development on this? I'm noticing a similar issue trying to use velero v1.10.0 in Azure

kaovilai commented 1 year ago

duplicate of https://github.com/vmware-tanzu/velero/issues/18

Once #18 is closed, implementation can begin, and then it'll be released.

BNutterJP commented 1 year ago

Hi - Can this JIRA be re-opened?

blackpiglet commented 1 year ago

I think this is related to the k8s ServiceAccount resource. There is a Velero plugin that applies to ServiceAccounts. The plugin goes through all ClusterRoleBindings and ClusterRoles, then returns the ones related to the ServiceAccount as additional backup items.

If it's possible, please also exclude the ServiceAccount resource from the backup.
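
To illustrate why that plugin behaviour needs cluster scope, here is a minimal Python sketch of the matching logic described above. It is not Velero's actual code (Velero is written in Go); the function name related_rbac_items and the dict layout are assumptions for illustration, with the dicts shaped like ClusterRoleBinding objects serialized to JSON. The point is that the input must be every ClusterRoleBinding in the cluster, which is exactly the list call that fails with only a Role/RoleBinding.

```python
from typing import Dict, List, Tuple


def related_rbac_items(
    sa_namespace: str,
    sa_name: str,
    cluster_role_bindings: List[Dict],
) -> Tuple[List[str], List[str]]:
    """Return (ClusterRoleBinding names, ClusterRole names) tied to one ServiceAccount.

    Hypothetical helper: `cluster_role_bindings` stands in for the result of
    listing ALL ClusterRoleBindings, which requires cluster-scope list access.
    """
    bindings: List[str] = []
    roles: List[str] = []
    for crb in cluster_role_bindings:
        for subject in crb.get("subjects") or []:
            if (
                subject.get("kind") == "ServiceAccount"
                and subject.get("namespace") == sa_namespace
                and subject.get("name") == sa_name
            ):
                bindings.append(crb["metadata"]["name"])
                role = crb["roleRef"]["name"]
                if role not in roles:
                    roles.append(role)
                break  # one matching subject is enough for this binding
    return bindings, roles


if __name__ == "__main__":
    sample = [
        {
            "metadata": {"name": "app-reader"},
            "roleRef": {"kind": "ClusterRole", "name": "view"},
            "subjects": [
                {"kind": "ServiceAccount", "namespace": "app", "name": "velero-server"}
            ],
        },
        {
            "metadata": {"name": "unrelated"},
            "roleRef": {"kind": "ClusterRole", "name": "edit"},
            "subjects": [{"kind": "User", "name": "alice"}],
        },
    ]
    print(related_rbac_items("app", "velero-server", sample))
```

If ServiceAccounts are excluded from the backup, nothing triggers this lookup, which is why the exclusion is suggested as a workaround.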

blackpiglet commented 1 year ago

That is a somewhat hacky solution, and there may also be other resources that need permission to access cluster-scoped resources.

@fryz Could you give some details about your environment? Velero is designed to work with the cluster administrator's permission. Is it possible to let Velero have the administrator's permission, and only back up resources in a specific namespace?

blackpiglet commented 1 year ago

@fryz One thing that confused me about your scenario is that a Role and RoleBinding are used to set permissions for the Velero server, and the Role also includes quite a few cluster-scoped resources in its permissions. Please be aware that a Role can only be used to grant permissions on namespace-scoped resources.

I also saw that ClusterRole and ClusterRoleBinding are included in the Role permissions, so you are just trying to back up namespace-scoped resources, right?
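
As a rough illustration of the point about Role scope, the Python sketch below models how a RoleBinding is evaluated. It is a simplification of Kubernetes RBAC, not the real authorizer, and the names Rule and role_binding_allows are hypothetical. The key behaviour it shows: a RoleBinding is only consulted for requests made inside its own namespace, so a cluster-scope list (modelled here as an empty request namespace) is denied no matter which resources the Role names.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Rule:
    """Simplified stand-in for one entry under a Role's `rules`."""
    api_groups: Tuple[str, ...]
    resources: Tuple[str, ...]
    verbs: Tuple[str, ...]


def role_binding_allows(
    binding_namespace: str,
    rules: Iterable[Rule],
    verb: str,
    api_group: str,
    resource: str,
    request_namespace: str,
) -> bool:
    """Return True if a RoleBinding in `binding_namespace` authorizes the request.

    `request_namespace == ""` models a cluster-scope request, e.g. listing
    pods across all namespaces or listing clusterrolebindings.
    """
    # A RoleBinding never applies outside its own namespace.
    if not request_namespace or request_namespace != binding_namespace:
        return False
    return any(
        (api_group in r.api_groups or "*" in r.api_groups)
        and (resource in r.resources or "*" in r.resources)
        and (verb in r.verbs or "*" in r.verbs)
        for r in rules
    )


if __name__ == "__main__":
    rules = [
        Rule(
            api_groups=("", "rbac.authorization.k8s.io"),
            resources=("pods", "clusterrolebindings"),
            verbs=("get", "list"),
        )
    ]
    # Namespaced list inside "app": allowed by the Role.
    print(role_binding_allows("app", rules, "list", "", "pods", "app"))
    # Same resource listed at cluster scope: denied.
    print(role_binding_allows("app", rules, "list", "", "pods", ""))
    # Cluster-scoped resource: denied even though the Role names it.
    print(role_binding_allows(
        "app", rules, "list", "rbac.authorization.k8s.io", "clusterrolebindings", ""
    ))
```

This matches both errors reported earlier in the thread: the clusterrolebindings list and the pods list were each issued "at the cluster scope", so the namespaced Role could not satisfy them.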

mcgrawia commented 1 year ago

Hi @blackpiglet, I work closely with @fryz and can try to provide some context on our environment. We are deploying an application in an enterprise Kubernetes environment in which we only have access to a single namespace as the application vendor. The enterprise controls do not allow us to use any ClusterRoles or ClusterRoleBindings, especially ClusterAdministrator. We are hoping to be able to use Velero in a namespace-scoped fashion so we can back up and restore our application.

Thanks for the tip about excluding the service accounts. I will try to get a test setup in the next few days to try your suggestion. @fryz is currently out but should be back next week as well with more information.

Thanks for the help!

blackpiglet commented 1 year ago

I tested installing Velero without any cluster-scoped resource permissions, but the Velero server failed to start. The failure is:

time="2023-10-08T09:10:44Z" level=info msg="Checking existence of namespace." logSource="pkg/cmd/server/server.go:445" namespace=velero
An error occurred: namespaces "velero" is forbidden: User "system:serviceaccount:velero:velero-server" cannot get resource "namespaces" in API group "" in the namespace "velero"

I think this is due to the Velero server needing to confirm the namespace it runs in exists before spinning up the controllers.

https://github.com/vmware-tanzu/velero/blob/b7cc62d077818854b29113c70a6e4ec5bf0ca1ad/pkg/cmd/server/server.go#L412-L417

It seems that error didn't happen in your cluster. Could you help confirm the Velero permission settings in your environment? It looks like some cluster-scoped resource access is unavoidable.

fryz commented 1 year ago

Hey @blackpiglet - thanks for the work and attempts to help us get this working.

Quick note - when we originally ran into this issue, we were using Velero 1.10 and the 2.32 version of the helm chart. It looks like things have changed a bit (esp. on the helm chart) since that release, so the specific errors seen above might only impact the version that we're deploying.

My plan was to take the first step and confirm if this works on the latest version using your workaround re: ServiceAccounts above (sounds like it doesn't work based on your latest comment?). Then, if it doesn't work, I was going to try on the versions that we are using to see if the workaround works there.

fryz commented 1 year ago

Hey @blackpiglet

Just quickly wanted to let you know that when I tried installing/configuring using the setup above only using Role/RoleBindings (on the 1.10 version of velero with the 2.32 version of the helm-chart), I ran into issues with Velero during server initialization because it's looking for the Velero Custom Resources at cluster scope:

time="2023-10-09T23:56:44Z" level=error msg="failed to list backups" error="backups.velero.io is forbidden: User \"system:serviceaccount:arthur:velero-server\" cannot list resource \"backups\" in API group \"velero.io\" at the cluster scope" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/cmd/server/server.go:954" error.function=github.com/vmware-tanzu/velero/pkg/cmd/server.markInProgressBackupsFailed logSource="pkg/cmd/server/server.go:954"

So I think from our end, I wasn't able to get to the point where I could take a backup without granting the velero-server ServiceAccount cluster-scoped permissions on the Velero Cluster Resources.

blackpiglet commented 1 year ago

@fryz Thanks for reporting this issue. The code does look a little different there.

I think this piece of code is used to mark InProgress backups as failed when the Velero server starts. Could you try deleting the InProgress backups before starting the Velero server? This is a temporary workaround. It should let startup get further so we can see whether there are other obstacles. I will try to resolve this issue in the main branch.
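
One way to find the backups to delete for this workaround is sketched below in Python. It assumes the Backup CRs in the Velero namespace have been dumped with `kubectl get backups.velero.io -n <namespace> -o json` and parsed into a dict; the function name in_progress_backups is hypothetical. It only selects names whose status.phase is InProgress and does not delete anything itself.

```python
import json
from typing import Dict, List


def in_progress_backups(backup_list: Dict) -> List[str]:
    """Return the names of Backup CRs whose status.phase is "InProgress".

    `backup_list` is assumed to have the shape of a Kubernetes List object,
    i.e. {"items": [{"metadata": {...}, "status": {...}}, ...]}.
    """
    names = []
    for item in backup_list.get("items") or []:
        status = item.get("status") or {}  # new CRs may have no status yet
        if status.get("phase") == "InProgress":
            names.append(item["metadata"]["name"])
    return names


if __name__ == "__main__":
    # Inline sample standing in for the parsed kubectl output.
    raw = """
    {"items": [
      {"metadata": {"name": "backup-a"}, "status": {"phase": "Completed"}},
      {"metadata": {"name": "backup-b"}, "status": {"phase": "InProgress"}},
      {"metadata": {"name": "backup-c"}}
    ]}
    """
    for name in in_progress_backups(json.loads(raw)):
        print(name)
```

Each returned name could then be removed with `kubectl delete backups.velero.io <name> -n <namespace>` while the server is stopped, provided the namespaced Role permits get, list, and delete on backups in that namespace.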

blackpiglet commented 1 year ago

@fryz I tested with the PR. The error about failing to read the Velero CRs at cluster scope is gone, but when creating a backup, there are still many errors caused by missing permissions to read k8s resources.

Errors:
  Velero:    error: /backendconfigs.cloud.google.com is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "backendconfigs" in API group "cloud.google.com" in the namespace "default"
             error: /capacityrequests.internal.autoscaling.gke.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "capacityrequests" in API group "internal.autoscaling.gke.io" in the namespace "default"
             error: /managedcertificates.networking.gke.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "managedcertificates" in API group "networking.gke.io" in the namespace "default"
             error: /serviceattachments.networking.gke.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "serviceattachments" in API group "networking.gke.io" in the namespace "default"
             error: /servicenetworkendpointgroups.networking.gke.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "servicenetworkendpointgroups" in API group "networking.gke.io" in the namespace "default"
             error: /frontendconfigs.networking.gke.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "frontendconfigs" in API group "networking.gke.io" in the namespace "default"
             error: /volumesnapshots.snapshot.storage.k8s.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "volumesnapshots" in API group "snapshot.storage.k8s.io" in the namespace "default"
             error: /deletebackuprequests.velero.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "deletebackuprequests" in API group "velero.io" in the namespace "default"
             error: /backups.velero.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "backups" in API group "velero.io" in the namespace "default"
             error: /downloadrequests.velero.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "downloadrequests" in API group "velero.io" in the namespace "default"
             error: /volumesnapshotlocations.velero.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "volumesnapshotlocations" in API group "velero.io" in the namespace "default"
             error: /serverstatusrequests.velero.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "serverstatusrequests" in API group "velero.io" in the namespace "default"
             error: /backuprepositories.velero.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "backuprepositories" in API group "velero.io" in the namespace "default"
             error: /restores.velero.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "restores" in API group "velero.io" in the namespace "default"
             error: /backupstoragelocations.velero.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "backupstoragelocations" in API group "velero.io" in the namespace "default"
             error: /podvolumebackups.velero.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "podvolumebackups" in API group "velero.io" in the namespace "default"
             error: /schedules.velero.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "schedules" in API group "velero.io" in the namespace "default"
             error: /podvolumerestores.velero.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "podvolumerestores" in API group "velero.io" in the namespace "default"
             error: /datadownloads.velero.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "datadownloads" in API group "velero.io" in the namespace "default"
             error: /datauploads.velero.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "datauploads" in API group "velero.io" in the namespace "default"
             error: /updateinfos.nodemanagement.gke.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "updateinfos" in API group "nodemanagement.gke.io" in the namespace "default"
  Cluster:    <none>
  Namespaces:
    default:   resource: /backendconfigs error: /backendconfigs.cloud.google.com is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "backendconfigs" in API group "cloud.google.com" in the namespace "default"
               resource: /capacityrequests error: /capacityrequests.internal.autoscaling.gke.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "capacityrequests" in API group "internal.autoscaling.gke.io" in the namespace "default"
               resource: /managedcertificates error: /managedcertificates.networking.gke.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "managedcertificates" in API group "networking.gke.io" in the namespace "default"
               resource: /serviceattachments error: /serviceattachments.networking.gke.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "serviceattachments" in API group "networking.gke.io" in the namespace "default"
               resource: /servicenetworkendpointgroups error: /servicenetworkendpointgroups.networking.gke.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "servicenetworkendpointgroups" in API group "networking.gke.io" in the namespace "default"
               resource: /frontendconfigs error: /frontendconfigs.networking.gke.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "frontendconfigs" in API group "networking.gke.io" in the namespace "default"
               resource: /volumesnapshots error: /volumesnapshots.snapshot.storage.k8s.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "volumesnapshots" in API group "snapshot.storage.k8s.io" in the namespace "default"
               resource: /deletebackuprequests error: /deletebackuprequests.velero.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "deletebackuprequests" in API group "velero.io" in the namespace "default"
               resource: /backups error: /backups.velero.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "backups" in API group "velero.io" in the namespace "default"
               resource: /downloadrequests error: /downloadrequests.velero.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "downloadrequests" in API group "velero.io" in the namespace "default"
               resource: /volumesnapshotlocations error: /volumesnapshotlocations.velero.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "volumesnapshotlocations" in API group "velero.io" in the namespace "default"
               resource: /serverstatusrequests error: /serverstatusrequests.velero.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "serverstatusrequests" in API group "velero.io" in the namespace "default"
               resource: /backuprepositories error: /backuprepositories.velero.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "backuprepositories" in API group "velero.io" in the namespace "default"
               resource: /restores error: /restores.velero.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "restores" in API group "velero.io" in the namespace "default"
               resource: /backupstoragelocations error: /backupstoragelocations.velero.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "backupstoragelocations" in API group "velero.io" in the namespace "default"
               resource: /podvolumebackups error: /podvolumebackups.velero.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "podvolumebackups" in API group "velero.io" in the namespace "default"
               resource: /schedules error: /schedules.velero.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "schedules" in API group "velero.io" in the namespace "default"
               resource: /podvolumerestores error: /podvolumerestores.velero.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "podvolumerestores" in API group "velero.io" in the namespace "default"
               resource: /datadownloads error: /datadownloads.velero.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "datadownloads" in API group "velero.io" in the namespace "default"
               resource: /datauploads error: /datauploads.velero.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "datauploads" in API group "velero.io" in the namespace "default"
               resource: /updateinfos error: /updateinfos.nodemanagement.gke.io is forbidden: User "system:serviceaccount:velero:velero-server" cannot list resource "updateinfos" in API group "nodemanagement.gke.io" in the namespace "default"