vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

FSB backup on EFS Volume is asking for a VolumeSnapshotClass for provider #8278

Open darnone opened 3 days ago

darnone commented 3 days ago

What steps did you take and what happened: Hello, me again, and thank you in advance for your reply. I am now working on FSB. Backups are failing with this message:

Errors:
  Velero:    name: /cloudbees-efs-cloudbees-ci-persistence message: /Error backing up item error: /error executing custom action (groupResource=persistentvolumeclaims, namespace=cloudbees, name=cloudbees-efs-cloudbees-ci-persistence): rpc error: code = Unknown desc = failed to get VolumeSnapshotClass for StorageClass cloudbees-efs-cloudbees-ci-persistence: error getting VolumeSnapshotClass: failed to get VolumeSnapshotClass for provisioner efs.csi.aws.com, 
            ensure that the desired VolumeSnapshot class has the velero.io/csi-volumesnapshot-class label
             name: /jenkins-home-aam-aam-controller-1-0 message: /Error backing up item error: /error executing custom action (groupResource=persistentvolumeclaims, namespace=cloudbees, name=jenkins-home-aam-aam-controller-1-0): rpc error: code = Unknown desc = failed to get VolumeSnapshotClass for StorageClass cloudbees-efs-cloudbees-ci-persistence: error getting VolumeSnapshotClass: failed to get VolumeSnapshotClass for provisioner efs.csi.aws.com, 
            ensure that the desired VolumeSnapshot class has the velero.io/csi-volumesnapshot-class label
             name: /jenkins-home-aegon-gts-aegon-gts-admin-controller-1-0 message: /Error backing up item error: /error executing custom action (groupResource=persistentvolumeclaims, namespace=cloudbees, name=jenkins-home-aegon-gts-aegon-gts-admin-controller-1-0): rpc error: code = Unknown desc = failed to get VolumeSnapshotClass for StorageClass cloudbees-efs-cloudbees-ci-persistence: error getting VolumeSnapshotClass: failed to get VolumeSnapshotClass for provisioner efs.csi.aws.com, 
            ensure that the desired VolumeSnapshot class has the velero.io/csi-volumesnapshot-class label
             name: /jenkins-home-aegonai-controller-1-0 message: /Error backing up item error: /error executing custom action (groupResource=persistentvolumeclaims, namespace=cloudbees, name=jenkins-home-aegonai-controller-1-0): rpc error: code = Unknown desc = failed to get VolumeSnapshotClass for StorageClass cloudbees-efs-cloudbees-ci-persistence: error getting VolumeSnapshotClass: failed to get VolumeSnapshotClass for provisioner efs.csi.aws.com, 
            ensure that the desired VolumeSnapshot class has the velero.io/csi-volumesnapshot-class label
             name: /jenkins-home-cloud-foundation-0 message: /Error backing up item error: /error executing custom action (groupResource=persistentvolumeclaims, namespace=cloudbees, name=jenkins-home-cloud-foundation-0): rpc error: code = Unknown desc = failed to get VolumeSnapshotClass for StorageClass cloudbees-efs-cloudbees-ci-persistence: error getting VolumeSnapshotClass: failed to get VolumeSnapshotClass for provisioner efs.csi.aws.com, 
            ensure that the desired VolumeSnapshot class has the velero.io/csi-volumesnapshot-class label
             name: /jenkins-home-cloud-fusion-0 message: /Error backing up item error: /error executing custom action (groupResource=persistentvolumeclaims, namespace=cloudbees, name=jenkins-home-cloud-fusion-0): rpc error: code = Unknown desc = failed to get VolumeSnapshotClass for StorageClass cloudbees-efs-cloudbees-ci-persistence: error getting VolumeSnapshotClass: failed to get VolumeSnapshotClass for provisioner efs.csi.aws.com, 
            ensure that the desired VolumeSnapshot class has the velero.io/csi-volumesnapshot-class label

When I was debugging this under tickets 8269/8262, I received a similar message, and it turned out Velero could not find a VolumeSnapshotClass due to a typo. But my understanding is that FSB isn't handled by the EFS CSI driver, so why is it asking for a VolumeSnapshotClass? I have a small sample project that works.

I found the storage class for cloudbees was missing the annotation so I added it manually using kubectl edit. Will that work or do I need to bring the whole application down and reprovision it?
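If a labeled VolumeSnapshotClass were actually needed (as the error message suggests, though later in this thread it turns out pure FSB doesn't need one), checking and applying the label looks roughly like this sketch. The class name `efs-snapclass` is hypothetical; labels applied with `kubectl label` take effect immediately, with no need to reprovision workloads:

```shell
# See which VolumeSnapshotClasses exist and whether any carries
# the label Velero's CSI plugin looks for.
kubectl get volumesnapshotclass --show-labels

# Add the selection label to a class ("efs-snapclass" is a
# hypothetical name; substitute the real class for the provisioner).
kubectl label volumesnapshotclass efs-snapclass \
  velero.io/csi-volumesnapshot-class=true
```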

What did you expect to happen: The backup describe suggests this worked for the k8s objects, and there is data under /kopia. I noticed that not all namespaces are backed up; there are a couple of cloudbees namespaces that are not included in the list.

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue; for more options please refer to velero debug --help

bundle-2024-10-08-11-35-57.tar.gz

Anything else you would like to add: My configuration is as follows. Should I remove the EnableCSI feature and the volume snapshot location? But again, my example works.

configuration:
  features: EnableCSI
  uploaderType: kopia
  backupStorageLocation:
  - name: velero-backup-storage-location
    bucket: {{ .Values.velero_backups_bucket }}
    default: true
    provider: aws
    config:
      region: us-east-1
  volumeSnapshotLocation:
  - name: velero-volume-storage-location
    provider: aws
    config:
      region: us-east-1
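For comparison, an FSB-only variant of the values above would drop the CSI feature flag and the volumeSnapshotLocation, since kopia uploads go to the backup storage location rather than through volume snapshots. This is a sketch, not verified against the chart's schema:

```yaml
configuration:
  uploaderType: kopia
  backupStorageLocation:
  - name: velero-backup-storage-location
    bucket: {{ .Values.velero_backups_bucket }}
    default: true
    provider: aws
    config:
      region: us-east-1
```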

Environment:

Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

darnone commented 3 days ago

Backup describe attached here: describe-fsb.txt

darnone commented 3 days ago

I have this wrong. For FSB backups the storage class does not need a label, and a VolumeSnapshotClass is not needed either. I made the change to my example and it backs up and restores EFS. I reverted the manual changes to the cloudbees StorageClass (no label, and there is no VolumeSnapshotClass). But the cloudbees backup is still complaining about a missing VolumeSnapshotClass for provider efs. I don't know why this is happening.

darnone commented 3 days ago

I see this in the log, but these PVCs don't exist.

time="2024-10-08T18:36:15Z" level=info msg="Summary for skipped PVs: [{\"name\":\"pvc-01956941-be28-4fbf-86fa-2c403b02ba0c\",\"reasons\":[{\"approach\":\"volumeSnapshot\",\"reason\":\"not satisfy the criteria for VolumePolicy or the legacy snapshot way\"}]},{\"name\":\"pvc-08584729-c37d-4a76-9d3a-6ee2edf66bc0\",\"reasons\":[{\"approach\":\"volumeSnapshot\",\"reason\":\"not satisfy the criteria for VolumePolicy or the legacy snapshot way\"}]},{\"name\":\"pvc-16aa3321-5075-4f03-aeaf-4f681e0f0806\",\"reasons\":[{\"approach\":\"volumeSnapshot\",\"reason\":\"not satisfy the criteria for VolumePolicy or the legacy snapshot way\"}]},{\"name\":\"pvc-23e32673-b3c2-491a-864e-f63c961d4866\",\"reasons\":[{\"approach\":\"volumeSnapshot\",\"reason\":\"not satisfy the criteria for VolumePolicy or the legacy snapshot way\"}]},{\"name\":\"pvc-2b5c2f6a-361a-4edc-a54c-505f54423783\",\"reasons\":[{\"approach\":\"volumeSnapshot\",\"reason\":\"not satisfy the criteria for VolumePolicy or the legacy snapshot way\"}]},{\"name\":\"pvc-2dc4f712-1ad6-429b-a43b-9755cc2a3bd3\",\"reasons\":[{\"approach\":\"volumeSnapshot\",\"reason\":\"not satisfy the criteria for VolumePolicy or the legacy snapshot way\"}]},{\"name\":\"pvc-2ea275f9-a5de-44db-8077-fd6f6654fb41\",\"reasons\":[{\"approach\":\"volumeSnapshot\",\"reason\":\"not satisfy the criteria for VolumePolicy or the legacy snapshot way\"}]},{\"name\":\"pvc-302ef10a-e062-4307-9297-aa1aadf3f012\",\"reasons\":[{\"approach\":\"volumeSnapshot\",\"reason\":\"not satisfy the criteria for VolumePolicy or the legacy snapshot way\"}]},{\"name\":\"pvc-361c1f06-08d4-459b-a976-fefa9fa0f1d5\",\"reasons\":[{\"approach\":\"volumeSnapshot\",\"reason\":\"not satisfy the criteria for VolumePolicy or the legacy snapshot way\"}]},{\"name\":\"pvc-37a49ad5-95bb-4353-a7d6-740a950749b5\",\"reasons\":[{\"approach\":\"volumeSnapshot\",\"reason\":\"not satisfy the criteria for VolumePolicy or the legacy snapshot 
way\"}]},{\"name\":\"pvc-59b56419-c714-4e05-9226-078fcc4c6f1d\",\"reasons\":[{\"approach\":\"volumeSnapshot\",\"reason\":\"not satisfy the criteria for VolumePolicy or the legacy snapshot way\"}]},{\"name\":\"pvc-64eb86d7-6693-47b1-9709-7da621e42699\",\"reasons\":[{\"approach\":\"volumeSnapshot\",\"reason\":\"not satisfy the criteria for VolumePolicy or the legacy snapshot way\"}]},{\"name\":\"pvc-736bd1a1-e5d5-474f-b9bd-e5e672353f1c\",\"reasons\":[{\"approach\":\"volumeSnapshot\",\"reason\":\"not satisfy the criteria for VolumePolicy or the legacy snapshot way\"}]},{\"name\":\"pvc-749b754e-2b73-4f22-96f7-b2e9a430deac\",\"reasons\":[{\"approach\":\"volumeSnapshot\",\"reason\":\"not satisfy the criteria for VolumePolicy or the legacy snapshot way\"}]},{\"name\":\"pvc-86395706-723e-45f1-b488-95fab34a4dc6\",\"reasons\":[{\"approach\":\"volumeSnapshot\",\"reason\":\"not satisfy the criteria for VolumePolicy or the legacy snapshot way\"}]},{\"name\":\"pvc-8a0cff97-9986-41a1-8e1a-4e85abd4067c\",\"reasons\":[{\"approach\":\"volumeSnapshot\",\"reason\":\"not satisfy the criteria for VolumePolicy or the legacy snapshot way\"}]},{\"name\":\"pvc-8f9f86c4-ca40-4dbb-b62a-2f5e0157273b\",\"reasons\":[{\"approach\":\"volumeSnapshot\",\"reason\":\"not satisfy the criteria for VolumePolicy or the legacy snapshot way\"}]},{\"name\":\"pvc-8fa84540-e1ef-4b1f-9df9-76d2cba7915a\",\"reasons\":[{\"approach\":\"volumeSnapshot\",\"reason\":\"not satisfy the criteria for VolumePolicy or the legacy snapshot way\"}]},{\"name\":\"pvc-957b074d-0e24-4336-9ddf-d0e78787d7ea\",\"reasons\":[{\"approach\":\"volumeSnapshot\",\"reason\":\"not satisfy the criteria for VolumePolicy or the legacy snapshot way\"}]},{\"name\":\"pvc-a1c805f4-fd5c-4694-a8c9-822dfd4fc0c5\",\"reasons\":[{\"approach\":\"volumeSnapshot\",\"reason\":\"not satisfy the criteria for VolumePolicy or the legacy snapshot 
way\"}]},{\"name\":\"pvc-a210c1d5-0e04-43b1-889b-fdb530ac1851\",\"reasons\":[{\"approach\":\"volumeSnapshot\",\"reason\":\"not satisfy the criteria for VolumePolicy or the legacy snapshot way\"}]},{\"name\":\"pvc-b15c03f3-2312-42b8-8b2a-e4cbc9a621f1\",\"reasons\":[{\"approach\":\"volumeSnapshot\",\"reason\":\"not satisfy the criteria for VolumePolicy or the legacy snapshot way\"}]},{\"name\":\"pvc-c4d22f7d-ec41-486f-b7be-419595fbf966\",\"reasons\":[{\"approach\":\"volumeSnapshot\",\"reason\":\"not satisfy the criteria for VolumePolicy or the legacy snapshot way\"}]},{\"name\":\"pvc-d0d17ccb-ac24-4d9b-a6bf-4d98900d18f7\",\"reasons\":[{\"approach\":\"volumeSnapshot\",\"reason\":\"not satisfy the criteria for VolumePolicy or the legacy snapshot way\"}]},{\"name\":\"pvc-d94402fa-678d-4c0e-8f0a-965be04ab136\",\"reasons\":[{\"approach\":\"volumeSnapshot\",\"reason\":\"not satisfy the criteria for VolumePolicy or the legacy snapshot way\"}]},{\"name\":\"pvc-de21f865-6719-4f8b-8acf-38c73ed4fdd5\",\"reasons\":[{\"approach\":\"volumeSnapshot\",\"reason\":\"not satisfy the criteria for VolumePolicy or the legacy snapshot way\"}]},{\"name\":\"pvc-e7713035-6ee4-4011-902d-2f32a5def841\",\"reasons\":[{\"approach\":\"volumeSnapshot\",\"reason\":\"not satisfy the criteria for VolumePolicy or the legacy snapshot way\"}]},{\"name\":\"pvc-ee799942-2415-4f4b-8b61-7cf8adef5916\",\"reasons\":[{\"approach\":\"volumeSnapshot\",\"reason\":\"not satisfy the criteria for VolumePolicy or the legacy snapshot way\"}]},{\"name\":\"pvc-ee8dd927-b20d-46a2-98d2-2ef5be063cbd\",\"reasons\":[{\"approach\":\"volumeSnapshot\",\"reason\":\"not satisfy the criteria for VolumePolicy or the legacy snapshot way\"}]}]" backup=velero/gts-cloudbees-ci-dev-schedule-fsb logSource="pkg/backup/backup.go:495"
time="2024-10-08T18:36:15Z" level=info msg="Backed up a total of 1053 items" backup=velero/gts-cloudbees-ci-dev-schedule-fsb logSource="pkg/backup/backup.go:499" progress=
darnone commented 3 days ago

On the phone with Amazon and ran the cycle again:

  1. Removed Velero deployment
  2. Removed velero namespace
  3. Cleared out S3
  4. Redeployed Velero
  5. Created backup: velero create backup gts-cloudbees-ci-dev-backup-fsb --default-volumes-to-fs-backup=true --include-namespaces cloudbees
  6. Backup partially failed: Errors: Velero: name: /cloudbees-efs-cloudbees-ci-persistence message: /Error backing up item error: /error executing custom action (groupResource=persistentvolumeclaims, namespace=cloudbees, name=cloudbees-efs-cloudbees-ci-persistence): rpc error: code = Unknown desc = failed to get VolumeSnapshotClass for StorageClass cloudbees-efs-cloudbees-ci-persistence: error getting VolumeSnapshotClass: failed to get VolumeSnapshotClass for provisioner efs.csi.aws.com,
  7. Ran a restore: velero create restore gts-cloudbees-ci-dev-restore-fsb2 --from-backup gts-cloudbees-ci-dev-backup-fsb2 --include-namespaces cloudbees --namespace-mappings cloudbees:cloudbees-tmp
  8. Restore replaced everything in a different namespace as expected - all pods and PVCs were in place - the only issue was the ALBs did not bind because they were already bound.

IS THIS A BUG???

ywk253100 commented 3 days ago

Are there any orphan PVCs (PVCs that are not bound to any pods) in your case? File System Backup only backs up the PVCs bound to pods; the orphan PVCs will be handled by the CSI snapshot, which needs the labeled VolumeSnapshotClass. If you don't want to back up the orphan PVCs, you can set --snapshot-volumes=false when creating the backup.
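Following that suggestion, the backup command used earlier in this thread with snapshots disabled would look roughly like this sketch:

```shell
# FSB-only backup: pod volumes go through kopia, and volume
# snapshots (which would require a labeled VolumeSnapshotClass)
# are disabled, so unmounted PVCs are skipped instead of failing.
velero backup create gts-cloudbees-ci-dev-backup-fsb \
  --default-volumes-to-fs-backup=true \
  --include-namespaces cloudbees \
  --snapshot-volumes=false
```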

darnone commented 3 days ago

Wenkai, thank you for your reply. I have no orphaned PVCs anywhere in that cluster that I can see:

 k get pvc -A
NAMESPACE    NAME                                                                                           STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                             AGE
cloudbees    cloudbees-efs-cloudbees-ci-persistence                                                         Bound    pvc-c16f9430-cf97-46b1-a9c9-2b31cf93b031   5Gi        RWX            cloudbees-efs-cloudbees-ci-persistence   89d
cloudbees    jenkins-home-aam-aam-controller-1-0                                                            Bound    pvc-99e6b566-eb8b-4ea0-8dc9-1a2f4fdb3a9f   50Gi       RWO            cloudbees-efs-cloudbees-ci-persistence   48d
cloudbees    jenkins-home-aegon-gts-aegon-gts-admin-controller-1-0                                          Bound    pvc-8bf60237-af5d-4a28-855b-4b939815f9c6   50Gi       RWO            cloudbees-efs-cloudbees-ci-persistence   48d
cloudbees    jenkins-home-aegonai-controller-1-0                                                            Bound    pvc-2e75efab-3698-403a-a7c7-1b128560aa6a   50Gi       RWO            cloudbees-efs-cloudbees-ci-persistence   48d
cloudbees    jenkins-home-cjoc-0                                                                            Bound    pvc-d668a34f-2d68-4514-9a72-7bbc0628c4cb   20Gi       RWO            cloudbees-efs-cloudbees-ci-persistence   89d
cloudbees    jenkins-home-cloud-foundation-0                                                                Bound    pvc-1da42f83-2acd-43fc-92be-bea2157ae1d8   50Gi       RWO            cloudbees-efs-cloudbees-ci-persistence   48d
cloudbees    jenkins-home-cloud-fusion-0                                                                    Bound    pvc-a637460c-c145-4ce4-a03b-690f4e1f6a67   50Gi       RWO            cloudbees-efs-cloudbees-ci-persistence   48d
monitoring   kube-prometheus-stack-grafana                                                                  Bound    pvc-69b5940a-f3a7-452e-b340-74beaaf7c46f   5Gi        RWO            kube-grafana-sc                          4d22h
monitoring   prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0   Bound    pvc-444a5ea4-2181-4546-a9d1-88f787e55e44   5Gi        RWO            kube-prometheus-sc                       4d22h

Although, the first one in the list (the 5 GB PVC) isn't really attached to any pod directly and seems to be the offending PVC.
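One way to check for orphan PVCs mechanically is to diff the set of PVC names against the claims actually mounted by pods. A sketch, assuming kubectl and jq are available (the namespace is the one from this thread):

```shell
ns=cloudbees

# Claims referenced by pod volumes in the namespace.
kubectl get pods -n "$ns" -o json \
  | jq -r '.items[].spec.volumes[]?.persistentVolumeClaim.claimName // empty' \
  | sort -u > /tmp/claimed.txt

# All PVCs in the namespace.
kubectl get pvc -n "$ns" -o json \
  | jq -r '.items[].metadata.name' | sort -u > /tmp/all.txt

# Names present in all.txt but absent from claimed.txt are orphans.
comm -23 /tmp/all.txt /tmp/claimed.txt
```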

ywk253100 commented 2 days ago

Orphaned PVCs are PVCs that are not attached to any pod. Are these PVCs in the list attached to any pods? Could you show me the YAML of the pods that these PVCs are attached to?

darnone commented 2 days ago

All of the cloudbees PVCs are attached to cloudbees controller pods, except for the first one, although it shows as Bound. The pods are created from a StatefulSet. Attached are the pods for these controllers:

NAME                                                    STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                             AGE
cloudbees-efs-cloudbees-ci-persistence                  Bound    pvc-c16f9430-cf97-46b1-a9c9-2b31cf93b031   5Gi        RWX            cloudbees-efs-cloudbees-ci-persistence   90d
jenkins-home-aam-aam-controller-1-0                     Bound    pvc-99e6b566-eb8b-4ea0-8dc9-1a2f4fdb3a9f   50Gi       RWO            cloudbees-efs-cloudbees-ci-persistence   49d
jenkins-home-aegon-gts-aegon-gts-admin-controller-1-0   Bound    pvc-8bf60237-af5d-4a28-855b-4b939815f9c6   50Gi       RWO            cloudbees-efs-cloudbees-ci-persistence   49d
jenkins-home-aegonai-controller-1-0                     Bound    pvc-2e75efab-3698-403a-a7c7-1b128560aa6a   50Gi       RWO            cloudbees-efs-cloudbees-ci-persistence   49d
jenkins-home-cjoc-0                                     Bound    pvc-d668a34f-2d68-4514-9a72-7bbc0628c4cb   20Gi       RWO            cloudbees-efs-cloudbees-ci-persistence   90d
jenkins-home-cloud-foundation-0                         Bound    pvc-1da42f83-2acd-43fc-92be-bea2157ae1d8   50Gi       RWO            cloudbees-efs-cloudbees-ci-persistence   49d
jenkins-home-cloud-fusion-0                             Bound    pvc-a637460c-c145-4ce4-a03b-690f4e1f6a67   50Gi       RWO            cloudbees-efs-cloudbees-ci-persistence   49d

pods.zip

ywk253100 commented 1 day ago

Let's focus on the PVC aegon-gts-aegon-gts-admin-controller-1-0 to debug the issue.

The PVC aegon-gts-aegon-gts-admin-controller-1-0 is attached to pod aegon-gts-aegon-gts-admin-controller-1-0, but the pod aegon-gts-aegon-gts-admin-controller-1-0 isn't included in the backup:

  apps/v1/StatefulSet:
    - cloudbees/aam-aam-controller-1
    - cloudbees/aegon-gts-aegon-gts-admin-controller-1
    - cloudbees/aegonai-controller-1
    - cloudbees/cjoc
    - cloudbees/cloud-foundation
    - cloudbees/cloud-fusion
    - monitoring/alertmanager-kube-prometheus-stack-alertmanager
    - monitoring/prometheus-kube-prometheus-stack-prometheus
...
  v1/PersistentVolumeClaim:
    - cloudbees/cloudbees-efs-cloudbees-ci-persistence
    - cloudbees/jenkins-home-aam-aam-controller-1-0
    - cloudbees/jenkins-home-aegon-gts-aegon-gts-admin-controller-1-0
    - cloudbees/jenkins-home-aegonai-controller-1-0
    - cloudbees/jenkins-home-cjoc-0
    - cloudbees/jenkins-home-cloud-foundation-0
    - cloudbees/jenkins-home-cloud-fusion-0
    - efs-csi-snapshot/efs-pvc
    - monitoring/kube-prometheus-stack-grafana
    - monitoring/prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0
 ...
  v1/Pod:
    - cloudbees-sidecar-injector/cloudbees-sidecar-injector-28752480-vnttb
    - cloudbees-sidecar-injector/cloudbees-sidecar-injector-28772640-4glxg
    - cloudbees-sidecar-injector/cloudbees-sidecar-injector-28795680-nr5db
    - cloudbees-sidecar-injector/cloudbees-sidecar-injector-5ddf98b97d-jlf8s
    - cloudbees/cjoc-0
    - cloudbees/managed-master-hibernation-monitor-b8f7bd474-bq5fl
    - efs-csi-snapshot/efs-snapshot-deploy-6d6956ff5f-vkpz4
    - external-snapshotter/external-snapshotter-snapshot-controller-69d4456fdf-vhrlz
    - external-snapshotter/external-snapshotter-snapshot-validation-webhook-5f76cbfd8k4fq2
 ...

So the snapshot approach is being used for those PVCs.

Could you check whether the pods were created when you took the backup? It's possible that the pods weren't running yet when the backup was taken.
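A quick way to confirm pod state right before triggering a backup is a field selector on phase; FSB skips volumes of pods that are not Running, as the warnings later in this thread show. A sketch:

```shell
# Any pods in the namespace that are NOT in phase Running
# (Pending/Succeeded/Failed pods get their volumes skipped by FSB).
kubectl get pods -n cloudbees --field-selector=status.phase!=Running
```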

darnone commented 21 hours ago

Wenkai, I am not seeing above where the aegon-gts-aegon-gts-admin-controller-1-0 pod is not in the backup. To be thorough, all pods are running and all PVCs are bound - see attached:

pods.txt pvc.txt

then the backup: velero create backup gts-cloudbees-ci-dev-backup-fsb --default-volumes-to-fs-backup=true

Describe is: backup-describe.txt

One thing in the describe that is new is this set of warnings:

velero:   resource: /pods name: /repo-maintain-job-1728657244257-hxfw7 message: /Skip pod volume plugins error: /pod is not in the expected status, name=repo-maintain-job-1728657244257-hxfw7, namespace=velero, phase=Succeeded: pod is not running
              resource: /pods name: /repo-maintain-job-1728657244257-hxfw7 message: /Skip pod volume scratch error: /pod is not in the expected status, name=repo-maintain-job-1728657244257-hxfw7, namespace=velero, phase=Succeeded: pod is not running
k get pods -n velero 
NAME                                    READY   STATUS      RESTARTS   AGE
node-agent-j2z5m                        1/1     Running     0          2d3h
node-agent-l874j                        1/1     Running     0          2d3h
node-agent-l9rnf                        1/1     Running     0          2d3h
node-agent-n2ggf                        1/1     Running     0          2d3h
repo-maintain-job-1728657544257-788gw   0/1     Completed   0          179m
repo-maintain-job-1728660244263-sldkj   0/1     Completed   0          134m
repo-maintain-job-1728660544264-r5gzp   0/1     Completed   0          129m
repo-maintain-job-1728660844265-59snx   0/1     Completed   0          124m
repo-maintain-job-1728661144265-8txhx   0/1     Completed   0          119m
repo-maintain-job-1728663844272-bh7m9   0/1     Completed   0          74m
repo-maintain-job-1728664144272-9sdwm   0/1     Completed   0          69m
repo-maintain-job-1728664444273-7pzxb   0/1     Completed   0          64m
repo-maintain-job-1728664744274-td6fp   0/1     Completed   0          59m
repo-maintain-job-1728667444282-s9jc2   0/1     Completed   0          14m
repo-maintain-job-1728667744282-hd798   0/1     Completed   0          9m45s
repo-maintain-job-1728668044283-697ql   0/1     Completed   0          4m45s
velero-75d98b497-wf7df                  1/1     Running     0          2d3h

I am not sure what these jobs are or how to prevent them from appearing in the backup. Should I not be backing up the velero namespace? Are they from the schedule? If I exclude the velero namespace they go away.
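Excluding Velero's own namespace (where these repo-maintain jobs live) can be done per backup or on the schedule; a sketch (the backup name is illustrative):

```shell
# Exclude the velero namespace so Completed repo-maintain-job pods
# no longer generate "pod is not running" warnings.
velero backup create gts-cloudbees-ci-dev-backup-fsb2 \
  --default-volumes-to-fs-backup=true \
  --exclude-namespaces velero
```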

darnone commented 21 hours ago

In the describe I see the StatefulSet, Pod, PVC, and PV, but I still have the error:

apps/v1/StatefulSet:
    - cloudbees/aegon-gts-aegon-gts-admin-controller-1
 v1/Pod:
    - cloudbees/aegon-gts-aegon-gts-admin-controller-1-0
 v1/PersistentVolumeClaim:
    - cloudbees/jenkins-home-aegon-gts-aegon-gts-admin-controller-1-0
v1/PersistentVolume:
    - pvc-8bf60237-af5d-4a28-855b-4b939815f9c6

There are a whole lot of PVCs mentioned in the backup. I don't know where they are coming from:

k get pvc -A
NAMESPACE    NAME                                                                                           STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                             AGE
cloudbees    cloudbees-efs-cloudbees-ci-persistence                                                         Bound    pvc-c16f9430-cf97-46b1-a9c9-2b31cf93b031   5Gi        RWX            cloudbees-efs-cloudbees-ci-persistence   92d
cloudbees    jenkins-home-aam-aam-controller-1-0                                                            Bound    pvc-99e6b566-eb8b-4ea0-8dc9-1a2f4fdb3a9f   50Gi       RWO            cloudbees-efs-cloudbees-ci-persistence   50d
cloudbees    jenkins-home-aegon-gts-aegon-gts-admin-controller-1-0                                          Bound    pvc-8bf60237-af5d-4a28-855b-4b939815f9c6   50Gi       RWO            cloudbees-efs-cloudbees-ci-persistence   50d
cloudbees    jenkins-home-aegonai-controller-1-0                                                            Bound    pvc-2e75efab-3698-403a-a7c7-1b128560aa6a   50Gi       RWO            cloudbees-efs-cloudbees-ci-persistence   50d
cloudbees    jenkins-home-cjoc-0                                                                            Bound    pvc-d668a34f-2d68-4514-9a72-7bbc0628c4cb   20Gi       RWO            cloudbees-efs-cloudbees-ci-persistence   92d
cloudbees    jenkins-home-cloud-foundation-0                                                                Bound    pvc-1da42f83-2acd-43fc-92be-bea2157ae1d8   50Gi       RWO            cloudbees-efs-cloudbees-ci-persistence   50d
cloudbees    jenkins-home-cloud-fusion-0                                                                    Bound    pvc-a637460c-c145-4ce4-a03b-690f4e1f6a67   50Gi       RWO            cloudbees-efs-cloudbees-ci-persistence   50d
monitoring   kube-prometheus-stack-grafana                                                                  Bound    pvc-69b5940a-f3a7-452e-b340-74beaaf7c46f   5Gi        RWO            kube-grafana-sc                          7d3h
monitoring   prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0   Bound    pvc-444a5ea4-2181-4546-a9d1-88f787e55e44   5Gi        RWO            kube-prometheus-sc                       7d3h
ywk253100 commented 13 hours ago

According to the latest backup description you provided, it seems everything works as expected now.

Name:         gts-cloudbees-ci-dev-backup-fsb
Namespace:    velero
Labels:       velero.io/storage-location=velero-backup-storage-location
Annotations:  velero.io/resource-timeout=10m0s
              velero.io/source-cluster-k8s-gitversion=v1.28.13-eks-a737599
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=28+

Phase:  PartiallyFailed (run `velero backup logs gts-cloudbees-ci-dev-backup-fsb` for more information)

Warnings:
  Velero:     <none>
  Cluster:    <none>
  Namespaces:
    velero:   resource: /pods name: /repo-maintain-job-1728657244257-hxfw7 message: /Skip pod volume plugins error: /pod is not in the expected status, name=repo-maintain-job-1728657244257-hxfw7, namespace=velero, phase=Succeeded: pod is not running
              resource: /pods name: /repo-maintain-job-1728657244257-hxfw7 message: /Skip pod volume scratch error: /pod is not in the expected status, name=repo-maintain-job-1728657244257-hxfw7, namespace=velero, phase=Succeeded: pod is not running
              resource: /pods name: /repo-maintain-job-1728657544257-788gw message: /Skip pod volume plugins error: /pod is not in the expected status, name=repo-maintain-job-1728657544257-788gw, namespace=velero, phase=Succeeded: pod is not running
              resource: /pods name: /repo-maintain-job-1728657544257-788gw message: /Skip pod volume scratch error: /pod is not in the expected status, name=repo-maintain-job-1728657544257-788gw, namespace=velero, phase=Succeeded: pod is not running
              resource: /pods name: /repo-maintain-job-1728660244263-sldkj message: /Skip pod volume plugins error: /pod is not in the expected status, name=repo-maintain-job-1728660244263-sldkj, namespace=velero, phase=Succeeded: pod is not running
              resource: /pods name: /repo-maintain-job-1728660244263-sldkj message: /Skip pod volume scratch error: /pod is not in the expected status, name=repo-maintain-job-1728660244263-sldkj, namespace=velero, phase=Succeeded: pod is not running
              resource: /pods name: /repo-maintain-job-1728660544264-r5gzp message: /Skip pod volume plugins error: /pod is not in the expected status, name=repo-maintain-job-1728660544264-r5gzp, namespace=velero, phase=Succeeded: pod is not running
              resource: /pods name: /repo-maintain-job-1728660544264-r5gzp message: /Skip pod volume scratch error: /pod is not in the expected status, name=repo-maintain-job-1728660544264-r5gzp, namespace=velero, phase=Succeeded: pod is not running
              resource: /pods name: /repo-maintain-job-1728660844265-59snx message: /Skip pod volume plugins error: /pod is not in the expected status, name=repo-maintain-job-1728660844265-59snx, namespace=velero, phase=Succeeded: pod is not running
              resource: /pods name: /repo-maintain-job-1728660844265-59snx message: /Skip pod volume scratch error: /pod is not in the expected status, name=repo-maintain-job-1728660844265-59snx, namespace=velero, phase=Succeeded: pod is not running
              resource: /pods name: /repo-maintain-job-1728661144265-8txhx message: /Skip pod volume plugins error: /pod is not in the expected status, name=repo-maintain-job-1728661144265-8txhx, namespace=velero, phase=Succeeded: pod is not running
              resource: /pods name: /repo-maintain-job-1728661144265-8txhx message: /Skip pod volume scratch error: /pod is not in the expected status, name=repo-maintain-job-1728661144265-8txhx, namespace=velero, phase=Succeeded: pod is not running
              resource: /pods name: /repo-maintain-job-1728663844272-bh7m9 message: /Skip pod volume plugins error: /pod is not in the expected status, name=repo-maintain-job-1728663844272-bh7m9, namespace=velero, phase=Succeeded: pod is not running
              resource: /pods name: /repo-maintain-job-1728663844272-bh7m9 message: /Skip pod volume scratch error: /pod is not in the expected status, name=repo-maintain-job-1728663844272-bh7m9, namespace=velero, phase=Succeeded: pod is not running
              resource: /pods name: /repo-maintain-job-1728664144272-9sdwm message: /Skip pod volume plugins error: /pod is not in the expected status, name=repo-maintain-job-1728664144272-9sdwm, namespace=velero, phase=Succeeded: pod is not running
              resource: /pods name: /repo-maintain-job-1728664144272-9sdwm message: /Skip pod volume scratch error: /pod is not in the expected status, name=repo-maintain-job-1728664144272-9sdwm, namespace=velero, phase=Succeeded: pod is not running
              resource: /pods name: /repo-maintain-job-1728664444273-7pzxb message: /Skip pod volume plugins error: /pod is not in the expected status, name=repo-maintain-job-1728664444273-7pzxb, namespace=velero, phase=Succeeded: pod is not running
              resource: /pods name: /repo-maintain-job-1728664444273-7pzxb message: /Skip pod volume scratch error: /pod is not in the expected status, name=repo-maintain-job-1728664444273-7pzxb, namespace=velero, phase=Succeeded: pod is not running
              resource: /pods name: /repo-maintain-job-1728664744274-td6fp message: /Skip pod volume plugins error: /pod is not in the expected status, name=repo-maintain-job-1728664744274-td6fp, namespace=velero, phase=Succeeded: pod is not running
              resource: /pods name: /repo-maintain-job-1728664744274-td6fp message: /Skip pod volume scratch error: /pod is not in the expected status, name=repo-maintain-job-1728664744274-td6fp, namespace=velero, phase=Succeeded: pod is not running
              resource: /pods name: /repo-maintain-job-1728667444282-s9jc2 message: /Skip pod volume plugins error: /pod is not in the expected status, name=repo-maintain-job-1728667444282-s9jc2, namespace=velero, phase=Succeeded: pod is not running
              resource: /pods name: /repo-maintain-job-1728667444282-s9jc2 message: /Skip pod volume scratch error: /pod is not in the expected status, name=repo-maintain-job-1728667444282-s9jc2, namespace=velero, phase=Succeeded: pod is not running
              resource: /pods name: /repo-maintain-job-1728667744282-hd798 message: /Skip pod volume plugins error: /pod is not in the expected status, name=repo-maintain-job-1728667744282-hd798, namespace=velero, phase=Succeeded: pod is not running
              resource: /pods name: /repo-maintain-job-1728667744282-hd798 message: /Skip pod volume scratch error: /pod is not in the expected status, name=repo-maintain-job-1728667744282-hd798, namespace=velero, phase=Succeeded: pod is not running

Errors:
  Velero:    name: /cloudbees-efs-cloudbees-ci-persistence message: /Error backing up item error: /error executing custom action (groupResource=persistentvolumeclaims, namespace=cloudbees, name=cloudbees-efs-cloudbees-ci-persistence): rpc error: code = Unknown desc = failed to get VolumeSnapshotClass for StorageClass cloudbees-efs-cloudbees-ci-persistence: error getting VolumeSnapshotClass: failed to get VolumeSnapshotClass for provisioner efs.csi.aws.com, 
            ensure that the desired VolumeSnapshot class has the velero.io/csi-volumesnapshot-class label
  Cluster:    <none>
  Namespaces: <none>
...

There is only one error this time; the PVC cloudbees-efs-cloudbees-ci-persistence isn't attached to any pods, right (according to the comment)?

Errors:
  Velero:    name: /cloudbees-efs-cloudbees-ci-persistence message: /Error backing up item error: /error executing custom action (groupResource=persistentvolumeclaims, namespace=cloudbees, name=cloudbees-efs-cloudbees-ci-persistence): rpc error: code = Unknown desc = failed to get VolumeSnapshotClass for StorageClass cloudbees-efs-cloudbees-ci-persistence: error getting VolumeSnapshotClass: failed to get VolumeSnapshotClass for provisioner efs.csi.aws.com, 
            ensure that the desired VolumeSnapshot class has the velero.io/csi-volumesnapshot-class label

The velero namespace should not be backed up. Exclude it from the backup, and the following warning messages should go away:

Warnings:
  Velero:     <none>
  Cluster:    <none>
  Namespaces:
    velero:   resource: /pods name: /repo-maintain-job-1728657244257-hxfw7 message: /Skip pod volume plugins error: /pod is not in the expected status, name=repo-maintain-job-1728657244257-hxfw7, namespace=velero, phase=Succeeded: pod is not running
...

BTW, if you only want to back up a specific namespace, use the --include-namespaces <namespace> option when creating the backup.
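The two suggestions from this thread combine naturally, and the result can then be inspected per volume. A sketch (the backup name is illustrative):

```shell
# Back up only the cloudbees namespace via FSB, with volume
# snapshots disabled so no VolumeSnapshotClass is needed.
velero backup create cloudbees-fsb \
  --include-namespaces cloudbees \
  --default-volumes-to-fs-backup=true \
  --snapshot-volumes=false

# Inspect per-volume results (kopia uploads vs skipped PVs).
velero backup describe cloudbees-fsb --details
```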