vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

velero skipping volume backups if connected to S3 backblaze #8235

Open mehransaeed7810 opened 2 months ago

mehransaeed7810 commented 2 months ago

Velero is unable to back up volumes (PVCs) when connected to Backblaze S3. As the output shows, it skips the PVCs:

Phase:  Completed

Warnings:
  Velero:
  Cluster:    resource: /persistentvolumes name: /pvc-c982c1ef-0d86-469b-bb53-4c6cf9e1cf23 message: /No volume ID returned by volume snapshotter for persistent volume
              resource: /persistentvolumes name: /pvc-eceb274c-99bb-45cd-9e55-a56f3f7c0dda message: /No volume ID returned by volume snapshotter for persistent volume
  Namespaces:

Describe the solution you'd like: it would be good to have Velero back up the PVCs as well and export them to Backblaze.

Environment:

blackpiglet commented 2 months ago

/No volume ID returned by volume snapshotter for persistent volume

This warning means Velero tried to back up the volume data with a Velero native snapshot, and the VolumeSnapshotter plugin did not recognize these PVs, so no volume ID was returned. Native snapshots are provided by Velero VolumeSnapshotter plugins; the Velero-supported plugins are listed here: https://velero.io/plugins/

As a result, it's better to use CSI snapshots or file-system backup for your scenario:
https://velero.io/docs/v1.14/csi/
https://velero.io/docs/v1.14/csi-snapshot-data-movement/
https://velero.io/docs/v1.14/file-system-backup/
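A CSI snapshot on its own normally stays in the storage backend, so getting the volume data into the Backblaze bucket needs either file-system backup or CSI snapshot data movement (the describe output later in this thread shows "Snapshot Move Data: false"). A minimal sketch of the two backup variants, assuming Velero was installed with the node-agent (velero install --use-node-agent) and, for data movement, with the EnableCSI feature flag; the helper names run, fs_backup and csi_move_backup are hypothetical, and DRY_RUN=1 prints the commands instead of running them:

```bash
#!/usr/bin/env bash
# Sketch: two ways to get PVC data into the backup storage location.
# Assumes the node-agent is installed (both paths rely on it).
# Helper names are hypothetical; set DRY_RUN=1 to print instead of execute.

run() {
  if [[ "${DRY_RUN:-0}" == "1" ]]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# File-system backup: Kopia copies files from the mounted volumes via the node-agent.
fs_backup() {
  local name="$1" namespace="$2"
  run velero backup create "$name" \
    --include-namespaces "$namespace" \
    --default-volumes-to-fs-backup
}

# CSI snapshot + data movement: take a CSI snapshot, then upload its data
# to the backup storage location. Needs EnableCSI and working CSI snapshots.
csi_move_backup() {
  local name="$1" namespace="$2"
  run velero backup create "$name" \
    --include-namespaces "$namespace" \
    --snapshot-move-data
}
```

For example, fs_backup uat-backup uat-test. File-system backup does not depend on the CSI snapshot stack at all, which can make it the simpler option to try first.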

mehransaeed7810 commented 2 months ago

Thanks @blackpiglet for responding. I tried the CSI snapshot approach as advised.

I installed these CRDs:

git clone https://github.com/kubernetes-csi/external-snapshotter.git
cd external-snapshotter/config/crd

kubectl apply -f snapshot.storage.k8s.io_volumesnapshotclasses.yaml
kubectl apply -f snapshot.storage.k8s.io_volumesnapshots.yaml
kubectl apply -f snapshot.storage.k8s.io_volumesnapshotcontents.yaml
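The CRDs only register the VolumeSnapshot API types; some component still has to act on VolumeSnapshot objects. In the external-snapshotter project that is the common snapshot controller (manifests under deploy/kubernetes/snapshot-controller), together with a csi-snapshotter sidecar that ships with the CSI driver itself. A hedged sketch for installing and checking the controller from the root of the cloned repo; the kube-system namespace and snapshot-controller Deployment name are assumed from the project's default manifests and may differ between versions:

```bash
#!/usr/bin/env bash
# Sketch: install and verify the external-snapshotter snapshot controller.
# Run from the root of the external-snapshotter checkout.
# Assumption: default manifests deploy "snapshot-controller" into kube-system.
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
  if [[ "${DRY_RUN:-0}" == "1" ]]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

install_snapshot_controller() {
  # kubectl apply -k builds the kustomization in that directory and applies it.
  run kubectl apply -k deploy/kubernetes/snapshot-controller
}

check_snapshot_controller() {
  # Blocks until the controller Deployment is available (or 120s pass).
  run kubectl -n kube-system rollout status deployment/snapshot-controller --timeout=120s
}

check_snapshot_crds() {
  # The three CRDs applied above should all be listed.
  run kubectl get crd \
    volumesnapshotclasses.snapshot.storage.k8s.io \
    volumesnapshots.snapshot.storage.k8s.io \
    volumesnapshotcontents.snapshot.storage.k8s.io
}
```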

Then I created this VolumeSnapshotClass:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: vsphere-snapshotclass
  labels:
    velero.io/csi-volumesnapshot-class: "true" 
driver: csi.vsphere.vmware.com
deletionPolicy: Delete  

I can see it has been created:

kubectl get VolumeSnapshotClass
NAME                    DRIVER                   DELETIONPOLICY   AGE
vsphere-snapshotclass   csi.vsphere.vmware.com   Delete           36m

However, running the backup again returns the same warnings as before:

velero backup create uat-backup --include-namespaces uat-test

velero backup describe goco-uat-backup
Name:         goco-uat-backup
Namespace:    velero
Labels:       velero.io/storage-location=default
Annotations:  velero.io/resource-timeout=10m0s
              velero.io/source-cluster-k8s-gitversion=v1.29.5+k3s1
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=29

Phase:  Completed

Warnings:
  Velero:     <none>
  Cluster:   resource: /persistentvolumes name: /pvc-9b1769e2-3eea-4fda-894b-6055de6fc087 message: /No volume ID returned by volume snapshotter for persistent volume
             resource: /persistentvolumes name: /pvc-cb137903-fc77-4c68-80db-91139f242fd4 message: /No volume ID returned by volume snapshotter for persistent volume
             resource: /persistentvolumes name: /pvc-0a4dee1b-0c36-42f6-99a2-95336ce8b8d0 message: /No volume ID returned by volume snapshotter for persistent volume
             resource: /persistentvolumes name: /pvc-7e692840-988d-45a2-92c7-0a039ab405f7 message: /No volume ID returned by volume snapshotter for persistent volume
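While these warnings persist, it is worth confirming which driver actually provisioned the listed PVs: a CSI snapshot only applies to CSI volumes, and the VolumeSnapshotClass driver (csi.vsphere.vmware.com above) has to match the PV's CSI driver. k3s's bundled default storage class uses the local-path provisioner, which is not a CSI driver. A minimal sketch with hypothetical helper names; classify_pv_driver is a pure function, so it can be tried without a cluster:

```bash
#!/usr/bin/env bash
# Sketch: check whether each PV from the warnings is a CSI volume handled
# by the driver named in the VolumeSnapshotClass. Helper names are hypothetical.

# Prints the CSI driver of a PV (empty output for non-CSI PVs).
pv_csi_driver() {
  kubectl get pv "$1" -o jsonpath='{.spec.csi.driver}'
  echo
}

# Pure helper: compares a PV's driver with the snapshot class driver.
classify_pv_driver() {
  local pv_driver="$1" class_driver="$2"
  if [[ -z "$pv_driver" ]]; then
    echo "not-csi"
  elif [[ "$pv_driver" == "$class_driver" ]]; then
    echo "match"
  else
    echo "mismatch:$pv_driver"
  fi
}

# Usage: check_pvs <class-driver> <pv-name>...
check_pvs() {
  local class_driver="$1"; shift
  local pv
  for pv in "$@"; do
    printf '%s %s\n' "$pv" "$(classify_pv_driver "$(pv_csi_driver "$pv")" "$class_driver")"
  done
}
```

For example, check_pvs csi.vsphere.vmware.com pvc-9b1769e2-3eea-4fda-894b-6055de6fc087 pvc-cb137903-fc77-4c68-80db-91139f242fd4.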
blackpiglet commented 2 months ago

I see. Please check whether you enabled the CSI feature flag in the Velero deployment. It can be enabled with the velero install CLI:

velero install \
--features=EnableCSI \
......

If Velero is already installed, you can edit the Velero deployment to enable it instead:

......
    spec:
      containers:
      - args:
        - server
        - --features=EnableCSI
        - --uploader-type=kopia
        command:
        - /velero
......
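To confirm the flag actually reached the running server after editing, one option is to read the container args back from the Deployment. A sketch assuming the default velero namespace and Deployment name, with the server as the first container as in the snippet above; has_enable_csi is a hypothetical pure helper that inspects the JSON array kubectl prints:

```bash
#!/usr/bin/env bash
# Sketch: confirm the Velero server runs with the EnableCSI feature flag.
# Assumes namespace "velero", deployment "velero", server in container 0.

velero_server_args() {
  kubectl -n velero get deployment velero \
    -o jsonpath='{.spec.template.spec.containers[0].args}'
}

# Pure helper: succeeds if a --features=... arg lists EnableCSI.
# The value is comma-separated, e.g. --features=EnableCSI,OtherFlag.
has_enable_csi() {
  grep -Eq -- '--features=([^",]*,)*EnableCSI([,"]|$)' <<<"$1"
}

check_enable_csi() {
  if has_enable_csi "$(velero_server_args)"; then
    echo "EnableCSI is set"
  else
    echo "EnableCSI is missing" >&2
    return 1
  fi
}
```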
mehransaeed7810 commented 2 months ago

Thanks @blackpiglet.

I edited the deployment. It looks like it's trying to back up, but it gets stuck at:

time="2024-09-24T07:31:09Z" level=info msg="Waiting for CSI driver to reconcile volumesnapshot goco-uat-mongodb/velero-data-volume-goco-uat-mongodb-0-cffqg. Retrying in 5s" backup=velero/goco-uat-backup cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:713" pluginName=velero
time="2024-09-24T07:31:14Z" level=info msg="Waiting for CSI driver to reconcile volumesnapshot goco-uat-mongodb/velero-data-volume-goco-uat-mongodb-0-cffqg. Retrying in 5s" backup=velero/goco-uat-backup cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:713" pluginName=velero
time="2024-09-24T07:31:19Z" level=info msg="Waiting for CSI driver to reconcile volumesnapshot goco-uat-mongodb/velero-data-volume-goco-uat-mongodb-0-cffqg. Retrying in 5s" backup=velero/goco-uat-backup cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:713" pluginName=velero
time="2024-09-24T07:31:24Z" level=info msg="plugin process exited" backupLocation=velero/default cmd=/plugins/velero-plugin-for-aws controller=backup-sync id=344 logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:80" plugin=/plugins/velero-plugin-for-aws
time="2024-09-24T07:31:24Z" level=info msg="Waiting for CSI driver to reconcile volumesnapshot goco-uat-mongodb/velero-data-volume-goco-uat-mongodb-0-cffqg. Retrying in 5s" backup=velero/goco-uat-backup cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:713" pluginName=velero
time="2024-09-24T07:31:27Z" level=info msg="plugin process exited" cmd=/plugins/velero-plugin-for-aws controller=download-request downloadRequest=velero/goco-uat-backup-90536fad-bc0e-44e2-be64-d34fa83bc23c id=361 logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:80" plugin=/plugins/velero-plugin-for-aws
time="2024-09-24T07:31:29Z" level=info msg="Waiting for CSI driver to reconcile volumesnapshot goco-uat-mongodb/velero-data-volume-goco-uat-mongodb-0-cffqg. Retrying in 5s" backup=velero/goco-uat-backup cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:713" pluginName=velero
time="2024-09-24T07:31:34Z" level=info msg="Validating BackupStorageLocation" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:141"
time="2024-09-24T07:31:34Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:126"
time="2024-09-24T07:31:34Z" level=info msg="plugin process exited" backup-storage-location=velero/default cmd=/plugins/velero-plugin-for-aws controller=backup-storage-location id=377 logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:80" plugin=/plugins/velero-plugin-for-aws
time="2024-09-24T07:31:34Z" level=info msg="Waiting for CSI driver to reconcile volumesnapshot goco-uat-mongodb/velero-data-volume-goco-uat-mongodb-0-cffqg. Retrying in 5s" backup=velero/goco-uat-backup cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:713" pluginName=velero
time="2024-09-24T07:31:39Z" level=info msg="Waiting for CSI driver to reconcile volumesnapshot goco-uat-mongodb/velero-data-volume-goco-uat-mongodb-0-cffqg. Retrying in 5s" backup=velero/goco-uat-backup cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:713" pluginName=velero
time="2024-09-24T07:31:44Z" level=info msg="Waiting for CSI driver to reconcile volumesnapshot goco-uat-mongodb/velero-data-volume-goco-uat-mongodb-0-cffqg. Retrying in 5s" backup=velero/goco-uat-backup cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:713" pluginName=velero

After the backup is created, it creates a VolumeSnapshot for the PVC but then gets stuck at "Waiting for CSI driver to reconcile volumesnapshot". I'm not sure whether a plugin is missing or it's something else.
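While the backup is still InProgress, the VolumeSnapshot's own status shows which component is stalling. As a rough rule of thumb (an assumption based on how the snapshot controller works, not something Velero reports): no bound VolumeSnapshotContent usually means nothing is reconciling VolumeSnapshots, while a bound content that never becomes readyToUse points at the CSI driver or its csi-snapshotter sidecar. The namespace is taken from this thread; the helper names are hypothetical:

```bash
#!/usr/bin/env bash
# Sketch: inspect the VolumeSnapshots Velero created while the backup runs.
# Namespace taken from this thread; helper names are hypothetical.

NS="${NS:-goco-uat-mongodb}"

list_snapshots() {
  kubectl -n "$NS" get volumesnapshot
}

# Prints "<boundContent>|<readyToUse>|<errorMessage>" for one VolumeSnapshot.
snapshot_status() {
  kubectl -n "$NS" get volumesnapshot "$1" -o \
    jsonpath='{.status.boundVolumeSnapshotContentName}|{.status.readyToUse}|{.status.error.message}'
}

# Pure helper: rough interpretation of the three status fields.
classify_snapshot() {
  local bound ready err
  IFS='|' read -r bound ready err <<<"$1"
  if [[ -n "$err" ]]; then
    echo "error: $err"
  elif [[ -z "$bound" ]]; then
    echo "no content bound: check the snapshot controller"
  elif [[ "$ready" != "true" ]]; then
    echo "content $bound not ready: check the CSI driver / csi-snapshotter"
  else
    echo "ready"
  fi
}
```

For example, classify_snapshot "$(snapshot_status velero-data-volume-goco-uat-mongodb-0-cffqg)".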

velero backup describe goco-uat-backup

Name:         goco-uat-backup
Namespace:    velero
Labels:       velero.io/storage-location=default
Annotations:  velero.io/resource-timeout=10m0s
              velero.io/source-cluster-k8s-gitversion=v1.29.5+k3s1
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=29

Phase:  InProgress

Namespaces:
  Included:  goco-uat-mongodb
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  <none>

Or label selector:  <none>

Storage Location:  default

Velero-Native Snapshot PVs:  auto
Snapshot Move Data:          false
Data Mover:                  velero

TTL:  720h0m0s

CSISnapshotTimeout:    10m0s
ItemOperationTimeout:  4h0m0s

Hooks:  <none>

Backup Format Version:  1.1.0

Started:    2024-09-24 08:28:29 +0100 BST
Completed:  <n/a>

Expiration:  2024-10-24 08:28:28 +0100 BST

Estimated total items to be backed up:  60
Items backed up so far:                 0

Backup Volumes:
  Velero-Native Snapshots: <none included>

  CSI Snapshots: <none included or not detectable>

  Pod Volume Backups: <none included>
blackpiglet commented 2 months ago

I assume your Kubernetes environment already supports CSI snapshots. If CSI snapshots work, then waiting for the CSI driver to reconcile the VolumeSnapshot is expected: Velero keeps polling, up to a timeout, to make sure the VolumeSnapshot's underlying snapshot was created correctly and that the VolumeSnapshot's status is ReadyToUse before the backup completes.
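The wait described here behaves like a poll-until-ready loop bounded by the backup's CSISnapshotTimeout (10m0s in the describe output above). With the 5 s retry interval seen in the logs, that is roughly 600 s / 5 s = 120 checks before Velero gives up on a VolumeSnapshot. A simplified sketch of such a loop, not Velero's code; wait_until_ready and vs_ready are hypothetical names. If a driver is merely slow, the timeout can be raised per backup with velero backup create --csi-snapshot-timeout.

```bash
#!/usr/bin/env bash
# Simplified sketch of a poll-until-ready loop bounded by a timeout.
# This is NOT Velero's implementation; names are hypothetical.

# Usage: wait_until_ready <interval_s> <timeout_s> <check_cmd> [args...]
# Runs check_cmd every interval_s seconds until it succeeds or timeout_s elapses.
wait_until_ready() {
  local interval="$1" timeout="$2"; shift 2
  local waited=0
  until "$@"; do
    if (( waited >= timeout )); then
      echo "timed out after ${waited}s" >&2
      return 1
    fi
    echo "not ready yet, retrying in ${interval}s" >&2
    sleep "$interval"
    waited=$(( waited + interval ))
  done
  echo "ready after ${waited}s"
}

# Example check: is a VolumeSnapshot ReadyToUse? Args: <namespace> <name>
vs_ready() {
  [[ "$(kubectl -n "$1" get volumesnapshot "$2" -o jsonpath='{.status.readyToUse}')" == "true" ]]
}
```

For example, wait_until_ready 5 600 vs_ready goco-uat-mongodb velero-data-volume-goco-uat-mongodb-0-cffqg mirrors the 5 s / 10 min settings seen in this thread.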

mehransaeed7810 commented 1 month ago

So I let the job finish, but it failed, and I can see the volumes haven't been backed up at all. While it was running, it kept logging "Waiting for CSI driver to reconcile volumesnapshot".

velero backup describe goco-uat-mongodb
Name:         goco-uat-mongodb
Namespace:    velero
Labels:       velero.io/storage-location=default
Annotations:  velero.io/resource-timeout=10m0s
              velero.io/source-cluster-k8s-gitversion=v1.29.5+k3s1
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=29

Phase:  PartiallyFailed (run `velero backup logs goco-uat-mongodb` for more information)

Errors:
  Velero:    name: /goco-uat-mongodb-0 message: /Error backing up item error: /error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=goco-uat-mongodb, name=velero-data-volume-goco-uat-mongodb-0-gq2pc): rpc error: code = Unknown desc = failed to get volumesnapshot goco-uat-mongodb/velero-data-volume-goco-uat-mongodb-0-gq2pc: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline
             message: /Timed out awaiting reconciliation of volumesnapshot goco-uat-mongodb/velero-data-volume-goco-uat-mongodb-1-trtrz
             name: /goco-uat-mongodb-1 message: /Error backing up item error: /error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=goco-uat-mongodb, name=velero-data-volume-goco-uat-mongodb-1-trtrz): rpc error: code = Unknown desc = failed to get volumesnapshot goco-uat-mongodb/velero-data-volume-goco-uat-mongodb-1-trtrz: client rate limiter Wait returned an error: context deadline exceeded
             message: /Timed out awaiting reconciliation of volumesnapshot goco-uat-mongodb/velero-logs-volume-goco-uat-mongodb-2-8r922
             name: /goco-uat-mongodb-2 message: /Error backing up item error: /error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=goco-uat-mongodb, name=velero-logs-volume-goco-uat-mongodb-2-8r922): rpc error: code = Unknown desc = failed to get volumesnapshot goco-uat-mongodb/velero-logs-volume-goco-uat-mongodb-2-8r922: client rate limiter Wait returned an error: context deadline exceeded
             message: /Timed out awaiting reconciliation of volumesnapshot goco-uat-mongodb/velero-data-volume-goco-uat-mongodb-2-zn92q
             name: /data-volume-goco-uat-mongodb-2 message: /Error backing up item error: /error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=goco-uat-mongodb, name=velero-data-volume-goco-uat-mongodb-2-zn92q): rpc error: code = Unknown desc = failed to get volumesnapshot goco-uat-mongodb/velero-data-volume-goco-uat-mongodb-2-zn92q: client rate limiter Wait returned an error: context deadline exceeded
             message: /Timed out awaiting reconciliation of volumesnapshot goco-uat-mongodb/velero-logs-volume-goco-uat-mongodb-0-677p6
             name: /logs-volume-goco-uat-mongodb-0 message: /Error backing up item error: /error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=goco-uat-mongodb, name=velero-logs-volume-goco-uat-mongodb-0-677p6): rpc error: code = Unknown desc = failed to get volumesnapshot goco-uat-mongodb/velero-logs-volume-goco-uat-mongodb-0-677p6: client rate limiter Wait returned an error: context deadline exceeded
             message: /Timed out awaiting reconciliation of volumesnapshot goco-uat-mongodb/velero-logs-volume-goco-uat-mongodb-1-n2mq2
             name: /logs-volume-goco-uat-mongodb-1 message: /Error backing up item error: /error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=goco-uat-mongodb, name=velero-logs-volume-goco-uat-mongodb-1-n2mq2): rpc error: code = Unknown desc = failed to get volumesnapshot goco-uat-mongodb/velero-logs-volume-goco-uat-mongodb-1-n2mq2: client rate limiter Wait returned an error: context deadline exceeded
  Cluster:    <none>
  Namespaces: <none>

Namespaces:
  Included:  goco-uat-mongodb
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  <none>

Or label selector:  <none>

Storage Location:  default

Velero-Native Snapshot PVs:  auto
Snapshot Move Data:          false
Data Mover:                  velero

TTL:  720h0m0s

CSISnapshotTimeout:    10m0s
ItemOperationTimeout:  4h0m0s

Hooks:  <none>

Backup Format Version:  1.1.0

Started:    2024-09-26 12:08:12 +0100 BST
Completed:  2024-09-26 13:08:26 +0100 BST

Expiration:  2024-10-26 12:08:12 +0100 BST

Total items to be backed up:  74
Items backed up:              74

Backup Volumes:
  Velero-Native Snapshots: <none included>

  CSI Snapshots: <none included>

  Pod Volume Backups: <none included>

As shown above, Backup Volumes says "none included". Does that mean it's still not processing volumes at all?

I can see in the logs that it eventually timed out awaiting reconciliation of the VolumeSnapshot:

time="2024-09-26T11:58:06Z" level=info msg="Waiting for CSI driver to reconcile volumesnapshot goco-uat-mongodb/velero-logs-volume-goco-uat-mongodb-0-677p6. Retrying in 5s" backup=velero/goco-uat-mongodb cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:713" pluginName=velero
time="2024-09-26T11:58:11Z" level=info msg="Waiting for CSI driver to reconcile volumesnapshot goco-uat-mongodb/velero-logs-volume-goco-uat-mongodb-0-677p6. Retrying in 5s" backup=velero/goco-uat-mongodb cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:713" pluginName=velero
time="2024-09-26T11:58:16Z" level=error msg="Timed out awaiting reconciliation of volumesnapshot goco-uat-mongodb/velero-logs-volume-goco-uat-mongodb-0-677p6" backup=velero/goco-uat-mongodb cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:767" pluginName=velero
time="2024-09-26T11:58:16Z" level=info msg="Deleting Volumesnapshot goco-uat-mongodb/velero-logs-volume-goco-uat-mongodb-0-677p6" backup=velero/goco-uat-mongodb cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:486" pluginName=velero
time="2024-09-26T11:58:16Z" level=info msg="Deleted volumesnapshot with volumesnapshotContent goco-uat-mongodb/velero-logs-volume-goco-uat-mongodb-0-677p6" backup=velero/goco-uat-mongodb cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:514" pluginName=velero
time="2024-09-26T11:58:16Z" level=info msg="1 errors encountered backup up item" backup=velero/goco-uat-mongodb logSource="pkg/backup/backup.go:507" name=logs-volume-goco-uat-mongodb-0
time="2024-09-26T11:58:16Z" level=error msg="Error backing up item" backup=velero/goco-uat-mongodb error="error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=goco-uat-mongodb, name=velero-logs-volume-goco-uat-mongodb-0-677p6): rpc error: code = Unknown desc = failed to get volumesnapshot goco-uat-mongodb/velero-logs-volume-goco-uat-mongodb-0-677p6: client rate limiter Wait returned an error: context deadline exceeded" logSource="pkg/backup/backup.go:511" name=logs-volume-goco-uat-mongodb-0
time="2024-09-26T11:58:16Z" level=info msg="Backed up 19 items out of an estimated total of 69 (estimate will change throughout the backup)" backup=velero/goco-uat-mongodb logSource="pkg/backup/backup.go:452" name=logs-volume-goco-uat-mongodb-0 namespace=goco-uat-mongodb progress= resource=persistentvolumeclaim
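The run above took about an hour (12:08:12 to 13:08:26 BST) and the error list names six VolumeSnapshots, which appears consistent with each one being waited on for the full 10-minute CSI snapshot timeout in turn. To list just the VolumeSnapshots that timed out, one option is to filter the backup logs; a small sketch, where timed_out_snapshots is a hypothetical helper whose pattern is taken from the log lines above:

```bash
#!/usr/bin/env bash
# Sketch: list VolumeSnapshots that hit the CSI snapshot timeout.
# Reads Velero log lines on stdin, e.g.:
#   velero backup logs goco-uat-mongodb | timed_out_snapshots
timed_out_snapshots() {
  # Keep only "<namespace>/<name>" from the timeout messages, de-duplicated.
  sed -n 's/.*Timed out awaiting reconciliation of volumesnapshot \([^" ]*\).*/\1/p' | sort -u
}
```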
blackpiglet commented 1 month ago

Thanks for your feedback. I think this is related to the Backblaze CSI function. Could you check the status of the created VolumeSnapshot after the backup fails?

It would also help to check the logs of the CSI external-snapshotter and the snapshot controller to understand what went wrong.
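Because Velero deletes the VolumeSnapshot once it times out (see the "Deleting Volumesnapshot" log line above), its status is easiest to capture while the backup is still InProgress. A hedged sketch for collecting the relevant status and logs; the snapshot-controller location (kube-system) follows the external-snapshotter default manifests, while the vSphere CSI controller namespace, Deployment and container names (vmware-system-csi, vsphere-csi-controller, csi-snapshotter) are assumptions that depend on how the driver was installed, so adjust them to your cluster:

```bash
#!/usr/bin/env bash
# Sketch: collect evidence while the backup is still InProgress.
# Namespace/deployment/container names below are assumptions; override as needed.
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
  if [[ "${DRY_RUN:-0}" == "1" ]]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

APP_NS="${APP_NS:-goco-uat-mongodb}"
SNAPCTL_NS="${SNAPCTL_NS:-kube-system}"
CSI_NS="${CSI_NS:-vmware-system-csi}"
CSI_DEPLOY="${CSI_DEPLOY:-vsphere-csi-controller}"

collect_snapshot_debug() {
  # Status and events of all VolumeSnapshots in the app namespace, plus contents.
  run kubectl -n "$APP_NS" describe volumesnapshot
  run kubectl get volumesnapshotcontent
  # Common snapshot controller (reconciles VolumeSnapshot objects).
  run kubectl -n "$SNAPCTL_NS" logs deployment/snapshot-controller --tail=200
  # csi-snapshotter sidecar running next to the CSI driver's controller.
  run kubectl -n "$CSI_NS" logs "deployment/$CSI_DEPLOY" -c csi-snapshotter --tail=200
}
```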