vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0
8.66k stars 1.39k forks source link

problem removing protection finalizer (Related to issue #7068) #7433

Closed gh-tek closed 7 months ago

gh-tek commented 8 months ago

What steps did you take and what happened: Backing up using volume snapshot with data movement to S3 storage, backup ends "PartiallyFailed". Snapshot data is not moved (fails on expose already), snapshot content resources are set to be deleted, but they are not removed, because removal of protection finalizer fails, preventing removal.

Error msg states that expose fails, protection removal actually tries to set wrong (unrelated) field spec.SourceVolumeMode on snapshot content (clip from log below point's to the code trying that removal, it's pretty recent code)

What did you expect to happen: There should not be change attempts on immutable fields ever, I guess that information is lost somewhere, or expected that it is unset (in my case it is 'Filesystem', not nil). Protection removal should succeed, snapshot content be removed and backup end with success.

The following information will help us better understand what's going on:

Debug bundle has too much information to be put out publicly, but here is node-agent log containing full error:

time="2024-02-15T07:57:05Z" level=info msg="Reconcile testbackup-kjfdv" controller=dataupload dataupload=velero/testbackup-kjfdv logSource="pkg/controller/data_upload_controller.go:113"
time="2024-02-15T07:57:05Z" level=warning msg="failed to add finalizer for velero/testbackup-kjfdv and will requeue later" controller=dataupload dataupload=velero/testbackup-kjfdv logSource="pkg/controller/data_upload_controller.go:144"
time="2024-02-15T07:57:05Z" level=info msg="Reconcile testbackup-kjfdv" controller=dataupload dataupload=velero/testbackup-kjfdv logSource="pkg/controller/data_upload_controller.go:113"
time="2024-02-15T07:57:05Z" level=info msg="Data upload starting" controller=dataupload dataupload=velero/testbackup-kjfdv logSource="pkg/controller/data_upload_controller.go:161"
time="2024-02-15T07:57:05Z" level=info msg="Accepting data upload testbackup-kjfdv" logSource="pkg/controller/data_upload_controller.go:706"
time="2024-02-15T07:57:05Z" level=info msg="This datauplod has been accepted by others" Dataupload=testbackup-kjfdv logSource="pkg/controller/data_upload_controller.go:735"
time="2024-02-15T07:57:05Z" level=info msg="Reconcile testbackup-kjfdv" controller=dataupload dataupload=velero/testbackup-kjfdv logSource="pkg/controller/data_upload_controller.go:113"
time="2024-02-15T07:57:08Z" level=info msg="Reconcile testbackup-kjfdv" controller=dataupload dataupload=velero/testbackup-kjfdv logSource="pkg/controller/data_upload_controller.go:113"
time="2024-02-15T07:57:08Z" level=info msg="Reconcile testbackup-kjfdv" controller=dataupload dataupload=velero/testbackup-kjfdv logSource="pkg/controller/data_upload_controller.go:113"
time="2024-02-15T07:57:10Z" level=info msg="Reconcile testbackup-wpltz" controller=dataupload dataupload=velero/testbackup-wpltz logSource="pkg/controller/data_upload_controller.go:113"
time="2024-02-15T07:57:10Z" level=info msg="Data upload starting" controller=dataupload dataupload=velero/testbackup-wpltz logSource="pkg/controller/data_upload_controller.go:161"
time="2024-02-15T07:57:10Z" level=info msg="Accepting data upload testbackup-wpltz" logSource="pkg/controller/data_upload_controller.go:706"
time="2024-02-15T07:57:10Z" level=info msg="This datauplod has been accepted by ibuki01" Dataupload=testbackup-wpltz logSource="pkg/controller/data_upload_controller.go:731"
time="2024-02-15T07:57:10Z" level=info msg="Data upload is accepted" controller=dataupload dataupload=velero/testbackup-wpltz logSource="pkg/controller/data_upload_controller.go:173"
time="2024-02-15T07:57:10Z" level=info msg="Exposing CSI snapshot" logSource="pkg/exposer/csi_snapshot.go:101" owner=testbackup-wpltz
time="2024-02-15T07:57:10Z" level=info msg="Volumesnapshot is ready" logSource="pkg/exposer/csi_snapshot.go:108" owner=testbackup-wpltz
time="2024-02-15T07:57:10Z" level=info msg="Got VSC from VS in namespace postgres" logSource="pkg/exposer/csi_snapshot.go:115" owner=testbackup-wpltz vs name=velero-pgdata-pg-passbolt-0-s2mn2 vsc name=snapcontent-496987f0-cd44-445e-a680-962ba2dd65f3
time="2024-02-15T07:57:10Z" level=info msg="Finished to retain VSC" logSource="pkg/exposer/csi_snapshot.go:122" owner=testbackup-wpltz retained=true vsc name=snapcontent-496987f0-cd44-445e-a680-962ba2dd65f3
time="2024-02-15T07:57:13Z" level=info msg="VS is deleted in namespace postgres" logSource="pkg/exposer/csi_snapshot.go:135" owner=testbackup-wpltz vs name=velero-pgdata-pg-passbolt-0-s2mn2
1.7079838330763552e+09  ERROR   Reconciler error    {"controller": "dataupload", "controllerGroup": "velero.io", "controllerKind": "DataUpload", "dataUpload": {"name":"testbackup-wpltz","namespace":"velero"}, "namespace": "velero", "name": "testbackup-wpltz", "reconcileID": "c7239a65-f7e3-43ee-9ca3-5eeaa410cc94", "error": "error to remove protect from volume snapshot content: error to update VolumeSnapshotContent snapcontent-496987f0-cd44-445e-a680-962ba2dd65f3: admission webhook \"snapshot-validation-webhook.snapshot.storage.k8s.io\" denied the request: Spec.SourceVolumeMode is immutable but was changed from Filesystem to nil", "errorVerbose": "admission webhook \"snapshot-validation-webhook.snapshot.storage.k8s.io\" denied the request: Spec.SourceVolumeMode is immutable but was changed from Filesystem to nil\nerror to update VolumeSnapshotContent snapcontent-496987f0-cd44-445e-a680-962ba2dd65f3\ngithub.com/vmware-tanzu/velero/pkg/util/csi.RemoveVSCProtect.func1\n\t/go/src/github.com/vmware-tanzu/velero/pkg/util/csi/volume_snapshot.go:170\nk8s.io/apimachinery/pkg/util/wait.PollImmediate.ConditionFunc.WithContext.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.25.6/pkg/util/wait/wait.go:222\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.25.6/pkg/util/wait/wait.go:235\nk8s.io/apimachinery/pkg/util/wait.poll\n\t/go/pkg/mod/k8s.io/apimachinery@v0.25.6/pkg/util/wait/wait.go:582\nk8s.io/apimachinery/pkg/util/wait.PollImmediateWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.25.6/pkg/util/wait/wait.go:528\nk8s.io/apimachinery/pkg/util/wait.PollImmediate\n\t/go/pkg/mod/k8s.io/apimachinery@v0.25.6/pkg/util/wait/wait.go:514\ngithub.com/vmware-tanzu/velero/pkg/util/csi.RemoveVSCProtect\n\t/go/src/github.com/vmware-tanzu/velero/pkg/util/csi/volume_snapshot.go:156\ngithub.com/vmware-tanzu/velero/pkg/exposer.(*csiSnapshotExposer).Expose\n\t/go/src/github.com/vmware-tanzu/velero/pkg/exposer/csi_snapshot.go:137\ngithub.com/vmware-tanzu/velero/pkg/controller.(*DataUploadReconciler).Reconcile\n\t/go/src/github.com/vmware-tanzu/velero/pkg/controller/data_upload_controller.go:188\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650\nerror to remove protect from volume snapshot content\ngithub.com/vmware-tanzu/velero/pkg/exposer.(*csiSnapshotExposer).Expose\n\t/go/src/github.com/vmware-tanzu/velero/pkg/exposer/csi_snapshot.go:139\ngithub.com/vmware-tanzu/velero/pkg/controller.(*DataUploadReconciler).Reconcile\n\t/go/src/github.com/vmware-tanzu/velero/pkg/controller/data_upload_controller.go:188\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:273
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:234
time="2024-02-15T07:57:13Z" level=info msg="Reconcile testbackup-wpltz" controller=dataupload dataupload=velero/testbackup-wpltz logSource="pkg/controller/data_upload_controller.go:113"
time="2024-02-15T07:57:13Z" level=info msg="Reconcile testbackup-wpltz" controller=dataupload dataupload=velero/testbackup-wpltz logSource="pkg/controller/data_upload_controller.go:113"

Snapshot content is left in state:

Name:         snapcontent-496987f0-cd44-445e-a680-962ba2dd65f3
Namespace:    
Labels:       <none>
Annotations:  snapshot.storage.kubernetes.io/deletion-secret-name: rook-csi-cephfs-provisioner
              snapshot.storage.kubernetes.io/deletion-secret-namespace: rook-ceph
              snapshot.storage.kubernetes.io/volumesnapshot-being-deleted: yes
API Version:  snapshot.storage.k8s.io/v1
Kind:         VolumeSnapshotContent
Metadata:
  Creation Timestamp:             2024-02-15T07:57:05Z
  Deletion Grace Period Seconds:  0
  Deletion Timestamp:             2024-02-15T07:57:13Z
  Finalizers:
    velero.io/volume-snapshot-content-protect-finalizer

...

Spec:
  Deletion Policy:  Retain
  Driver:           rook-ceph.cephfs.csi.ceph.com
  Source:
    Volume Handle:             0001-0009-rook-ceph-0000000000000001-9abde5e0-6841-49d3-a146-abc0dc3f7e07
  Source Volume Mode:          Filesystem
  Volume Snapshot Class Name:  rook-ceph-snapclass
  Volume Snapshot Ref:
    API Version:       snapshot.storage.k8s.io/v1
    Kind:              VolumeSnapshot
    Name:              velero-pgdata-pg-passbolt-0-s2mn2
    Namespace:         postgres
    Resource Version:  31363782
    UID:               496987f0-cd44-445e-a680-962ba2dd65f3
Status:
  Creation Time:    1707983826107097000
  Ready To Use:     true
  Restore Size:     10737418240
  Snapshot Handle:  0001-0009-rook-ceph-0000000000000001-d352618f-11fb-484f-9406-560161f3a123
Events:             <none>

Anything else you would like to add:

This could well be configuration error on my side (which I haven't figured out yet), but I think that trying to change immutable field on snapshot content resource may be some kind of bug, that should not happen... I'm also using the latest external snapshotted on my cluster, with latest snapshot CRD's.

Also, I have verified that snapshotting works when you use --snapshot-move-data=false. Snapshot is saved and is restorable. But when trying to use default snapshot mover, it fails before moving anything (it fails on expose already).

This is a new install, I have never used Velero before, so I have no experience from older versions. But snapshotting works without data move and also FSB works (but I would rather use snapshotting with data move).

Environment:

Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

gh-tek commented 8 months ago

Shortly after writing this, I figured out that this is related to a rather recent change on external snapshotter. I have version 7.0.1. https://github.com/kubernetes-csi/external-snapshotter/tree/v7.0.1

I found out that there is a parameter --prevent-volume-mode-conversion and its default value has recently changed from false to true as that feature is progressing. That actually forces the validation for the aforementioned field.

I was able to give --prevent-volume-mode-conversion=false and then Velero started working, backup went just fine.

So I guess this may not be a bug, but more of a feature request. Velero probably needs some changes because of that change in snapshotter. I guess it should work with default setting too.

Anyway, backup works for me right now, but if you need help with testing this, I can help with that.

Lyndon-Li commented 8 months ago

This is due to the low SDK version velero is using where Spec.SourceVolumeMode field doesn't exist. So the problem only happens in the environment that uses external snapshotter v7.0 and higher.

ksyblast commented 7 months ago

We have the same issue with storage class driver linstor.csi.linbit.com Operation Error: error to expose snapshot: error to remove protect from volume snapshot content: error to update VolumeSnapshotContent snapcontent-2a34e711-1149-4dd5-a80a-2161bac0fca6: admission webhook "snapshot-validation-webhook.snapshot.storage.k8s.io" denied the request: Spec.SourceVolumeMode is immutable but was changed from Filesystem to nil

However the issue is floating and sometimes backup works. No idea why it's still working from time to time

velero version: v1.12.2, csi addon 0.7.0 K8s version: v1.27.5, bare metal CSI dirver: piraeusdatastore CSI driver v1.3.0, snapshot-controller 7.0.1

ywk253100 commented 7 months ago

Hi @gh-tek @ksyblast, I upgraded several packages used by Velero to fix this issue, but we have no environment with the latest version of the snapshot-controller installed. Do you have any chance to help us verify the fix with the images yinw/velero:lib-bump01 and yinw/velero-plugin-for-csi:lib-bump01?

gh-tek commented 7 months ago

Hi @ywk253100,

Sorry for the delay, gmail desided to throw my emails to Junk and I missed your message :)

I tested your images:

I have been on Velero 1.13, csi plugin 0.7.0 and external snapshotter 6.3.3. Which is a working setup.

I upgraded external snapshotter to 7.0.1 and tested taking backup. It failed as expected. I then switched to the bump -images you gave and retested same backup. They now work flawlessly with newest external snapshotter!

Based on this test, should be ok to merge. I will now go back to old version and wait for actual release. Thanks!

ywk253100 commented 7 months ago

Fixed by https://github.com/vmware-tanzu/velero/pull/7489