vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Data mover: when a data move fails, the created disk (and snapshot) are not cleaned #8135

Open Gui13 opened 3 weeks ago

Gui13 commented 3 weeks ago

What steps did you take and what happened:

We are using Velero 1.14.0 on Azure AKS with the data mover feature, and we are suffering from bug #7898: all our backups end up PartiallyFailed because the data move is canceled.

We are waiting for 1.14.1, but in the meantime we have an ongoing issue with a side effect of the bug: when the data mover job fails, it does not clean up the managed disks it created for the data move.

This is causing a runaway cost increase, because we currently have 2000+ provisioned disks that are never cleaned up and keep incurring daily costs.

Typical output in the error section of the Velero backup looks like this:

message: /DataUpload is canceled, plugin: velero.io/csi-pvc-backupper
message: /DataUpload is canceled, plugin: velero.io/csi-pvc-backupper
message: /DataUpload is canceled, plugin: velero.io/csi-pvc-backupper
message: /DataUpload is canceled, plugin: velero.io/csi-pvc-backupper
[ repeated 163 times in our case ]

Although the DataUpload is "Canceled", the remaining resources are not cleaned up, and deleting the faulty backup does not release the created disks. We tried velero delete backup <FaultyBackupWith163LingeringDisk>, but that didn't release them either (although the backup itself was correctly deleted).

We are removing these disks manually right now, but I think the Velero data mover should clean them up as a last resort, even if the data move failed.
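
For illustration, a rough sketch of this kind of manual cleanup (assumptions: jq is available, the intermediate PVCs live in the velero namespace, and the leftover PVs are in the Released phase; this is not an official Velero command, so double-check the selected PVs before patching anything):

```bash
# Find Released PVs whose claim came from the velero namespace (the
# intermediate PVCs created by the data mover) and flip their reclaim policy
# to Delete so the CSI driver removes the PV and its backing Azure disk.
kubectl get pv -o json \
  | jq -r '.items[]
      | select(.status.phase == "Released" and .spec.claimRef.namespace == "velero")
      | .metadata.name' \
  | while read -r pv; do
      kubectl patch pv "$pv" \
        -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'
    done
```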

What did you expect to happen:

When the data mover fails, it should try to clean up the lingering resources it created (snapshots and disks).

Or at least, deleting the Failed (or PartiallyFailed) backup should clean up the resources.

Anything else you would like to add:

We have tried the yet-to-be-released release-1.14-dev branch to see whether the #7898 "/DataUpload is canceled" issue was fixed, and it was, so that's a good sign. I think you should release a 1.14.1 quickly for the people in the same situation as us.

Environment:

Client Version: v1.29.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.27.16

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

Lyndon-Li commented 3 weeks ago

Although the DataUpload is "Canceled", the remaining resources are not cleaned up.

Which remaining resources can you see?

Lyndon-Li commented 3 weeks ago

Please share the velero debug bundle by running velero debug
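
For example, something like this (a sketch; it assumes your CLI version supports scoping the bundle to one backup with --backup):

```bash
# Collect a debug bundle scoped to the failing backup
velero debug --backup <backup-name>
```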

Gui13 commented 3 weeks ago

Although the DataUpload is "Canceled", the remaining resources are not cleaned up.

Which remaining resources can you see?

We can see loads and loads of Azure disks.

We have 163 actual, in-use disk PVCs in our cluster, and each new Velero backup creates an additional 163 PVCs to do the data upload. At the time I created this issue, we had 2577 "unwanted" PVCs in our cluster, which is roughly 15 days of Velero backups failing at our rate. We have since deleted all of them.

From a look at the Velero DataUpload spec, Velero does the following when backing up (the intermediate objects can be observed as sketched after this list):

  1. it creates a snapshot with the CSI driver for each volume of type disk,
  2. for each snapshot, it creates a disk derived from the snapshot,
  3. when the disk is created, it then removes the snapshot // <--- this is working for us right now
  4. it mounts the disk on a node of the cluster // <--- this used to fail in our cluster until you fixed #7898 and we switched to the release-1.14-dev docker image for our agents; this is where the disks are left behind, IMHO
  5. once mounted, it starts "snapshotting" (in kopia terms) this disk to S3 ("Blob Storage" in Azure terms).
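
To make the steps above concrete, here is a hedged sketch of how the intermediate objects can be watched while a backup runs (assuming Velero runs in the velero namespace and the source PVCs are in monitoring, as in our case):

```bash
# Watch the DataUploads and their phases during the backup
kubectl -n velero get datauploads.velero.io -w

# The intermediate PVC/PV pairs created by the data mover (step 2) are claimed
# from the velero namespace
kubectl -n velero get pvc
kubectl get pv | grep 'velero/'

# The temporary CSI VolumeSnapshots taken from the source PVCs (step 1)
kubectl -n monitoring get volumesnapshots.snapshot.storage.k8s.io
```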

I'm running velero debug right now, but after a quick test on my personal cluster I see that it retrieves a lot of information. I'm not sure I'm allowed to share all of it publicly, since it is our client's cluster. Do you have a way to scrub the logs of personal information, or to transmit the bundle privately? (Also: it is 390 MiB.)

blackpiglet commented 3 weeks ago

The velero debug CLI collects the Velero-related k8s resources. There is no easy way to erase sensitive information.

Could you post one of the DataUpload's YAML content here instead?

I suspect insufficient memory on the node-agent pod caused the DataUpload to be canceled: the node-agent restarted due to OOM, and the DataUploads were marked as Canceled on node-agent pod restart.
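
One way to check that (a sketch; the node-agent label selector may differ depending on how Velero was installed):

```bash
# Show restart counts and the reason for the last termination (e.g. OOMKilled)
# of the node-agent pods
kubectl -n velero get pods -l name=node-agent \
  -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount,LAST_REASON:.status.containerStatuses[0].lastState.terminated.reason
```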

Lyndon-Li commented 3 weeks ago

@Gui13 If you still have the leftover PVCs created by Velero, please check this: describe the PVC, check the deletionTimestamp and finalizers, and share their values with us.

If either of them is not empty, it means the PVC was indeed deleted but something blocked the deletion. If both are empty, it looks like a problem in Velero's DU cancel mechanism.
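
For example (a sketch, with <leftover-pvc> standing in for one of the leftover PVC names):

```bash
# Print the PVC's deletionTimestamp and finalizers (empty output means unset)
kubectl -n velero get pvc <leftover-pvc> \
  -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'
```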

Please also share one of the DUs' YAML, as @blackpiglet mentioned.

Gui13 commented 2 weeks ago

Hi @Lyndon-Li

Here is a backup that failed 15 days ago:

The backup describe:

```
~ velero backup describe velero-loki-20240811000043 --details
Name:         velero-loki-20240811000043
Namespace:    velero
Labels:       app.kubernetes.io/instance=velero
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=velero
              helm.sh/chart=velero-7.1.4
              velero.io/schedule-name=velero-loki
              velero.io/storage-location=default
Annotations:  meta.helm.sh/release-name=velero
              meta.helm.sh/release-namespace=velero
              velero.io/resource-timeout=10m0s
              velero.io/source-cluster-k8s-gitversion=v1.27.16
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=27

Phase:  PartiallyFailed (run velero backup logs velero-loki-20240811000043 for more information)

Errors:
  Velero:
  Cluster:
  Namespaces:

Namespaces:
  Included:  monitoring
  Excluded:

Resources:
  Included:        *
  Excluded:
  Cluster-scoped:  auto

Label selector:  app=loki

Or label selector:

Storage Location:  default

Velero-Native Snapshot PVs:  auto
Snapshot Move Data:          true
Data Mover:                  velero

TTL:  17568h0m0s

CSISnapshotTimeout:    10m0s
ItemOperationTimeout:  4h0m0s

Hooks:

Backup Format Version:  1.1.0

Started:    2024-08-11 02:00:44 +0200 CEST
Completed:  2024-08-11 02:01:10 +0200 CEST

Expiration:  2026-08-13 02:00:43 +0200 CEST

Total items to be backed up:  21
Items backed up:              21

Backup Item Operations:
  Operation for persistentvolumeclaims monitoring/storage-logging-loki-0:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-67757235-06b4-424a-a64b-5fd59b93e421.e4334b14-893c-44a2dc9bb
    Items to Update:
                                datauploads.velero.io velero/velero-loki-20240811000043-lw8wz
    Phase:                      Failed
    Operation Error:            DataUpload is canceled
    Progress description:       Canceled
    Created:                    2024-08-11 02:00:53 +0200 CEST
    Started:                    2024-08-11 02:00:58 +0200 CEST
    Updated:                    2024-08-11 02:00:58 +0200 CEST

Resource List:
  apps/v1/ControllerRevision:
    - monitoring/logging-loki-666f698c9c
    - monitoring/logging-loki-78bc8885cf
    - monitoring/logging-loki-85447f888
  apps/v1/StatefulSet:
    - monitoring/logging-loki
  discovery.k8s.io/v1/EndpointSlice:
    - monitoring/logging-loki-headless-rq5dn
    - monitoring/logging-loki-m6tlz
    - monitoring/logging-loki-memberlist-4kgrk
  rbac.authorization.k8s.io/v1/Role:
    - monitoring/logging-loki
  rbac.authorization.k8s.io/v1/RoleBinding:
    - monitoring/logging-loki
  v1/Endpoints:
    - monitoring/logging-loki
    - monitoring/logging-loki-headless
    - monitoring/logging-loki-memberlist
  v1/Namespace:
    - monitoring
  v1/PersistentVolume:
    - pvc-e4334b14-893c-44a5-9dfa-c41bc3a63d88
  v1/PersistentVolumeClaim:
    - monitoring/storage-logging-loki-0
  v1/Pod:
    - monitoring/logging-loki-0
  v1/Secret:
    - monitoring/logging-loki
  v1/Service:
    - monitoring/logging-loki
    - monitoring/logging-loki-headless
    - monitoring/logging-loki-memberlist
  v1/ServiceAccount:
    - monitoring/logging-loki

Backup Volumes:
  Velero-Native Snapshots:

  CSI Snapshots:
    monitoring/storage-logging-loki-0:
      Data Movement:
        Operation ID:   du-67757235-06b4-424a-a64b-5fd59b93e421.e4334b14-893c-44a2dc9bb
        Data Mover:     velero
        Uploader Type:  kopia

  Pod Volume Backups:

HooksAttempted:  0
HooksFailed:     0
```

The data upload:

```yaml
➜ ~ kubectl get datauploads velero-loki-20240811000043-lw8wz -n velero -o yaml
apiVersion: velero.io/v2alpha1
kind: DataUpload
metadata:
  creationTimestamp: "2024-08-11T00:00:53Z"
  generateName: velero-loki-20240811000043-
  generation: 3
  labels:
    velero.io/accepted-by: aks-default-97945801-vmss00001c
    velero.io/async-operation-id: du-67757235-06b4-424a-a64b-5fd59b93e421.e4334b14-893c-44a2dc9bb
    velero.io/backup-name: velero-loki-20240811000043
    velero.io/backup-uid: 67757235-06b4-424a-a64b-5fd59b93e421
    velero.io/pvc-uid: e4334b14-893c-44a5-9dfa-c41bc3a63d88
  name: velero-loki-20240811000043-lw8wz
  namespace: velero
  ownerReferences:
  - apiVersion: velero.io/v1
    controller: true
    kind: Backup
    name: velero-loki-20240811000043
    uid: 67757235-06b4-424a-a64b-5fd59b93e421
  resourceVersion: "326785784"
  uid: 0b76aef5-6016-42d9-9082-7db360c73937
spec:
  backupStorageLocation: default
  csiSnapshot:
    snapshotClass: csi-azure-disk
    storageClass: managed-premium-lrs
    volumeSnapshot: velero-storage-logging-loki-0-cmj2s
  operationTimeout: 10m0s
  snapshotType: CSI
  sourceNamespace: monitoring
  sourcePVC: storage-logging-loki-0
status:
  completionTimestamp: "2024-08-11T00:00:58Z"
  message: 'found a dataupload velero/velero-loki-20240811000043-lw8wz with expose
    error: Pod is unschedulable: 0/44 nodes are available: pod has unbound immediate
    PersistentVolumeClaims. preemption: 0/44 nodes are available: 44 Preemption is
    not helpful for scheduling... mark it as cancel'
  phase: Canceled
  progress: {}
  startTimestamp: "2024-08-11T00:00:58Z"
```

The remaining PV (not a PVC, just a PV):

The PV description:

```yaml
~ kubectl get pv pvc-116fa2fa-c49c-4975-a029-655ac58b404a -A -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: disk.csi.azure.com
    volume.kubernetes.io/provisioner-deletion-secret-name: ""
    volume.kubernetes.io/provisioner-deletion-secret-namespace: ""
  creationTimestamp: "2024-08-11T00:01:00Z"
  finalizers:
  - kubernetes.io/pv-protection
  name: pvc-116fa2fa-c49c-4975-a029-655ac58b404a
  resourceVersion: "326785886"
  uid: 68cb0368-8796-4644-9f28-88671800c7a9
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 256Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: velero-loki-20240811000043-lw8wz
    namespace: velero
    resourceVersion: "326785771"
    uid: 116fa2fa-c49c-4975-a029-655ac58b404a
  csi:
    driver: disk.csi.azure.com
    volumeAttributes:
      cachingmode: ReadOnly
      csi.storage.k8s.io/pv/name: pvc-116fa2fa-c49c-4975-a029-655ac58b404a
      csi.storage.k8s.io/pvc/name: velero-loki-20240811000043-lw8wz
      csi.storage.k8s.io/pvc/namespace: velero
      kind: Managed
      requestedsizegib: "256"
      storage.kubernetes.io/csiProvisionerIdentity: 1723070305729-221-disk.csi.azure.com
      storageaccounttype: Premium_LRS
    volumeHandle: /subscriptions/b7f4b112-855a-4cf5-9485-a4733fc136ea/resourceGroups/braincube-prod-aks/providers/Microsoft.Compute/disks/pvc-116fa2fa-c49c-4975-a029-655ac58b404a
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.disk.csi.azure.com/zone
          operator: In
          values:
          - ""
  persistentVolumeReclaimPolicy: Retain
  storageClassName: managed-premium-lrs
  volumeMode: Filesystem
status:
  phase: Released
```

As you can see, the PVC doesn't exist anymore, but the DataUpload is still lingering around. The PV has no deletionTimestamp, but it inherited our default "Retain" reclaim policy. However, this does not seem to be a problem for the other PVs when the DataUpload completes correctly (they are still removed).

You could probably add a phase to the DataUpload ("CancelCleanup") when you want to cancel it, so that it tries to remove the lingering PV and then transitions to the "Canceled" phase.

Lyndon-Li commented 2 weeks ago

This is still not expected behavior, because before deleting the PVC, the PV's reclaim policy is set to Delete. So the case where the PVC is successfully deleted but the PV is still kept with Retain as its persistentVolumeReclaimPolicy should not happen.

So we still need the debug log to see what happened. If you are not able to share all the logs, you can just filter the Error and Warning logs.
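
For example (a sketch; adjust the backup name and resource names to your install):

```bash
# Keep only the error/warning lines from the backup, server, and node-agent logs
velero backup logs velero-loki-20240811000043 | grep -Ei 'level=(error|warning)'
kubectl -n velero logs deploy/velero | grep -Ei 'level=(error|warning)'
# ds/node-agent picks one pod of the daemonset; repeat per pod if needed
kubectl -n velero logs ds/node-agent | grep -Ei 'level=(error|warning)'
```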

Gui13 commented 2 weeks ago

If I can catch another instance of this problem I'll get you the logs. Right now, with the 1.14.1-rc we don't have the issue anymore.