Gui13 opened this issue 3 weeks ago
> Although the dataupload is "canceled", the remaining resources are not cleaned.
Which remaining resources can you see?
Please share the velero debug bundle by running `velero debug`.
> Which remaining resources can you see?
We can see loads and loads of Azure disks.
We have 163 actual, used disk PVCs in our cluster, and each new Velero backup creates an additional 163 PVCs to do the data upload. At the time I created this issue, we had 2577 "unwanted" PVCs in our cluster, which is ~15 days of Velero backups failing at our rate. We have deleted all of them since then.
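If it helps anyone counting these, a quick sketch (assumption: the data-mover PVCs land in the `velero` namespace; adjust to your install):

```bash
# Count the PVCs currently sitting in the velero namespace
kubectl get pvc -n velero --no-headers | wc -l
```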
From a look at the Velero DataUpload spec, when backing up, Velero snapshots the volume, exposes the snapshot as a temporary PVC/PV, and uploads the data from there.
We are using the `release-1.14-dev` docker image for our agents, and this is where the disks are left behind, IMHO.

I'm running `velero debug` right now, but after a quick test on my personal cluster, I see that it retrieves a lot of information. I'm not sure I'm allowed to share all of it publicly, since it is our client's cluster. Do you have a way to scrub personal information from the logs, or maybe to transmit them privately? (Also: it is 390 MiB.)
The `velero debug` CLI collects the Velero-related k8s resources. There is no easy way to erase sensitive information.
Could you post the YAML content of one of the DataUploads here instead?
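Something along these lines should dump one (the name is a placeholder):

```bash
# List the DataUploads, then dump one of them as YAML
kubectl get datauploads.velero.io -n velero
kubectl get datauploads.velero.io <dataupload-name> -n velero -o yaml
```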
I suspect insufficient memory on the node-agent pod caused the DataUpload to be canceled: the node-agent restarted due to OOM, and the DataUpload was marked as Canceled on the node-agent pod restart.
@Gui13 Please check this if you still have the leftover PVCs created by Velero: describe the PVC, check the `deletionTimestamp` and `finalizers`, and share their values with us.
If either of them is not empty, it means the PVC was indeed deleted but something blocked it. If they are both empty, it looks like a problem in Velero's DU cancel mechanism.
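For example, something like this prints both fields (PVC name and namespace are placeholders):

```bash
# Print deletionTimestamp and finalizers for a leftover PVC
kubectl get pvc <leftover-pvc> -n <namespace> \
  -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'
```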
Please also share one of the DUs' YAML, as @blackpiglet mentioned.
Hi @Lyndon-Li
Here is a backup that failed 15 days ago:
The remaining PV (not a PVC, just a PV):
As you can see, the PVC doesn't exist, but the DataUpload is still lingering around. The PV has no `deletionTimestamp`, but it inherited our default `Retain` reclaim policy. But this seems not to be a problem for the other PVs when the DataUpload completes correctly (they are still removed).
You could probably add a phase to the DataUpload ("CancelCleanup") when you want to cancel it, so that it tries to remove the lingering PV and then transitions to the "Canceled" phase.
This is still not expected behavior, because before deleting the PVC, the PV's reclaim policy is set to `Delete`. So the case where the PVC is successfully deleted but the PV is still kept with `Retain` as the `persistentVolumeReclaimPolicy` should not happen.
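Roughly, the expected cleanup is equivalent to this (names are placeholders, not the actual code path):

```bash
# Flip the backup PV to Delete so removing its PVC also deletes the backing Azure disk
kubectl patch pv <backup-pv> -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'
kubectl delete pvc <backup-pvc> -n velero
```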
So we still need the debug log to see what happened. If you are not able to share all the logs, you can filter them down to just the Error and Warning entries.
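For example, something like this keeps just those lines from a node-agent pod (the pod name is a placeholder):

```bash
# Keep only error/warning entries from one node-agent pod's logs
kubectl logs -n velero <node-agent-pod> --timestamps | grep -Ei 'level=(error|warning)'
```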
If I can catch another instance of this problem I'll get you the logs. Right now, with the 1.14.1-rc we don't have the issue anymore.
What steps did you take and what happened:
We are using Velero 1.14.0 on Azure AKS with the data mover feature, and are suffering from bug #7898: all our backups are partially failed because the data move is canceled.
We are awaiting 1.14.1, but in the meantime we have an ongoing issue with a side effect of the bug: when the data mover job fails, it doesn't clean up the managed disks that were created for the data mover.
This is causing a runaway cost increase, because we have 2000+ provisioned disks (at the moment) that are not cleaned up and accrue daily costs.
Typical output in the error section of the velero backup is as follows:
Although the DataUpload is "canceled", the remaining resources are not cleaned up. Deleting the faulty backup does not release the created disks either. We tried

```
velero delete backup <FaultyBackupWith163LingeringDisk>
```

but this didn't release the disks (although the backup itself was correctly deleted). We are removing these disks manually right now, but I think they should be cleaned up by the Velero data mover as a last-ditch measure, even if the data move failed.
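In case it helps someone else, this is roughly how we find the orphans on the Azure side (the resource group is a placeholder; double-check `diskState` before deleting anything):

```bash
# List unattached managed disks in the AKS node resource group
az disk list -g <node-resource-group> --query "[?diskState=='Unattached'].name" -o tsv
```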
What did you expect to happen:
When the data mover fails, it should try to clean the lingering resources it created (snapshots & disks).
Or at least, deleting the Failed (or PartiallyFailed) backup should clean up the resources.
Anything else you would like to add:
We have tried the yet-to-be-released release-1.14-dev branch to see if the #7898 "DataUpload is canceled" issue was fixed, and it was. So that's a good point. I think you should release 1.14.1 quickly for those in the same situation as us.
Environment:
- Velero version (use `velero version`): 1.14.0
- Velero features (use `velero client config get features`): features:
- Kubernetes version (use `kubectl version`):
- OS (e.g. from `/etc/os-release`): --