vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Failed dataupload because of status "InProgress" should recover / retry #7919

Open monotek opened 5 days ago

monotek commented 5 days ago

Describe the problem/challenge you have

Our backups often fail because data uploads get canceled:

  completionTimestamp: "2024-06-24T03:11:20Z"
  message: found a dataupload with status "InProgress" during the node-agent starting,
    mark it as cancel
  node: aks-zone2node-26731349-vmss0000by
  phase: Canceled
  progress: {}
  startTimestamp: "2024-06-24T03:10:58Z"
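
The canceled uploads show up when listing the DataUpload CRs, e.g. with something like the following (a sketch, assuming the default `velero` namespace):

    kubectl -n velero get datauploads.velero.io \
      -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,NODE:.status.node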

I'm not sure if this has something to do with the many OOM kills we see for the node-agent.

I would also like to know why the node-agent needs this much RAM. In our case the memory limit is already set to 6 GB, which is IMHO quite a lot. What exactly is the RAM used for? Uploading files should not need this much memory.
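
For reference, that limit sits in the usual resources stanza of the node-agent container in its DaemonSet; a minimal sketch (only the 6Gi memory limit is from our setup, the request values are placeholders):

    # node-agent DaemonSet, container "node-agent"
    resources:
      requests:
        cpu: 500m        # placeholder value
        memory: 512Mi    # placeholder value
      limits:
        memory: 6Gi      # the limit mentioned above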

Describe the solution you'd like

I would like to see the job doing the upload retried, and only failed once a backoffLimit / activeDeadlineSeconds is reached; both should be configurable.

It would also be nice if a canceled backup were not just retried but continued instead, so already transferred data can be reused.
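
Purely as an illustration, the knobs could look something like this; none of these fields exist in Velero today, the names are made up for this request:

    # hypothetical configuration, not an existing Velero API
    dataUploadRetry:
      backoffLimit: 3               # retry a failed/canceled upload up to 3 times
      activeDeadlineSeconds: 14400  # only give up after this overall deadline
      resumeOnCancel: true          # continue from already transferred data instead of restarting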

Anything else you would like to add:

Environment:

- Velero features (use `velero client config get features`): 

velero client config get features
features:

- Kubernetes version (use `kubectl version`):

Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.5


- Kubernetes installer & version: AKS
- Cloud provider or hardware configuration: Standard_D16ds_v5
- OS (e.g. from `/etc/os-release`): Ubuntu

**Vote on this issue!**

This is an invitation to the Velero community to vote on issues, you can see the project's [top voted issues listed here](https://github.com/vmware-tanzu/velero/issues?q=is%3Aissue+is%3Aopen+sort%3Areactions-%2B1-desc).  
Use the "reaction smiley face" up to the right of this comment to vote.

- :+1: for "The project would be better with this feature added"
- :-1: for "This feature will not enhance the project in a meaningful way"
Lyndon-Li commented 5 days ago

@monotek The memory usage depends on the pattern of the data to be backed up and of the data already in the repo. Please share the info below (rough ways to gather some of it are sketched after the list):

  1. The total data size to be backed up
  2. The average file size and file count to be backed up
  3. How many backups exist in the repo
  4. How many CPU cores in the node
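
If exact numbers are hard to get, rough ones are fine; for example, something like the following approximates items 1 and 3 (a sketch, assuming a default install with the CRs in the `velero` namespace):

    # rough total size via the requested PVC capacities
    kubectl get pvc -A -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,SIZE:.spec.resources.requests.storage
    # backups and repositories known to Velero
    kubectl -n velero get backups.velero.io
    kubectl -n velero get backuprepositories.velero.io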
monotek commented 5 days ago

We're using the data mover to upload PVC snapshots.

1.) That's not known. We have hundreds of PVCs. Not all of them are included in the backups, but it will still be several terabytes. I also can't check the blob storage we're using for the data movement, as we have never been able to finish a single one of these data movement backups successfully.

2.) We don't have this information, and it's also very hard to get: you would have to go inside the pod to get that data, and most of our containers don't even have a shell, so you would need some way to access the node OS (see the sketch below).

3.) There is only one data movement backup configured, which goes over all namespaces but excludes some PVCs. Normal PVC snapshot backups are working without issues.

4.) All our nodes have 16 CPUs.
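
For completeness, the only way I see to gather file counts and sizes would be an ephemeral node debug pod, roughly like this (a sketch; the exact path under `/host` depends on the CSI driver and pod):

    # throwaway pod with the node's filesystem mounted at /host
    kubectl debug node/aks-zone2node-26731349-vmss0000by -it --image=busybox
    # inside the debug pod, inspect a volume's file count and size, e.g.:
    #   find /host/var/lib/kubelet/pods/<pod-uid>/volumes/ -type f | wc -l
    #   du -sh /host/var/lib/kubelet/pods/<pod-uid>/volumes/<plugin>/<volume>/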

Lyndon-Li commented 5 days ago

@monotek Several TB is a large dataset, and 16 CPUs also means high parallelism. There are several known issues around performance and memory usage at this scale. To further troubleshoot and identify the root cause, we need to check your environment and data in detail (we would probably need a live session and to run some commands which may affect your backup data).

Let us know if you are fine with going ahead with that. We will also eventually need to know how much data there is and how it is arranged, so tell us if you cannot find that info in any way.