Open monotek opened 5 days ago
@monotek The memory usage depends on the pattern of the data to be backed up and of the data already in the repo. Please share the info below:
We're using the data mover to upload PVC snapshots.
2.) We don't have this information, and it's also very hard to get: you would have to go inside the pod to collect it, and most of our containers don't even have a shell, so you would need some way to access the node OS.
3.) There is only a data mover backup configured, which goes over all namespaces but excludes some PVCs. Normal PVC snapshot backups are working without issues.
4.) All our nodes have 16 CPUs.
@monotek Several TB is a large dataset, and 16 CPUs also means high parallelism. There are several known performance/memory usage issues in these large-scale cases. To troubleshoot further and identify the root cause, we need to check your env and data in detail (we probably need a live session to run some commands, which may affect your backup data).
Let us know if you are fine with that. We will eventually need to know how much data there is and how it is arranged, so let us know if you cannot find that info in any way.
Describe the problem/challenge you have
Our backups often fail because data uploads get canceled.
Not sure if it has something to do with the many OOM kills we see for the node-agent?
I would also like to know what the node agent needs this much RAM for. In our case the RAM limit is already set to 6GB, which is imho quite a lot. What exactly is the RAM used for? Uploading some files should not need this much RAM.
Describe the solution you'd like
I would like to see a retry for the job doing the upload, so that it only fails once backoffLimit / activeDeadlineSeconds is reached. Both should be configurable.
It would also be nice if a canceled backup were not just retried but continued instead, so already transferred data can be reused.
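To illustrate the retry semantics requested above, here is a minimal sketch. It is not Velero code: `run_with_retries`, `upload`, and `base_delay` are hypothetical names, and the parameters are only loosely modeled on the Kubernetes Job fields `backoffLimit` and `activeDeadlineSeconds`. Resuming a partially transferred upload is not covered here.

```python
import time
from typing import Callable


def run_with_retries(
    upload: Callable[[], bool],
    backoff_limit: int,
    active_deadline_seconds: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
    base_delay: float = 1.0,
) -> bool:
    """Call upload() until it succeeds, the retries are used up,
    or the overall deadline has passed.

    Hypothetical helper: returns True on success, False otherwise.
    """
    start = clock()
    failures = 0
    while True:
        # Give up once the overall deadline is exceeded.
        if clock() - start >= active_deadline_seconds:
            return False
        if upload():
            return True
        failures += 1
        # Allow backoff_limit retries after the first failed attempt.
        if failures > backoff_limit:
            return False
        # Exponential backoff between attempts, capped at 60 seconds.
        sleep(min(base_delay * 2 ** (failures - 1), 60.0))
```

With `backoff_limit=2`, an upload that always fails would be attempted three times (one initial attempt plus two retries) before the whole operation is reported as failed.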
Anything else you would like to add:
Environment:
Velero version (use velero version):
Velero features (use velero client config get features): features:
Kubernetes version (use kubectl version):
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.5