Closed: hbollon closed this issue 4 weeks ago
Duplicate of https://github.com/vmware-tanzu/velero/issues/7291
I don't think this is really a duplicate 🤔 I already follow that feature request, which seems like a good idea. But in my case I don't think it's normal behavior that Velero can't stay running during maintenance tasks even with a 10GB memory limit, demanding as that operation may be. It's currently a breaking behavior which prevents us from backing up our PVCs. With this issue I'm looking for hints on why Velero uses so much memory in our setup (if that is indeed abnormal), or a workaround to mitigate the problem so we can run our backups.
@hbollon Not sure if a live session is allowed on your side; we need to check some factors of the repo, so a live session would be more efficient. If it is not allowed, please let us know and we will look for other ways to troubleshoot.
Moreover, please keep the data in the repo; we may need it for troubleshooting and fix verification, since not every environment can reproduce the issue.
Hello @Lyndon-Li, a live session is doable on our side since we don't dig into the data present on the PVCs / storage bucket :wink: You can reach me on the k8s Slack to organise it.
Observing this on one of our project clusters as well. The Velero pod keeps crashing with OOM. Version 1.13.0.
We have roughly 1TB of files, mostly images 5-15MB in size, so there are many files.
Current resources configured:
resources:
  limits:
    cpu: '1'
    memory: 512Mi
  requests:
    cpu: 500m
    memory: 128Mi
Will try increasing.
Yep, bumping those solved it for now!
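For reference, a sketch of what the bumped values might look like in the Helm chart's resources section (the numbers below are illustrative, not the exact values used; size them to your repository and watch actual usage during maintenance):

```yaml
resources:
  limits:
    cpu: '1'
    memory: 2Gi      # raised from 512Mi; maintenance peaked well above the old limit
  requests:
    cpu: 500m
    memory: 1Gi      # raised so the scheduler reserves enough headroom on the node
```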
@Lyndon-Li @hbollon any updates on this issue? I'm experiencing a similar problem in a specific environment where the Velero pod crashes with a 2GB memory limit but somehow works with 4GB. On the other hand, on multiple (even bigger) environments there's no need to increase the memory limit beyond 1GB. Is this specific to 1.13.0, and is there any chance it's fixed in 1.13.1/2? Thanks
@hbollon @contributorr There are multiple memory usage improvements in 1.14, which integrates the latest Kopia release. Velero 1.14 will have an RC next week; you can try the RC release and let us know the result. The improvements should help with the problems we identified in @hbollon's environment.
@contributorr Please note that not all of the memory usage is irrational; depending on the state of the file system (e.g., more files, smaller files), one environment may take more memory than another.
Is this specific to 1.13.0 - any chance it's fixed in 1.13.1/2?
No, it is not specific to 1.13; the improvements will only be in 1.14.
@hbollon @contributorr 1.14 RC is ready: https://github.com/vmware-tanzu/velero/releases/tag/v1.14.0-rc.1. You can try it and see whether it improves your cases.
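In case it helps anyone testing, a quick and rough way to try the RC image on a throwaway cluster is to swap the images in place. This assumes a default install in the velero namespace with the standard deployment/daemonset and container names, and it skips any CRD changes, so a full reinstall or Helm upgrade is the safer route for anything beyond a quick test:

```sh
# Swap the server and node-agent images to the RC tag (names assumed from a default install)
kubectl -n velero set image deployment/velero velero=velero/velero:v1.14.0-rc.1
kubectl -n velero set image daemonset/node-agent node-agent=velero/velero:v1.14.0-rc.1
```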
The problem in @hbollon's environment was reproduced locally. Here are the details:
1.14 (which integrates Kopia 0.17) doesn't solve this problem completely, but it does handle it better:
The problem still happens in 1.14 when a huge number of indexes is generated in one backup or in consecutive backups within a short time (e.g., 24 hours). So there will be follow-up fixes after 1.14. The plan is to find a way to reduce the number of indexes compacted each time, so that a controllable amount of memory is used.
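As a side note for anyone watching the maintenance runs themselves, the BackupRepository custom resources record repository state per backup location; something along these lines should show when maintenance last ran (the status field name here is an assumption, so verify it against your Velero version with kubectl explain):

```sh
# List backup repositories and their last maintenance time (assumes the default
# "velero" namespace; the lastMaintenanceTime field may differ between versions)
kubectl -n velero get backuprepositories.velero.io \
  -o custom-columns=NAME:.metadata.name,LAST_MAINTENANCE:.status.lastMaintenanceTime
```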
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.
The problem still happens in 1.14 when a huge number of indexes is generated in one backup or in consecutive backups within a short time (e.g., 24 hours). So there will be follow-up fixes after 1.14
This has been fixed by Kopia upstream PR https://github.com/kopia/kopia/pull/4139 and will be included in Velero 1.15.
Therefore, this issue will be fully fixed in 1.15.
Hello team, we are using Velero for a new on-premise k8s platform (using k3s) to back up some of our mounted PVCs using the FSB feature. We have deployed Velero using the Helm chart. We're using it with the Kopia uploader so that we can use a .kopiaignore file to configure some paths to ignore during backups. The backup storage is located on Scaleway Object Storage and the bucket holds about ~850GB of backup data (38,723 files). The first backup is successful, but after that the Velero pod starts to crashloop due to OOM during maintenance tasks (we have configured a 6GB memory limit for this Velero pod, which should be more than sufficient, no?)
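For context, the ignore file uses gitignore-style patterns placed in the backed-up volume; ours looks roughly like this (the entries below are placeholders, not our actual layout):

```
# .kopiaignore at the volume root (illustrative entries only)
tmp/
cache/
*.log
```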
The last logs I have before the OOM:
I tried to give as much context and information as possible, but if you need any other details don't hesitate to ping me; this is quite an urgent issue for us...
What did you expect to happen:
I don't think it's normal for Velero to consume so much memory within just a minute during maintenance tasks.
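To confirm the symptom, both the OOM kill and the memory spike can be seen with standard kubectl commands (assuming the default velero namespace; kubectl top needs metrics-server):

```sh
# Check for "Last State: Terminated, Reason: OOMKilled" on the Velero pod
kubectl -n velero describe pods

# Watch per-container memory usage while maintenance is running (requires metrics-server)
kubectl -n velero top pods --containers
```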
The following information will help us better understand what's going on:
If you are using velero v1.7.0+:
Please use
velero debug --backup <backupname> --restore <restorename>
to generate the support bundle and attach it to this issue. For more options, please refer to velero debug --help.
bundle-2024-03-08-09-47-09.tar.gz
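For reference, the bundle above was generated with a command along these lines (the backup name is a placeholder, not our actual schedule name):

```sh
# Generate a support bundle for a specific backup; add --restore <name> if a restore is involved
velero debug --backup daily-pvc-backup
```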
Environment:
- Velero version (use velero version): v1.13.0
- Velero features (use velero client config get features):
- Kubernetes version (use kubectl version): v1.28.3
- OS (e.g. from /etc/os-release):