vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

High memory usage and OOM killed during maintenance tasks #7510

Closed hbollon closed 4 weeks ago

hbollon commented 8 months ago

Hello team, we are using Velero on a new on-premise k8s platform (running k3s) to back up some of our mounted PVCs with the FSB feature. We deployed Velero using the helm chart. We're using the Kopia uploader so that we can use a .kopiaignore file to exclude some paths from the backups. The backup storage is on Scaleway Object Storage and the bucket holds about ~850 GB of backup data (38,723 files).
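
For anyone wondering what the ignore file looks like: Kopia's .kopiaignore sits at the root of the directory being backed up and uses gitignore-style patterns. The paths below are purely illustrative, not the ones from this setup:

    # example .kopiaignore at the root of the backed-up volume (illustrative patterns)
    *.tmp
    cache/
    logs/**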

The first backup is successful, but after that the Velero pod starts to crashloop due to OOM during maintenance tasks (we have configured a 6 GB memory limit for this Velero pod, which should be more than sufficient, no?)
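
For context, a 6 GB limit on the Velero server pod set through the helm chart would look roughly like the values below; the key names follow the vmware-tanzu/velero chart layout and the request values are illustrative, only the 6Gi limit is from this report:

    # illustrative helm values for the Velero server pod
    # (key names assumed from the vmware-tanzu/velero chart)
    resources:
      requests:
        cpu: 500m
        memory: 512Mi
      limits:
        cpu: '1'
        memory: 6Gi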

The last logs I have before the OOM:

velero time="2024-03-08T08:42:15Z" level=info msg="Validating BackupStorageLocation" backup-storage-location=backups/scaleway-default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:141"
velero time="2024-03-08T08:42:15Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=backups/scaleway-default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:126"
velero time="2024-03-08T08:42:17Z" level=warning msg="active indexes [xn0_ca15f0c8c09bc81fb191052050ec1965-sbcd43b3a959b33fa126-c1] deletion watermark 2024-03-07 03:23:10 +0000 UTC" logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[index-blob-manager]" sublevel=error
velero time="2024-03-08T08:42:18Z" level=warning msg="active indexes [xn0_ca15f0c8c09bc81fb191052050ec1965-sbcd43b3a959b33fa126-c1] deletion watermark 2024-03-07 03:23:10 +0000 UTC" logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[index-blob-manager]" sublevel=error
velero time="2024-03-08T08:42:19Z" level=info msg="Running quick maintenance..." logModule=kopia/maintenance logSource="pkg/kopia/kopia_log.go:94" logger name="[shared-manager]"
velero time="2024-03-08T08:42:19Z" level=info msg="Running quick maintenance..." logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:94" logger name="[shared-manager]"
velero time="2024-03-08T08:42:19Z" level=warning msg="active indexes [xn0_ca15f0c8c09bc81fb191052050ec1965-sbcd43b3a959b33fa126-c1] deletion watermark 2024-03-07 03:23:10 +0000 UTC" logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[index-blob-manager]" sublevel=error
velero time="2024-03-08T08:42:19Z" level=info msg="Finished quick maintenance." logModule=kopia/maintenance logSource="pkg/kopia/kopia_log.go:94" logger name="[shared-manager]"
velero time="2024-03-08T08:42:19Z" level=info msg="Finished quick maintenance." logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:94" logger name="[shared-manager]"
velero time="2024-03-08T08:42:20Z" level=info msg="Running maintenance on backup repository" backupRepo=backups/xxx-production-scaleway-default-kopia-lsz8r logSource="pkg/controller/backup_repository_controller.go:285"
velero time="2024-03-08T08:42:21Z" level=warning msg="active indexes [xn0_00f547b5bbe0d3c63853d13cb06dc432-s69be3da756471a94126-c1 [lot of others indexes...] ] deletion watermark 0001-01-01 00:00:00 +0000 UTC" logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[index-blob-manager]" sublevel=error

I tried to give as much context and information as possible, but if you need any other details don't hesitate to ping me; this is quite an urgent issue for us...

What did you expect to happen:

I don't think it's normal that Velero takes so much memory in just a minute during maintenance tasks.

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue; for more options, refer to velero debug --help

bundle-2024-03-08-09-47-09.tar.gz

Environment:


kaovilai commented 8 months ago

Duplicate of https://github.com/vmware-tanzu/velero/issues/7291

hbollon commented 8 months ago

I don't think this is really a duplicate 🤔 I already follow that feature request, which seems like a good idea. But in my case I don't think it's normal behavior that Velero cannot keep running during maintenance tasks even with a 10 GB memory limit, even if maintenance is quite a demanding operation. It's currently a breaking behavior which prevents us from backing up our PVCs. With this issue I'm looking for hints on why Velero uses so much memory in our setup (if that is, indeed, abnormal behavior), or for a workaround to mitigate the problem so we can run our backups.

Lyndon-Li commented 8 months ago

@hbollon Not sure if a live session is allowed on your side; we need to check some factors of the repo, so a live session would be more efficient. If it is not allowed, please let us know and we will look for other ways to troubleshoot.

Moreover, please keep the data in the repo; we may need it for troubleshooting and fix verification, since not every environment can reproduce the issue.

hbollon commented 8 months ago

Hello @Lyndon-Li, a live session is doable on our side, as long as we don't dig into the data present on the PVCs / storage bucket :wink: You can reach me on the Kubernetes Slack to organise it.

thedarkside commented 7 months ago

Observing this on one of our project clusters as well. The Velero pod keeps crashing with OOM. Version 1.13.0.

We have roughly 1 TB of files, mostly images 5-15 MB in size, so there are many files.

Current resources configured:

          resources:
            limits:
              cpu: '1'
              memory: 512Mi
            requests:
              cpu: 500m
              memory: 128Mi

Will try increasing.

thedarkside commented 7 months ago

Yep, bumping those solved it for now!
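
The exact values used after the bump are not stated above; a raised configuration for a repository of this size might look something like the snippet below (the numbers are purely illustrative, not the ones actually used):

    resources:
      requests:
        cpu: 500m
        memory: 1Gi
      limits:
        cpu: '1'
        memory: 4Gi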

contributorr commented 6 months ago

@Lyndon-Li @hbollon any updates on this issue? I'm experiencing a similar problem in one specific environment where the Velero pod crashes with a 2 GB memory limit but somehow works with 4 GB. On the other hand, in multiple (even bigger) environments there's no need to raise the memory limit above 1 GB. Is this specific to 1.13.0 - any chance it's fixed in 1.13.1/2? Thanks

Lyndon-Li commented 6 months ago

@hbollon @contributorr There are multiple memory usage improvements in 1.14, which integrates the latest Kopia release. Velero 1.14 will have an RC next week; you can try the RC release and let us know the result. The improvements should help with the problems we identified in @hbollon's environment.

@contributorr Please note that not all memory usage is irrational; depending on the state of the file system (e.g., more files, smaller files), one environment may need more memory than another.

Lyndon-Li commented 6 months ago

Is this specific to 1.13.0 - any chance it's fixed in 1.13.1/2?

No, it is not specific to 1.13. The improvements will only be in 1.14.

Lyndon-Li commented 5 months ago

@hbollon @contributorr The 1.14 RC is ready: https://github.com/vmware-tanzu/velero/releases/tag/v1.14.0-rc.1. You can try it to see whether it improves your cases.
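
For anyone testing the RC through the helm chart, an image override along these lines should be enough; the key names are assumptions based on the chart's values layout, so verify them against the chart version you use:

    # illustrative values override to test the 1.14 RC
    image:
      repository: velero/velero
      tag: v1.14.0-rc.1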

Lyndon-Li commented 5 months ago

The problem in @hbollon's environment has been reproduced locally. Here are the details:

  1. The problem happens in the repo connection stage, so the high memory usage can occur for most operations, i.e., backup, restore, maintenance.
  2. In order to control fragmentation of the repo index blobs, the Kopia repo does index compaction periodically. For Velero 1.13 and prior, this is done during repo connection.
  3. The index compaction needs to load all the indexes into memory, so it may take a huge amount of memory. The memory usage is linearly correlated with the number of indexes.
  4. According to the local test, it takes up to 16 GB of memory for index blobs of 750 MB in size (if the files are small and random enough, this means around 21 million files); see the rough estimate below.
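
A back-of-envelope reading of point 4 (my arithmetic from the numbers above, not an official figure):

    750 MB of index blobs  ->  ~16 GB peak memory during compaction
    amplification ≈ 16 GB / 0.75 GB ≈ 21x
    i.e. every 1 MB of index blobs may need roughly 20 MB of RAM while all indexes are loaded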

1.14 (which integrates Kopia 0.17) doesn't solve this problem completely, but it does handle it better:

  1. Kopia 0.17 does index compaction only during maintenance, so backup and restore are not affected.
  2. Kopia 0.17 compacts one epoch of indexes per maintenance run, which makes the problem less likely to happen.
  3. Velero 1.14 has moved repo maintenance into a dedicated job, so backups/restores done by the Velero server and node-agent are not affected even when the problem happens.

The problem can still happen in 1.14 when a huge number of indexes is generated in one backup or in consecutive backups within a short time (e.g., 24 hours). So there will be follow-up fixes after 1.14. The plan is to find a way to reduce the number of indexes compacted each time, so that memory usage stays bounded.
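
Since maintenance runs in a dedicated job in 1.14, its resources can be tuned separately from the server pod. The server flags below are my recollection of the 1.14 repository-maintenance docs and should be treated as assumptions to verify against the release documentation:

    # illustrative excerpt of the velero server Deployment args in 1.14
    # (flag names assumed; verify against the Velero 1.14 docs)
    containers:
      - name: velero
        args:
          - server
          - --maintenance-job-mem-request=1Gi
          - --maintenance-job-mem-limit=4Gi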

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

Lyndon-Li commented 1 month ago

The problem still happens for 1.14 when huge number of indexes are generated in one backup or consecutive backups in a short time (e.g., 24 hours). So there will be following up fixes post 1.14

This has been fixed by Kopia upstream PR https://github.com/kopia/kopia/pull/4139 and will be included in Velero 1.15.

Therefore, this issue will be fully fixed in 1.15.