Closed: jkroepke closed this 2 months ago
Could you give more information about the two metrics, velero REQUESTS and velero LIMITS?
Are those metrics for the Velero server pods?
They seem to have default values.
// default values for Velero server pod resource request/limit
DefaultVeleroPodCPURequest = "500m"
DefaultVeleroPodMemRequest = "128Mi"
DefaultVeleroPodCPULimit = "1000m"
DefaultVeleroPodMemLimit = "512Mi"
Yes, that's the Velero pod:
Mem Request: 512.00 MiB Mem Limit: 1.00 GiB
CPU Req: 0.1000 CPU Limit: 1.0000
Memory Usage:
After each peak, Velero was OOMed.
History of memory usage (we do updates each week):
Last 3 hours (note: we have one hourly schedule):
Could you share more details on the backup workload? How many volumes are involved?
10 Volumes via CSI Driver.
A working backup typically saves 1513 Kubernetes resources.
We experience this issue in approx. 10 Kubernetes clusters. All 10 clusters are configured identically (10 volumes, ~1500 resources).
I guess I found the root cause.
For each backup, the Azure plugin is spawned but never terminated.
root 3353897 0.0 0.0 1236380 11452 ? Sl Jun19 3:56 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id 98762dde1ae7a3f1a11e0894ba14c8b4c1f596017280b388b97a104816907683 -address /run/containerd/containerd.sock
65527 3353931 0.0 0.0 972 4 ? Ss Jun19 0:00 \_ /pause
65527 4167441 0.3 0.6 5561796 107432 ? Ssl Jun25 6:59 \_ /velero server --uploader-type=kopia --default-backup-storage-location=opsstack --default-volume-snapshot-locations=velero.io/azure:opsstack --log-format=json --features=EnableCSI --namespace=opsstack
65527 9334 0.0 0.2 1395916 38048 ? Sl Jun25 0:02 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 78788 0.0 0.2 1395916 37500 ? Sl Jun25 0:02 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 148224 0.0 0.2 1395916 36780 ? Sl Jun25 0:02 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 217993 0.0 0.2 1395916 38040 ? Sl Jun25 0:02 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 287368 0.0 0.2 1395916 36944 ? Sl Jun25 0:02 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 356756 0.0 0.2 1395916 36800 ? Sl Jun25 0:01 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 426057 0.0 0.2 1395916 37744 ? Sl Jun25 0:01 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 495363 0.0 0.2 1395916 37848 ? Sl Jun25 0:01 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 564793 0.0 0.2 1395916 36780 ? Sl Jun25 0:01 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 633070 0.0 0.2 1395916 37848 ? Sl Jun25 0:01 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 702396 0.0 0.2 1395916 38216 ? Sl Jun25 0:01 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 771675 0.0 0.2 1395916 37144 ? Sl Jun25 0:01 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 840974 0.0 0.2 1395916 36844 ? Sl Jun25 0:01 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 910286 0.0 0.2 1395916 36400 ? Sl Jun25 0:01 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 979589 0.0 0.2 1395916 37984 ? Sl Jun25 0:01 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 1048695 0.0 0.2 1395916 37004 ? Sl Jun25 0:01 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 1118222 0.0 0.2 1395916 36428 ? Sl Jun25 0:01 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 1187685 0.0 0.2 1395916 37888 ? Sl Jun25 0:01 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 1257170 0.0 0.2 1395916 37488 ? Sl Jun25 0:01 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 1326865 0.0 0.2 1395916 36652 ? Sl Jun25 0:00 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 1397481 0.0 0.2 1395916 38296 ? Sl Jun25 0:00 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 1465597 0.0 0.2 1395916 36980 ? Sl Jun25 0:00 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 1535111 0.0 0.2 1395916 37020 ? Sl Jun25 0:00 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 1604535 0.0 0.2 1395916 37452 ? Sl 00:01 0:00 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 1638944 0.0 0.2 1395916 37176 ? Sl 00:31 0:00 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 1674221 0.0 0.2 1395916 40172 ? Sl 01:01 0:00 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 1743626 0.0 0.2 1395916 35636 ? Sl 02:01 0:00 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 1812856 0.0 0.2 1395916 37392 ? Sl 03:01 0:00 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 1882257 0.0 0.2 1396172 37320 ? Sl 04:01 0:00 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 1951690 0.0 0.2 1395916 36572 ? Sl 05:01 0:00 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 2021143 0.0 0.2 1396172 38152 ? Sl 06:01 0:00 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 2090248 0.0 0.2 1395916 37876 ? Sl 07:01 0:00 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 2159473 0.0 0.2 1395916 37324 ? Sl 08:01 0:00 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
I see. The Azure plugin allocates 100M on startup, so if enough Azure plugin instances pile up, it could cause the OOM.
In our case, we will remove the Azure plugin for now. It was a leftover from the migration to CSI-based backups, and all old Azure-plugin-based backups are out of the rotation.
velero-plugin-for-microsoft-azure is needed for communicating with Azure Object Storage to store the backup and restore metadata.
So maybe it cannot be removed in the AKS environment.
Azure Object Storage to store the backup and restore metadata.
I thought that part was bundled into Velero?
https://github.com/vmware-tanzu/velero/blob/main/pkg/util/azure/storage.go
That part of the code is used by filesystem backup to interact with cloud providers.
Okay, thanks for the confirmation. I guess we may be stuck then.
@blackpiglet Hmm. In my experience, the plugin process typically gets terminated after each backup -- or more specifically, after the controller reconcile exits, since the GRPC connection is terminated. Maybe we should investigate whether 1) something has changed in 1.14 or 2) some issue with one or more plugins is preventing their exit?
We only have the Azure plugin active. We had the CSI plugin in the past, but it is now merged into core Velero as of 1.14.
@sseago Agree. I will investigate whether something has changed in the plugin workflow, but we had met a similar situation in the E2E test. https://github.com/vmware-tanzu/velero-plugin-for-microsoft-azure/pull/210/files This PR was created to mitigate the OOM issue met in the AKS environment, but it's still possible to hit this issue when object storage consumes a significant amount of memory.
So it seems the plugins do not always exit right after the backup completes, and that behavior was not introduced recently.
Our current workaround is a cronjob that restarts Velero every 2 hours.
So it seems the plugins do not always exit right after it completes, and that behavior is not introduced recently
We checked a cluster running Velero 1.13, and we can observe that no plugin instances are left.
Our current workaround is a cronjob that restarts Velero every 2 hours.
@jkroepke The risk here is that if you restart velero while a backup or restore is running, that backup/restore will be marked failed since an in-progress action can't be resumed after restart/crash.
Yep, but without that cronjob, it's guaranteed that Velero gets OOMed on backup.
Does Velero not support graceful shutdown? If a backup/restore is running, I would expect SIGTERM handling to be delayed until the backup/restore process has finished.
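To illustrate the graceful-shutdown behavior being asked for here (this is a generic Go sketch, not Velero's actual shutdown code), a server can latch SIGTERM via `signal.NotifyContext` to stop accepting new work, then wait on in-flight work before exiting. The `run` function and its event names are hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

// run simulates a server that, on SIGTERM, stops accepting new work
// but waits for the in-flight backup to finish before shutting down.
func run() []string {
	var events []string
	var mu sync.Mutex

	// ctx is canceled when SIGTERM arrives; no goroutine is killed.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
	defer stop()

	var inflight sync.WaitGroup
	inflight.Add(1)
	go func() { // simulated in-progress backup
		defer inflight.Done()
		time.Sleep(100 * time.Millisecond)
		mu.Lock()
		events = append(events, "backup finished")
		mu.Unlock()
	}()

	syscall.Kill(syscall.Getpid(), syscall.SIGTERM) // simulate pod termination request
	<-ctx.Done()    // SIGTERM received: stop accepting new work
	inflight.Wait() // but delay shutdown until in-flight work completes
	mu.Lock()
	defer mu.Unlock()
	return append(events, "shutdown")
}

func main() {
	fmt.Println(run()) // the backup completes before shutdown
}
```

In a real controller, the cancellation would also have to propagate so that no new reconciles start after the signal.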
@jkroepke Could you please check the version of the Velero Azure plugin?
@blackpiglet velero/velero-plugin-for-microsoft-azure:v1.10.0. On the initial report we used 1.9, but we then upgraded to 1.10 without any difference.
We have been seeing frequent OOMKilled in restores, where we have Limit set as 1Gi. Tracking this if we figure out some pattern. We use 1.13 in downstream
We have been seeing frequent OOMKilled in restores, where we have Limit set as 1Gi. Tracking this if we figure out some pattern. We use 1.13 in downstream
Please maybe create a dedicated issue for this.
Sure, that will avoid any confusion here.
The issue isn't related only to the Azure plugin. I ran CSI backups using Velero scheduling. All backups completed OK.
Total backups: 21; total plugin instances: 21.
$ ps -ef | grep velero
root 137860 1 0 09:02 ? 00:00:03 /usr/bin/conmon -b /run/containers/storage/overlay-containers/658303d8e801341e45275428224e536422b7f51c0961d22708f3c9af5873d14b/userdata -c 658303d8e801341e45275428224e536422b7f51c0961d22708f3c9af5873d14b --exit-dir /var/run/crio/exits -l /var/log/pods/openshift-adp_velero-77b95849d5-44cjw_2cf677d9-b902-47ae-9c06-42971ac78ab3/velero/0.log --log-level info -n k8s_velero_velero-77b95849d5-44cjw_openshift-adp_2cf677d9-b902-47ae-9c06-42971ac78ab3_0 -P /run/containers/storage/overlay-containers/658303d8e801341e45275428224e536422b7f51c0961d22708f3c9af5873d14b/userdata/conmon-pidfile -p /run/containers/storage/overlay-containers/658303d8e801341e45275428224e536422b7f51c0961d22708f3c9af5873d14b/userdata/pidfile --persist-dir /var/lib/containers/storage/overlay-containers/658303d8e801341e45275428224e536422b7f51c0961d22708f3c9af5873d14b/userdata -r /usr/bin/runc --runtime-arg --root=/run/runc --socket-dir-path /var/run/crio --syslog -u 658303d8e801341e45275428224e536422b7f51c0961d22708f3c9af5873d14b -s
1000830+ 137872 137860 1 09:02 ? 00:01:13 /velero server -features=EnableCSI --uploader-type=kopia --fs-backup-timeout=2880m --restore-resource-priorities=securitycontextconstraints,customresourcedefinitions,klusterletconfigs.config.open-cluster-management.io,managedcluster.cluster.open-cluster-management.io,namespaces,roles,rolebindings,clusterrolebindings,klusterletaddonconfig.agent.open-cluster-management.io,managedclusteraddon.addon.open-cluster-management.io,storageclasses,volumesnapshotclass.snapshot.storage.k8s.io,volumesnapshotcontents.snapshot.storage.k8s.io,volumesnapshots.snapshot.storage.k8s.io,datauploads.velero.io,persistentvolumes,persistentvolumeclaims,serviceaccounts,secrets,configmaps,limitranges,pods,replicasets.apps,clusterclasses.cluster.x-k8s.io,endpoints,services,,clusterbootstraps.run.tanzu.vmware.com,clusters.cluster.x-k8s.io,clusterresourcesets.addons.cluster.x-k8s.io --log-level debug --disable-informer-cache=false
1000830+ 180946 137872 0 09:21 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 192363 137872 0 09:26 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 203868 137872 0 09:31 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 215300 137872 0 09:36 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 226758 137872 0 09:41 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 238226 137872 0 09:46 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 249688 137872 0 09:51 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 261103 137872 0 09:56 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 272539 137872 0 10:01 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 283993 137872 0 10:06 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 295429 137872 0 10:11 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 306872 137872 0 10:16 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 318247 137872 0 10:21 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 329665 137872 0 10:26 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 341128 137872 0 10:31 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 352609 137872 0 10:36 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 364052 137872 0 10:41 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 375590 137872 0 10:46 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 387024 137872 0 10:51 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 398526 137872 0 10:56 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 409917 137872 0 11:01 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
core 418429 139617 0 11:05 pts/0 00:00:00 grep --color=auto velero
root 751578 751549 0 Jun24 ? 01:21:51 /velero node-agent server
So it seems the plugins do not always exit right after it completes
We need to investigate the plugin serving mechanism:
@duduvaa
Could you help to collect the debug bundle of the backups by running the velero debug CLI, and upload the generated tarball here?
Velero debug failed to run. Which output is needed? I can try to collect it.
2024/07/10 07:19:32 Collecting velero resources in namespace: openshift-adp
An error occurred: exec failed: Traceback (most recent call last):
velero-debug-collector:23:13: in
@duduvaa We need the entire logs to check the start/stop of the plugin process. It looks like an env problem; could you fix it and get the Velero bundle?
I collected logs & describe output using velero backup logs & velero backup describe (hope you get all the info): backup.tar.gz
@duduvaa Please use velero debug to get the entire logs; backup logs are not enough.
The velero debug CLI does not support in-cluster authentication where a kubeconfig does not exist. I had the same issue (see my first post).
@Lyndon-Li I will try to dig into this further, but all of the controllers seem to handle this similarly. For example, in the backup controller:
pluginManager := b.newPluginManager(backupLog)
defer pluginManager.CleanupClients()
The plugin manager keeps track of all the plugin clients that were created, and CleanupClients calls Stop() on each, which in turn passes the Kill() call into the go-plugin API. Since we're seeing one plugin process not shut down for each backup, I'm guessing that the cleanup/kill is either not being invoked by one of the controllers somehow, or the Kill() call is failing. We don't have any debug logging in these calls, so I'll probably need to find a reproducing case locally and add some extra logging to see what's happening.
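The track-then-kill pattern described above can be sketched as follows. This is a loose, self-contained imitation of the controller code quoted above, not Velero's actual plugin manager; the `manager`, `getPlugin`, and `reconcile` names here are illustrative.

```go
package main

import (
	"fmt"
	"os/exec"
)

// manager loosely mimics the plugin manager: every spawned plugin
// client process is tracked so one deferred CleanupClients() call
// can stop them all at the end of a reconcile.
type manager struct{ clients []*exec.Cmd }

func (m *manager) getPlugin(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	if err := cmd.Start(); err != nil {
		return err
	}
	m.clients = append(m.clients, cmd) // track for later cleanup
	return nil
}

// CleanupClients kills and reaps every tracked client process; any
// client this is never invoked for leaks its process.
func (m *manager) CleanupClients() {
	for _, c := range m.clients {
		c.Process.Kill()
		c.Wait() // reap so no zombie is left behind
	}
}

// reconcile has the same shape as the controller code quoted above:
// create a manager, defer cleanup, use plugins during the reconcile.
func reconcile() int {
	m := &manager{}
	defer m.CleanupClients()
	m.getPlugin("sleep", "60") // stand-ins for plugin binaries
	m.getPlugin("sleep", "60")
	return len(m.clients)
}

func main() {
	fmt.Println("plugins tracked:", reconcile())
}
```

If the deferred cleanup is skipped, or a process is started after the cleanup runs, exactly one untracked child survives per reconcile, which matches the one-leaked-plugin-per-backup symptom.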
I did several tests backing up with the CSI plugin, both with volumes and with no volumes. The plugin process exited normally after backup completion, so I think this issue doesn't appear in normal cases. We need more information about the scenario to understand how this issue is triggered.
@jkroepke Yes, the velero debug CLI runs as a separate process, so it works a bit differently from the velero server CLI.
https://github.com/vmware-tanzu/velero/blob/6fb109f6208ce03fbf0cea380023caa9726077cf/pkg/cmd/cli/debug/debug.go#L197-L205
From this piece of code, the velero debug CLI supports reading the kubeconfig file from the --kubeconfig parameter, and it also supports reading it from the environment variables.
We have this scenario in a test cluster, so let me know what you need. I could also run a custom Velero container with extensive logging, if you provide one.
@jkroepke
I suggest setting up the environment to make the velero debug CLI work.
If velero debug cannot work in the Velero server pod directly, could we create a kubeconfig file and install the Velero CLI in some other environment?
I will look into it, but in parallel, I recommend checking why you have a hard dependency on the kubeconfig file.
Normally, when I'm using the Kubernetes client SDK, it supports in-cluster authentication as well.
The difference is caused by how the velero server and velero debug CLIs work.
velero server calls the client-go function to read the k8s configuration to connect to the kube-apiserver.
The velero debug CLI calls a tool called crashd to collect the related k8s resources, so it's not possible for it to use the same in-cluster mechanism as velero server.
Did you see that I attached a zip file with all the debug logs? According to the issue template, I did all the things velero debug does manually.
@jkroepke,
Can you please share the commands?
thanks
@jkroepke @duduvaa Since both of you have seen this -- are there specific circumstances that reproduce it? Only with datamover? Or CSI-no-datamover? Or fs-backup? With all workloads or only some?
It reproduced with:
- CSI
- CSI + DataMover
Kopia backup: working fine.
We have 2 schedules with only CSI backups enabled, plus Azure Blob Storage for manifests.
@jkroepke @duduvaa That's good information. If kopia backup doesn't do this, this suggests that we're closing the object store plugin processes properly in the regular backup controller. The backup operation controller only reconciles when we have async operations -- which only happens for CSI and Datamover backups, so it may be in that code where this is occurring. It gives us a more focused area to dig into this.
In our case, we do Azure File snapshots via CSI. That might be special.
@jkroepke I'm seeing it in my dev cluster with aws object store plugin and CSI snapshots (no datamover). If I use fs-backup, I don't see it.
Some debug progress, but still more to do. We know this happens only for CSI or CSI+datamover backups. I've isolated this to the backup finalizer controller. From the added logging, we see that in the other controllers we initialize the needed plugins, then at the end of reconcile, the relevant plugins are killed. In the finalizer controller case, though, it's a bit different. In my cluster, 3 plugins were initialized (internal, object store, and the OADP openshift BIA/RIA plugin), then all 3 of these were killed. However, immediately after killing the aws plugin (again, this is due to the defer pluginManager.CleanupClients() call at the end of the finalizer reconcile), a new plugin process starts up. This is triggered by a restartableProcess.ResetIfNeeded() call. I still need to dig into why this is happening at the end of the reconcile. Maybe some embedded defer call somewhere I'm not seeing? I'm not sure yet. I'll dig into it again tomorrow.
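The respawn-after-cleanup sequence described above can be reduced to a toy model. This is a hypothetical sketch of the leak mechanism, not go-plugin's actual implementation: the `restartable` type and its methods only mimic the idea that `ResetIfNeeded()` transparently restarts a stopped child, so any plugin call landing after `CleanupClients()` spawns a fresh process that nothing will ever kill.

```go
package main

import "fmt"

// restartable loosely mimics a restartable plugin process wrapper:
// resetIfNeeded() respawns the child if it is not running.
type restartable struct {
	alive  bool
	spawns int
}

func (r *restartable) resetIfNeeded() {
	if !r.alive {
		r.alive = true
		r.spawns++ // a real implementation would exec the plugin binary here
	}
}

func (r *restartable) invoke() { r.resetIfNeeded() } // any plugin RPC
func (r *restartable) stop()   { r.alive = false }   // Kill() via CleanupClients()

// leakScenario replays the sequence seen in the finalizer controller:
// use the plugin, clean up, then a late call respawns the process.
func leakScenario() int {
	p := &restartable{}
	p.invoke() // reconcile uses the plugin: first spawn
	p.stop()   // deferred CleanupClients() kills it
	p.invoke() // a call after cleanup respawns it: the leaked process
	return p.spawns
}

func main() {
	fmt.Println("processes spawned:", leakScenario())
}
```

Under this model, the fix is to ensure no plugin call can run after the deferred cleanup, or that a respawned process is tracked for cleanup as well.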
If it helps, you can provide a build with extensive logging that I will run on my cluster.
We met the same issue here.
@jkroepke @sseago Hi, I have the same issue. I suspect the code here introduces the process leak: https://github.com/vmware-tanzu/velero/blob/a91d2cb036738263d416463d7cc13ac63915d92f/pkg/backup/backup.go#L729-L743 On my machine, I deleted this section of the code and the leak did not occur, but the exact reason why this code was causing a process leak is unclear.
What steps did you take and what happened:
Set up scheduled backups via the CSI driver. No restic, kopia, or other file-based copy actions are in use.
Memory leak.
In addition to the failing daily schedule, we have an hourly schedule. In Grafana, I can see a memory increase after each backup.
What did you expect to happen:
No memory leak
The following information will help us better understand what's going on:
If you are using velero v1.7.0+:
Please use
velero debug --backup <backupname> --restore <restorename>
to generate the support bundle and attach it to this issue; for more options, please refer to velero debug --help
Does not work from a pod with in-cluster authentication:
If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)
kubectl logs deployment/velero -n velero
velero backup describe <backupname>
or kubectl get backup/<backupname> -n velero -o yaml
velero backup logs <backupname>
velero restore describe <restorename>
or kubectl get restore/<restorename> -n velero -o yaml
velero restore logs <restorename>
velero.zip
Anything else you would like to add:
Environment:
Velero version (use velero version): v1.14.0
Velero features (use velero client config get features): <NOT SET>
Kubernetes version (use kubectl version): v1.29.4
OS (e.g. from /etc/os-release): Azure Linux

Vote on this issue!
This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.