Closed: jkroepke closed this 2 months ago
Could you give more information about the two metrics, velero REQUESTS and velero LIMITS?
Are those metrics for the Velero server pods?
They seem to have default values.
// default values for Velero server pod resource request/limit
DefaultVeleroPodCPURequest = "500m"
DefaultVeleroPodMemRequest = "128Mi"
DefaultVeleroPodCPULimit = "1000m"
DefaultVeleroPodMemLimit = "512Mi"
Yes, that's the Velero pod:
Mem Request: 512.00 MiB Mem Limit: 1.00 GiB
CPU Req: 0.1000 CPU Limit: 1.0000
Memory Usage:
After each peak, Velero was OOMed.
History of memory usage (we do updates each week):
Last 3 hours (note: we have one hourly schedule):
Could you share more details on the backup workload? How many volumes are involved?
10 Volumes via CSI Driver.
A working backup typically saves 1513 Kubernetes resources.
We experience this issue in approx. 10 Kubernetes clusters. All 10 clusters are configured identically (10 volumes, ~1500 resources).
I guess I found the root cause.
For each backup, the Azure plugin is spawned but never terminated.
root 3353897 0.0 0.0 1236380 11452 ? Sl Jun19 3:56 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id 98762dde1ae7a3f1a11e0894ba14c8b4c1f596017280b388b97a104816907683 -address /run/containerd/containerd.sock
65527 3353931 0.0 0.0 972 4 ? Ss Jun19 0:00 \_ /pause
65527 4167441 0.3 0.6 5561796 107432 ? Ssl Jun25 6:59 \_ /velero server --uploader-type=kopia --default-backup-storage-location=opsstack --default-volume-snapshot-locations=velero.io/azure:opsstack --log-format=json --features=EnableCSI --namespace=opsstack
65527 9334 0.0 0.2 1395916 38048 ? Sl Jun25 0:02 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 78788 0.0 0.2 1395916 37500 ? Sl Jun25 0:02 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 148224 0.0 0.2 1395916 36780 ? Sl Jun25 0:02 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 217993 0.0 0.2 1395916 38040 ? Sl Jun25 0:02 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 287368 0.0 0.2 1395916 36944 ? Sl Jun25 0:02 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 356756 0.0 0.2 1395916 36800 ? Sl Jun25 0:01 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 426057 0.0 0.2 1395916 37744 ? Sl Jun25 0:01 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 495363 0.0 0.2 1395916 37848 ? Sl Jun25 0:01 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 564793 0.0 0.2 1395916 36780 ? Sl Jun25 0:01 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 633070 0.0 0.2 1395916 37848 ? Sl Jun25 0:01 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 702396 0.0 0.2 1395916 38216 ? Sl Jun25 0:01 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 771675 0.0 0.2 1395916 37144 ? Sl Jun25 0:01 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 840974 0.0 0.2 1395916 36844 ? Sl Jun25 0:01 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 910286 0.0 0.2 1395916 36400 ? Sl Jun25 0:01 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 979589 0.0 0.2 1395916 37984 ? Sl Jun25 0:01 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 1048695 0.0 0.2 1395916 37004 ? Sl Jun25 0:01 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 1118222 0.0 0.2 1395916 36428 ? Sl Jun25 0:01 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 1187685 0.0 0.2 1395916 37888 ? Sl Jun25 0:01 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 1257170 0.0 0.2 1395916 37488 ? Sl Jun25 0:01 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 1326865 0.0 0.2 1395916 36652 ? Sl Jun25 0:00 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 1397481 0.0 0.2 1395916 38296 ? Sl Jun25 0:00 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 1465597 0.0 0.2 1395916 36980 ? Sl Jun25 0:00 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 1535111 0.0 0.2 1395916 37020 ? Sl Jun25 0:00 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 1604535 0.0 0.2 1395916 37452 ? Sl 00:01 0:00 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 1638944 0.0 0.2 1395916 37176 ? Sl 00:31 0:00 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 1674221 0.0 0.2 1395916 40172 ? Sl 01:01 0:00 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 1743626 0.0 0.2 1395916 35636 ? Sl 02:01 0:00 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 1812856 0.0 0.2 1395916 37392 ? Sl 03:01 0:00 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 1882257 0.0 0.2 1396172 37320 ? Sl 04:01 0:00 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 1951690 0.0 0.2 1395916 36572 ? Sl 05:01 0:00 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 2021143 0.0 0.2 1396172 38152 ? Sl 06:01 0:00 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 2090248 0.0 0.2 1395916 37876 ? Sl 07:01 0:00 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
65527 2159473 0.0 0.2 1395916 37324 ? Sl 08:01 0:00 \_ /plugins/velero-plugin-for-microsoft-azure --log-level info --features EnableCSI
I see. The Azure plugin allocates 100M on startup, so if enough Azure plugin instances pile up, it could cause the OOM.
In our case, we will remove the Azure plugin for now. It was a leftover from the migration to CSI-based backups, and all old Azure-plugin-based backups are out of the rotation.
velero-plugin-for-microsoft-azure is needed for communicating with Azure Object Storage to store the backup and restore metadata.
So maybe it cannot be removed in the AKS environment.
Azure Object Storage to store the backup and restore metadata.
I thought that part was bundled into Velero?
https://github.com/vmware-tanzu/velero/blob/main/pkg/util/azure/storage.go
That part of the code is used by filesystem backup to interact with cloud providers.
Okay, thanks for the confirmation. I guess we may be stuck then.
@blackpiglet Hmm. In my experience, the plugin process typically gets terminated after each backup -- or more specifically, after the controller reconcile exits, since the GRPC connection is terminated. Maybe we should investigate whether 1) something has changed in 1.14 or 2) some issue with one or more plugins is preventing their exit?
We only have the Azure plugin active. We had the CSI plugin in the past, but it is now merged into core Velero as of 1.14.
@sseago Agree. I will investigate whether something has changed in the plugin workflow, but we had met a similar situation in the E2E test. https://github.com/vmware-tanzu/velero-plugin-for-microsoft-azure/pull/210/files This PR was created to mitigate the OOM issue met in the AKS environment, but it's still possible to hit this issue when object storage consumes a significant amount of memory.
So it seems the plugins do not always exit right after the backup completes, and that behavior was not introduced recently.
Our current workaround is a cronjob that restarts Velero every 2 hours.
So it seems the plugins do not always exit right after it completes, and that behavior is not introduced recently
We checked a cluster running Velero 1.13, and we can observe that no plugin instances are left.
Our current workaround is a cronjob that restarts Velero every 2 hours.
@jkroepke The risk here is that if you restart velero while a backup or restore is running, that backup/restore will be marked failed since an in-progress action can't be resumed after restart/crash.
Yep, but without that cronjob, it's guaranteed that Velero gets OOMed on backup.
Does Velero not support graceful shutdown? If a backup/restore is running, I would expect SIGTERM handling to be delayed until the backup/restore process has finished.
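To illustrate the graceful-shutdown behavior being asked for here (this is a generic Go sketch, not Velero's actual shutdown code), a server can latch SIGTERM via `signal.NotifyContext` to stop accepting new work, then wait on in-flight work before exiting. The `run` function and its event names are hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

// run simulates a server that, on SIGTERM, stops accepting new work
// but waits for the in-flight backup to finish before shutting down.
func run() []string {
	var events []string
	var mu sync.Mutex

	// ctx is canceled when SIGTERM arrives; no goroutine is killed.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
	defer stop()

	var inflight sync.WaitGroup
	inflight.Add(1)
	go func() { // simulated in-progress backup
		defer inflight.Done()
		time.Sleep(100 * time.Millisecond)
		mu.Lock()
		events = append(events, "backup finished")
		mu.Unlock()
	}()

	syscall.Kill(syscall.Getpid(), syscall.SIGTERM) // simulate pod termination request
	<-ctx.Done()    // SIGTERM received: stop accepting new work
	inflight.Wait() // but delay shutdown until in-flight work completes
	mu.Lock()
	defer mu.Unlock()
	return append(events, "shutdown")
}

func main() {
	fmt.Println(run()) // the backup completes before shutdown
}
```

In a real controller, the cancellation would also have to propagate so that no new reconciles start after the signal.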
@jkroepke Could you please check the version of the Velero Azure plugin?
@blackpiglet velero/velero-plugin-for-microsoft-azure:v1.10.0. On the initial report we used 1.9, but we then upgraded to 1.10 without any difference.
We have been seeing frequent OOMKilled in restores, where we have Limit set as 1Gi. Tracking this if we figure out some pattern. We use 1.13 in downstream
We have been seeing frequent OOMKilled in restores, where we have Limit set as 1Gi. Tracking this if we figure out some pattern. We use 1.13 in downstream
Please maybe create a dedicated issue for this.
Sure, that will avoid any confusion here.
The issue isn't related only to the Azure plugin. I ran CSI backups using Velero scheduling. All backups completed OK.
Total backups: 21; total plugin instances: 21.
$ ps -ef | grep velero
root 137860 1 0 09:02 ? 00:00:03 /usr/bin/conmon -b /run/containers/storage/overlay-containers/658303d8e801341e45275428224e536422b7f51c0961d22708f3c9af5873d14b/userdata -c 658303d8e801341e45275428224e536422b7f51c0961d22708f3c9af5873d14b --exit-dir /var/run/crio/exits -l /var/log/pods/openshift-adp_velero-77b95849d5-44cjw_2cf677d9-b902-47ae-9c06-42971ac78ab3/velero/0.log --log-level info -n k8s_velero_velero-77b95849d5-44cjw_openshift-adp_2cf677d9-b902-47ae-9c06-42971ac78ab3_0 -P /run/containers/storage/overlay-containers/658303d8e801341e45275428224e536422b7f51c0961d22708f3c9af5873d14b/userdata/conmon-pidfile -p /run/containers/storage/overlay-containers/658303d8e801341e45275428224e536422b7f51c0961d22708f3c9af5873d14b/userdata/pidfile --persist-dir /var/lib/containers/storage/overlay-containers/658303d8e801341e45275428224e536422b7f51c0961d22708f3c9af5873d14b/userdata -r /usr/bin/runc --runtime-arg --root=/run/runc --socket-dir-path /var/run/crio --syslog -u 658303d8e801341e45275428224e536422b7f51c0961d22708f3c9af5873d14b -s
1000830+ 137872 137860 1 09:02 ? 00:01:13 /velero server -features=EnableCSI --uploader-type=kopia --fs-backup-timeout=2880m --restore-resource-priorities=securitycontextconstraints,customresourcedefinitions,klusterletconfigs.config.open-cluster-management.io,managedcluster.cluster.open-cluster-management.io,namespaces,roles,rolebindings,clusterrolebindings,klusterletaddonconfig.agent.open-cluster-management.io,managedclusteraddon.addon.open-cluster-management.io,storageclasses,volumesnapshotclass.snapshot.storage.k8s.io,volumesnapshotcontents.snapshot.storage.k8s.io,volumesnapshots.snapshot.storage.k8s.io,datauploads.velero.io,persistentvolumes,persistentvolumeclaims,serviceaccounts,secrets,configmaps,limitranges,pods,replicasets.apps,clusterclasses.cluster.x-k8s.io,endpoints,services,,clusterbootstraps.run.tanzu.vmware.com,clusters.cluster.x-k8s.io,clusterresourcesets.addons.cluster.x-k8s.io --log-level debug --disable-informer-cache=false
1000830+ 180946 137872 0 09:21 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 192363 137872 0 09:26 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 203868 137872 0 09:31 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 215300 137872 0 09:36 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 226758 137872 0 09:41 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 238226 137872 0 09:46 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 249688 137872 0 09:51 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 261103 137872 0 09:56 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 272539 137872 0 10:01 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 283993 137872 0 10:06 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 295429 137872 0 10:11 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 306872 137872 0 10:16 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 318247 137872 0 10:21 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 329665 137872 0 10:26 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 341128 137872 0 10:31 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 352609 137872 0 10:36 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 364052 137872 0 10:41 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 375590 137872 0 10:46 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 387024 137872 0 10:51 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 398526 137872 0 10:56 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
1000830+ 409917 137872 0 11:01 ? 00:00:00 /plugins/velero-plugin-for-aws --log-level debug --features EnableCSI
core 418429 139617 0 11:05 pts/0 00:00:00 grep --color=auto velero
root 751578 751549 0 Jun24 ? 01:21:51 /velero node-agent server
So it seems the plugins do not always exit right after it completes
We need to investigate the plugin serving mechanism:
@duduvaa
Could you help to collect the debug bundle of the backups by running the velero debug CLI, and upload the generated tarball here?
Velero debug failed to run. Which output is needed? I can try to collect it.
2024/07/10 07:19:32 Collecting velero resources in namespace: openshift-adp
An error occurred: exec failed: Traceback (most recent call last):
velero-debug-collector:23:13: in
@duduvaa We need the entire logs to check the start/stop of the plugin process. It looks like an env problem; could you fix it and get the Velero bundle?
I collected logs & describe output using velero backup logs & velero backup describe (hope you get all the info): backup.tar.gz
@duduvaa Please use velero debug to get the entire logs; backup logs are not enough.
The velero debug CLI does not support in-cluster authentication where a kubeconfig does not exist. I had the same issue (see my first post).
@Lyndon-Li I will try to dig into this further, but all of the controllers seem to handle this similarly. For example, in the backup controller:
pluginManager := b.newPluginManager(backupLog)
defer pluginManager.CleanupClients()
The plugin manager keeps track of all the plugin clients that were created, and CleanupClients calls Stop() on each, which in turn passes the Kill() call into the go-plugin API. Since we're seeing one plugin process not shut down for each backup, I'm guessing that the cleanup/kill is either not being invoked by one of the controllers somehow, or the Kill() call is failing. We don't have any debug logging in these calls, so I'll probably need to find a reproducing case locally and add some extra logging to see what's happening.
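The track-then-kill pattern described above can be sketched as follows. This is a loose, self-contained imitation of the controller code quoted above, not Velero's actual plugin manager; the `manager`, `getPlugin`, and `reconcile` names here are illustrative.

```go
package main

import (
	"fmt"
	"os/exec"
)

// manager loosely mimics the plugin manager: every spawned plugin
// client process is tracked so one deferred CleanupClients() call
// can stop them all at the end of a reconcile.
type manager struct{ clients []*exec.Cmd }

func (m *manager) getPlugin(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	if err := cmd.Start(); err != nil {
		return err
	}
	m.clients = append(m.clients, cmd) // track for later cleanup
	return nil
}

// CleanupClients kills and reaps every tracked client process; any
// client this is never invoked for leaks its process.
func (m *manager) CleanupClients() {
	for _, c := range m.clients {
		c.Process.Kill()
		c.Wait() // reap so no zombie is left behind
	}
}

// reconcile has the same shape as the controller code quoted above:
// create a manager, defer cleanup, use plugins during the reconcile.
func reconcile() int {
	m := &manager{}
	defer m.CleanupClients()
	m.getPlugin("sleep", "60") // stand-ins for plugin binaries
	m.getPlugin("sleep", "60")
	return len(m.clients)
}

func main() {
	fmt.Println("plugins tracked:", reconcile())
}
```

If the deferred cleanup is skipped, or a process is started after the cleanup runs, exactly one untracked child survives per reconcile, which matches the one-leaked-plugin-per-backup symptom.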
I did several tests backing up with the CSI plugin, both with volumes and with no volumes. The plugin process exited normally after backup completion, so I think this issue doesn't appear in normal cases. We need more information about the scenario to understand how this issue is triggered.
@jkroepke Yes, the velero debug CLI runs as a separate process, so it works a bit differently from the velero server CLI.
https://github.com/vmware-tanzu/velero/blob/6fb109f6208ce03fbf0cea380023caa9726077cf/pkg/cmd/cli/debug/debug.go#L197-L205
From this piece of code, the velero debug CLI supports reading the kubeconfig file from the --kubeconfig parameter, and it also supports reading it from the environment variables.
We have this scenario in a test cluster, so let me know what you need. I could also run a custom Velero container with extensive logging, if you provide one.
@jkroepke
I suggest setting up the environment to make the velero debug CLI work.
If velero debug cannot work in the Velero server pod directly, could we create a kubeconfig file and install the Velero CLI in some other environment?
I will look into it, but in parallel, I recommend checking why you have a hard dependency on the kubeconfig file.
Normally, when I'm using the Kubernetes client SDK, it supports in-cluster authentication as well.
The difference is caused by how the velero server and velero debug CLIs work.
velero server calls the client-go function to read the k8s configuration to connect to the kube-apiserver.
The velero debug CLI calls a tool called crashd to collect the related k8s resources, so it's not possible for it to use the same in-cluster mechanism as velero server.
Did you see that I attached a zip file with all the debug logs? According to the issue template, I did all the things velero debug does manually.
@jkroepke,
Can you please share the commands?
thanks
@jkroepke @duduvaa Since both of you have seen this -- are there specific circumstances that reproduce it? Only with datamover? Or CSI-no-datamover? Or fs-backup? With all workloads or only some?
It reproduced with:
- CSI
- CSI + DataMover
Kopia backup: working fine.
We have 2 schedules with only CSI backups enabled, plus Azure Blob Storage for manifests.
@jkroepke @duduvaa That's good information. If kopia backup doesn't do this, this suggests that we're closing the object store plugin processes properly in the regular backup controller. The backup operation controller only reconciles when we have async operations -- which only happens for CSI and Datamover backups, so it may be in that code where this is occurring. It gives us a more focused area to dig into this.
In our case, we do Azure File snapshots via CSI. That might be special.
@jkroepke I'm seeing it in my dev cluster with aws object store plugin and CSI snapshots (no datamover). If I use fs-backup, I don't see it.
Some debug progress, but still more to do. We know this happens only for CSI or CSI+datamover backups. I've isolated this to the backup finalizer controller. From the added logging, we see that in the other controllers we initialize the needed plugins, then at the end of reconcile, the relevant plugins are killed. In the finalizer controller case, though, it's a bit different. In my cluster, 3 plugins were initialized (internal, object store, and the OADP openshift BIA/RIA plugin), then all 3 of these were killed. However, immediately after killing the aws plugin (again, this is due to the defer pluginManager.CleanupClients() call at the end of the finalizer reconcile), a new plugin process starts up. This is triggered by a restartableProcess.ResetIfNeeded() call. I still need to dig into why this is happening at the end of the reconcile. Maybe some embedded defer call somewhere I'm not seeing? I'm not sure yet. I'll dig into it again tomorrow.
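The respawn-after-cleanup sequence described above can be reduced to a toy model. This is a hypothetical sketch of the leak mechanism, not go-plugin's actual implementation: the `restartable` type and its methods only mimic the idea that `ResetIfNeeded()` transparently restarts a stopped child, so any plugin call landing after `CleanupClients()` spawns a fresh process that nothing will ever kill.

```go
package main

import "fmt"

// restartable loosely mimics a restartable plugin process wrapper:
// resetIfNeeded() respawns the child if it is not running.
type restartable struct {
	alive  bool
	spawns int
}

func (r *restartable) resetIfNeeded() {
	if !r.alive {
		r.alive = true
		r.spawns++ // a real implementation would exec the plugin binary here
	}
}

func (r *restartable) invoke() { r.resetIfNeeded() } // any plugin RPC
func (r *restartable) stop()   { r.alive = false }   // Kill() via CleanupClients()

// leakScenario replays the sequence seen in the finalizer controller:
// use the plugin, clean up, then a late call respawns the process.
func leakScenario() int {
	p := &restartable{}
	p.invoke() // reconcile uses the plugin: first spawn
	p.stop()   // deferred CleanupClients() kills it
	p.invoke() // a call after cleanup respawns it: the leaked process
	return p.spawns
}

func main() {
	fmt.Println("processes spawned:", leakScenario())
}
```

Under this model, the fix is to ensure no plugin call can run after the deferred cleanup, or that a respawned process is tracked for cleanup as well.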
If it helps, you can provide a build with extensive logging that I will run on my cluster.
We met the same issue here.
@jkroepke @sseago Hi, I have the same issue. I suspect the code here introduces the process leak: https://github.com/vmware-tanzu/velero/blob/a91d2cb036738263d416463d7cc13ac63915d92f/pkg/backup/backup.go#L729-L743 On my machine, I deleted this section of the code and the leak did not occur, but the exact reason why this code was causing a process leak is unclear.
What steps did you take and what happened:
Set up scheduled backups via the CSI driver. No restic, kopia, or other file-based copy actions are in use.
Memory leak.
In addition to the failing daily schedule, we have an hourly schedule. In Grafana, I can see a memory increase after each backup.
What did you expect to happen:
No memory leak
The following information will help us better understand what's going on:
If you are using velero v1.7.0+:
Please use
velero debug --backup <backupname> --restore <restorename>
to generate the support bundle and attach it to this issue; for more options, please refer to velero debug --help
Does not work from a pod with in-cluster authentication:
If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)
kubectl logs deployment/velero -n velero
velero backup describe <backupname>
or kubectl get backup/<backupname> -n velero -o yaml
velero backup logs <backupname>
velero restore describe <restorename>
or kubectl get restore/<restorename> -n velero -o yaml
velero restore logs <restorename>
velero.zip
Anything else you would like to add:
Environment:
Velero version (use velero version): v1.14.0
Velero features (use velero client config get features): <NOT SET>
Kubernetes version (use kubectl version): v1.29.4
OS (e.g. from /etc/os-release): Azure Linux

Vote on this issue!
This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.