kubernetes-sigs / aws-efs-csi-driver

CSI Driver for Amazon EFS https://aws.amazon.com/efs/
Apache License 2.0

efs-plugin Memory Leak in aws-efs-csi-driver:v1.6.0 #1160

Open booleanbetrayal opened 1 year ago

booleanbetrayal commented 1 year ago

What happened?

Possible regression of #474 ?

EKS Managed Add-On Amazon EFS CSI Driver v1.6.0-eksbuild.1 appears to have a memory leak in efs-plugin container:

(Attached screenshots efs-leak1 and efs-leak2 show the efs-plugin container's memory usage climbing over time.)

What you expected to happen?

Memory is reclaimed during normal operations.

How to reproduce it (as minimally and precisely as possible)?

Deploy a cluster with the latest version of the Amazon EFS CSI Driver EKS Managed Add-On (default settings) and enable the use of EFS-based PVCs.
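For illustration, a minimal StorageClass/PVC pair of the kind in use here (a sketch only; the file system ID and names below are placeholders, not values from this cluster):

kubectl apply -f - <<'EOF'
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-12345678   # placeholder EFS file system ID
  directoryPerms: "700"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi
EOF

(EFS ignores the requested size, but the PVC field is required; 5Gi is arbitrary.)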

Environment

Please also attach debug logs to help us better diagnose

(Full logs captured and available through direct request due to sensitive values)
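For reference, the logs were pulled with commands along these lines (assuming the default kube-system install and the driver's app=efs-csi-node DaemonSet label; adjust for your setup):

# Plugin container logs from the node DaemonSet pods
kubectl logs -n kube-system -l app=efs-csi-node -c efs-plugin --tail=500

# efs-utils / watchdog logs (the source of the stunnel messages below) are written
# inside the efs-plugin container, typically under /var/log/amazon/efs/
# (assumes the image ships cat; otherwise copy the files off the node)
kubectl exec -n kube-system <efs-csi-node-pod> -c efs-plugin -- cat /var/log/amazon/efs/mount-watchdog.log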

The efs-utils logs contain several lines like the following:

2023-10-03 15:55:20 UTC - INFO - Sending signal SIGKILL(9) to stunnel. PID: 3132, group ID: 3132
2023-10-03 15:55:20 UTC - WARNING - Connection timeout for /var/lib/kubelet/pods/da11ea08-fafe-49b5-adbe-520dd52e6bd9/volumes/kubernetes.io~csi/grafana/mount after 30 sec, SIGKILL has been sent to the potential unhealthy stunnel 3132, restarting a new stunnel process.
2023-10-03 15:55:20 UTC - INFO - Starting TLS tunnel: "/usr/bin/stunnel5 /var/run/efs/stunnel-config.fs-29ff199d.var.lib.kubelet.pods.da11ea08-fafe-49b5-adbe-520dd52e6bd9.volumes.kubernetes.io~csi.grafana.mount.20310"
2023-10-03 15:55:20 UTC - INFO - Started TLS tunnel, pid: 3865
2023-10-03 15:55:20 UTC - WARNING - Child TLS tunnel process 3125 has exited, returncode=-9
2023-10-03 15:55:21 UTC - WARNING - Child TLS tunnel process 3132 has exited, returncode=-9
2023-10-03 15:58:21 UTC - INFO - Sending signal SIGKILL(9) to stunnel. PID: 3732, group ID: 3732
2023-10-03 15:58:21 UTC - WARNING - Connection timeout for /var/lib/kubelet/pods/96b83f54-814d-4e26-854d-3e8868754db6/volumes/kubernetes.io~csi/monitoring-prometheus/mount after 30 sec, SIGKILL has been sent to the potential unhealthy stunnel 3732, restarting a new stunnel process.
2023-10-03 15:58:21 UTC - INFO - Starting TLS tunnel: "/usr/bin/stunnel5 /var/run/efs/stunnel-config.fs-f7e30543.var.lib.kubelet.pods.96b83f54-814d-4e26-854d-3e8868754db6.volumes.kubernetes.io~csi.monitoring-prometheus.mount.20099"
2023-10-03 15:58:21 UTC - INFO - Started TLS tunnel, pid: 4466
2023-10-03 15:58:21 UTC - INFO - Sending signal SIGHUP(1) to stunnel. PID: 3851, group ID: 3851
2023-10-03 15:58:21 UTC - WARNING - Child TLS tunnel process 3732 has exited, returncode=-9
2023-10-03 15:59:19 UTC - INFO - Sending signal SIGKILL(9) to stunnel. PID: 3851, group ID: 3851
2023-10-03 15:59:19 UTC - WARNING - Connection timeout for /var/lib/kubelet/pods/ce6b9dc5-dd64-4d91-adca-48cd449a8be1/volumes/kubernetes.io~csi/monitoring-prometheus/mount after 30 sec, SIGKILL has been sent to the potential unhealthy stunnel 3851, restarting a new stunnel process.
2023-10-03 15:59:19 UTC - INFO - Starting TLS tunnel: "/usr/bin/stunnel5 /var/run/efs/stunnel-config.fs-f7e30543.var.lib.kubelet.pods.ce6b9dc5-dd64-4d91-adca-48cd449a8be1.volumes.kubernetes.io~csi.monitoring-prometheus.mount.20352"
2023-10-03 15:59:19 UTC - INFO - Started TLS tunnel, pid: 4588

/kind bug

nm2n commented 1 year ago

We have the same issue with the latest version 1.7.1

We can reproduce the issue with the following steps:

Result: the first deployed pod gets stuck in the "Terminating" state, and on the same node the efs-driver memory usage slowly increases until the pod is OOM-killed.

In addition to the logs mentioned above, we also see plenty of nfs: server 127.0.0.1 not responding, timed out messages in dmesg on the EC2 instance.
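Roughly how we watch for both symptoms (a sketch; assumes metrics-server is installed, the default kube-system namespace, and the app=efs-csi-node label; angle brackets are placeholders):

# Per-container memory of the EFS CSI node pods
kubectl top pod -n kube-system -l app=efs-csi-node --containers

# The stuck pod on the affected node, plus the NFS timeouts from the kernel log
kubectl get pods -A --field-selector spec.nodeName=<node-name> | grep Terminating
dmesg | grep 'nfs: server 127.0.0.1 not responding'   # run on the EC2 instance itself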

jiangfwa commented 9 months ago

Any updates on this issue? We are facing the same problem.

nm2n commented 9 months ago

The issue is still present in 1.7.4.

irenedo commented 9 months ago

We are facing the same issue. The bug seems to be in the stunnel build packaged with efs-plugin, and we were able to work around it by building our own image. The most interesting part is that this works even though it uses the same stunnel version shipped in the official repo.

I don't know if it's possible to add this fix to the original Dockerfile to resolve the memory leak.

FROM public.ecr.aws/eks-distro-build-tooling/eks-distro-minimal-base-python-builder:3.9-al2 as stunnel_installer

ENV STUNNEL_VERSION="5.58"
RUN yum install -y gzip-1.5-10.amzn2.0.1 tar-1.26-35.amzn2.0.3 gcc-7.3.1-17.amzn2 make-1:3.82-24.amzn2 openssl-devel-1:1.0.2k-24.amzn2.0.11 && \
    yum -y clean all && rm -rf /var/cache && \
    curl -o stunnel-$STUNNEL_VERSION.tar.gz https://www.stunnel.org/archive/5.x/stunnel-$STUNNEL_VERSION.tar.gz && \
    tar -zxvf stunnel-$STUNNEL_VERSION.tar.gz && \
    cd stunnel-$STUNNEL_VERSION && \
    ./configure --prefix=/newroot/usr && \
    make && \
    make install && \
    mv /newroot/usr/bin/stunnel /newroot/usr/bin/stunnel5 && \
    cd - && \
    rm -rf stunnel-$STUNNEL_VERSION.tar.gz stunnel-$STUNNEL_VERSION

FROM amazon/aws-efs-csi-driver:v1.7.5

COPY --from=stunnel_installer /newroot /

This is the memory usage before and after using the custom image (screenshot attached).
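In case it helps anyone else, roughly how we build and roll out the custom image (a sketch: the registry name is a placeholder, and the image.repository/image.tag values assume a Helm-based install; a managed add-on would need a different rollout path):

docker build -t <your-registry>/aws-efs-csi-driver:v1.7.5-stunnel-rebuild .
docker push <your-registry>/aws-efs-csi-driver:v1.7.5-stunnel-rebuild

helm upgrade aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver \
  --namespace kube-system \
  --reuse-values \
  --set image.repository=<your-registry>/aws-efs-csi-driver \
  --set image.tag=v1.7.5-stunnel-rebuild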

sstarcher commented 6 months ago

Seeing similar behavior using 1.7.4. We see efs-plugin using 35 MB, and it slowly climbs over a few days to our limit of 350 MB and gets OOMKilled.

sstarcher commented 6 months ago

I still see this same behavior in 2.0.2

seanzatzdev-amazon commented 5 months ago

@sstarcher I've tried the steps others listed above using driver version v2.0.4 and am unable to recreate the issue; memory usage stays relatively flat. Can you describe your setup, how to reproduce the issue, and the problematic behavior you have encountered?

sstarcher commented 4 months ago

Digging in, I realized that we had a different version pinned and our version is not as new as I was expecting. I'll update to 2.0.4 or greater and try again.

seanzatzdev-amazon commented 2 months ago

@sstarcher Any update on if you could recreate this?

sstarcher commented 2 months ago

I have not been able to reproduce it.

TNonet commented 2 months ago

I am seeing similar issues running v2.0.7-eksbuild.1.

(screenshot of efs-plugin memory usage attached)

Is it safe to set a lower limit and let this continue to OOM?

seanzatzdev-amazon commented 2 months ago

Hey @TNonet how many pods are you running on your cluster? Also, does the memory usage plateau/cap under 1.5GB, or does it continue to increase substantially? What throughput mode is your EFS using?

TNonet commented 2 months ago

We are running thousands of pods, but only ~30-50 of them across a dozen or fewer nodes use EFS.

I have not seen it go above 1.5 GB, and it does seem to plateau (sometimes at less than a GB), but until recently I was not tracking this data, so it is possible it has gone higher.

We are using EFS with elastic throughput and general-purpose performance mode (configured via Terraform).

seanzatzdev-amazon commented 2 months ago

I see. This may just be due to the proxy process in versions 2.0+ using more memory than efs-utils, since it multiplexes connections. We advise setting memory limits in accordance with your workloads, so I'd say keep that increased limit in place. If you see the memory usage continue to climb significantly, please update us here.
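For a self-managed install, raising the efs-plugin limit looks roughly like this (a sketch only: 1536Mi is an arbitrary example value, the DaemonSet name assumes the default efs-csi-node install, and with the EKS managed add-on you would set this through the add-on configuration instead, since the add-on manages these fields):

kubectl -n kube-system set resources daemonset/efs-csi-node \
  -c efs-plugin --limits=memory=1536Mi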

seanzatzdev-amazon commented 1 month ago

@TNonet have you seen any significant increase in the memory usage?

TNonet commented 1 month ago

We have had a few pods hit 2-2.5 GB, but nothing beyond that.