kubernetes-sigs / aws-efs-csi-driver

CSI Driver for Amazon EFS https://aws.amazon.com/efs/
Apache License 2.0

efs-plugin Memory Leak in aws-efs-csi-driver:v1.6.0 #1160

Open · booleanbetrayal opened this issue 9 months ago

booleanbetrayal commented 9 months ago

What happened?

Possible regression of #474 ?

The EKS Managed Add-On Amazon EFS CSI Driver v1.6.0-eksbuild.1 appears to have a memory leak in the efs-plugin container:

(Screenshots efs-leak1 and efs-leak2: efs-plugin container memory climbing over time.)

What you expected to happen?

Memory is reclaimed during normal operations.

How to reproduce it (as minimally and precisely as possible)?

Deploy a cluster with the latest EKS Managed Add-On version of the Amazon EFS CSI Driver (default settings) and enable the use of EFS-based PVCs, e.g. as sketched below.
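
For reference, a minimal reproduction sketch. The file system ID, object names, and pod workload are placeholders; the StorageClass parameters follow the driver's documented efs-ap dynamic provisioning mode:

# Hypothetical names/IDs for illustration only.
cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap            # dynamic provisioning via EFS access points
  fileSystemId: fs-0123456789abcdef0  # replace with your EFS file system ID
  directoryPerms: "700"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi                    # size is required but not enforced by EFS
---
apiVersion: v1
kind: Pod
metadata:
  name: efs-app
spec:
  containers:
    - name: app
      image: busybox
      command: ["sh", "-c", "while true; do date >> /data/out.txt; sleep 5; done"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: efs-claim
EOF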

Environment

Please also attach debug logs to help us better diagnose

(Full logs captured; available on direct request due to sensitive values.)

The efs-utils log contains several lines like the following (a sketch for pulling these logs follows the excerpt):

2023-10-03 15:55:20 UTC - INFO - Sending signal SIGKILL(9) to stunnel. PID: 3132, group ID: 3132
2023-10-03 15:55:20 UTC - WARNING - Connection timeout for /var/lib/kubelet/pods/da11ea08-fafe-49b5-adbe-520dd52e6bd9/volumes/kubernetes.io~csi/grafana/mount after 30 sec, SIGKILL has been sent to the potential unhealthy stunnel 3132, restarting a new stunnel process.
2023-10-03 15:55:20 UTC - INFO - Starting TLS tunnel: "/usr/bin/stunnel5 /var/run/efs/stunnel-config.fs-29ff199d.var.lib.kubelet.pods.da11ea08-fafe-49b5-adbe-520dd52e6bd9.volumes.kubernetes.io~csi.grafana.mount.20310"
2023-10-03 15:55:20 UTC - INFO - Started TLS tunnel, pid: 3865
2023-10-03 15:55:20 UTC - WARNING - Child TLS tunnel process 3125 has exited, returncode=-9
2023-10-03 15:55:21 UTC - WARNING - Child TLS tunnel process 3132 has exited, returncode=-9
2023-10-03 15:58:21 UTC - INFO - Sending signal SIGKILL(9) to stunnel. PID: 3732, group ID: 3732
2023-10-03 15:58:21 UTC - WARNING - Connection timeout for /var/lib/kubelet/pods/96b83f54-814d-4e26-854d-3e8868754db6/volumes/kubernetes.io~csi/monitoring-prometheus/mount after 30 sec, SIGKILL has been sent to the potential unhealthy stunnel 3732, restarting a new stunnel process.
2023-10-03 15:58:21 UTC - INFO - Starting TLS tunnel: "/usr/bin/stunnel5 /var/run/efs/stunnel-config.fs-f7e30543.var.lib.kubelet.pods.96b83f54-814d-4e26-854d-3e8868754db6.volumes.kubernetes.io~csi.monitoring-prometheus.mount.20099"
2023-10-03 15:58:21 UTC - INFO - Started TLS tunnel, pid: 4466
2023-10-03 15:58:21 UTC - INFO - Sending signal SIGHUP(1) to stunnel. PID: 3851, group ID: 3851
2023-10-03 15:58:21 UTC - WARNING - Child TLS tunnel process 3732 has exited, returncode=-9
2023-10-03 15:59:19 UTC - INFO - Sending signal SIGKILL(9) to stunnel. PID: 3851, group ID: 3851
2023-10-03 15:59:19 UTC - WARNING - Connection timeout for /var/lib/kubelet/pods/ce6b9dc5-dd64-4d91-adca-48cd449a8be1/volumes/kubernetes.io~csi/monitoring-prometheus/mount after 30 sec, SIGKILL has been sent to the potential unhealthy stunnel 3851, restarting a new stunnel process.
2023-10-03 15:59:19 UTC - INFO - Starting TLS tunnel: "/usr/bin/stunnel5 /var/run/efs/stunnel-config.fs-f7e30543.var.lib.kubelet.pods.ce6b9dc5-dd64-4d91-adca-48cd449a8be1.volumes.kubernetes.io~csi.monitoring-prometheus.mount.20352"
2023-10-03 15:59:19 UTC - INFO - Started TLS tunnel, pid: 4588
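
For anyone trying to confirm the behavior, a rough way to watch the plugin's memory and pull the efs-utils watchdog log. The pod label app=efs-csi-node and the log path are assumptions based on the driver's DaemonSet and efs-utils defaults; the container name efs-plugin is the one reported here:

# Watch per-container memory of the node DaemonSet pods (requires metrics-server).
kubectl -n kube-system top pod -l app=efs-csi-node --containers

# Tail the efs-utils watchdog log inside the efs-plugin container on one node.
NODE_POD=$(kubectl -n kube-system get pods -l app=efs-csi-node -o jsonpath='{.items[0].metadata.name}')
kubectl -n kube-system exec "$NODE_POD" -c efs-plugin -- \
  tail -n 50 /var/log/amazon/efs/mount-watchdog.log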

/kind bug

nm2n commented 7 months ago

We have the same issue with the latest version, 1.7.1.

We can reproduce the issue with the following steps:

Result: the first deployed pod gets stuck in the "Terminating" state, and on the same node the efs-plugin memory usage slowly increases until the pod is killed with OOM.

In addition to the logs mentioned above, we also see plenty of "nfs: server 127.0.0.1 not responding, timed out" messages in dmesg on the EC2 instance (see the checks sketched below).
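
A couple of checks that make this state visible. The DaemonSet label and container name are assumptions (app=efs-csi-node, efs-plugin); node access is via SSH/SSM:

# On the affected EC2 node: look for NFS timeouts against the local stunnel endpoint.
dmesg -T | grep -i 'nfs: server 127.0.0.1 not responding'

# From kubectl: check whether efs-plugin was last terminated with OOMKilled.
kubectl -n kube-system get pods -l app=efs-csi-node -o \
  jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[?(@.name=="efs-plugin")].lastState.terminated.reason}{"\n"}{end}'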

jiangfwa commented 5 months ago

Any updates on this issue? We are facing the same problem.

nm2n commented 5 months ago

The issue is still present in 1.7.4.

irenedo commented 4 months ago

We are facing the same issue. The bug seems to be in the stunnel build packaged with the efs-plugin image, and we were able to fix it by using our own image. The most interesting part is that it works with the same stunnel version that ships in the official image.

I don't know if it's possible to add this fix to the original Dockerfile to resolve the memory leak. This is the Dockerfile we use:

FROM public.ecr.aws/eks-distro-build-tooling/eks-distro-minimal-base-python-builder:3.9-al2 as stunnel_installer

# Build stunnel from source and stage it under /newroot so it ends up at /usr/bin/stunnel5 in the final image.
ENV STUNNEL_VERSION="5.58"
RUN yum install -y gzip-1.5-10.amzn2.0.1 tar-1.26-35.amzn2.0.3 gcc-7.3.1-17.amzn2 make-1:3.82-24.amzn2 openssl-devel-1:1.0.2k-24.amzn2.0.11 && \
    yum -y clean all && rm -rf /var/cache && \
    curl -o stunnel-$STUNNEL_VERSION.tar.gz https://www.stunnel.org/archive/5.x/stunnel-$STUNNEL_VERSION.tar.gz && \
    tar -zxvf stunnel-$STUNNEL_VERSION.tar.gz && \
    cd stunnel-$STUNNEL_VERSION && \
    ./configure --prefix=/newroot/usr && \
    make && \
    make install && \
    mv /newroot/usr/bin/stunnel /newroot/usr/bin/stunnel5 && \
    cd - && \
    rm -rf stunnel-$STUNNEL_VERSION.tar.gz stunnel-$STUNNEL_VERSION

FROM amazon/aws-efs-csi-driver:v1.7.5

# Overlay the freshly built stunnel5 onto the official driver image.
COPY --from=stunnel_installer /newroot /
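
For illustration, one way to build and deploy the custom image. The ECR repository, Helm release name, and value names are placeholders/assumptions; the chart exposes an image override in its values:

# Build and push the patched image (hypothetical ECR repository).
docker build -t 123456789012.dkr.ecr.us-east-1.amazonaws.com/aws-efs-csi-driver:v1.7.5-stunnel558 .
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/aws-efs-csi-driver:v1.7.5-stunnel558

# Point a Helm-installed driver at the custom image (value names assumed from the chart).
helm upgrade aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver \
  -n kube-system \
  --set image.repository=123456789012.dkr.ecr.us-east-1.amazonaws.com/aws-efs-csi-driver \
  --set image.tag=v1.7.5-stunnel558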

This is the memory usage before and after switching to the custom image (screenshot attached).

sstarcher commented 2 months ago

Seeing similar behavior with 1.7.4. We see efs-plugin start around 35 MB and slowly climb over a few days to our 350 MB limit, at which point it gets OOMKilled.
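
A possible stopgap while this is investigated is to raise the efs-plugin memory limit so the climb takes longer to hit it; a sketch assuming the node DaemonSet is named efs-csi-node in kube-system (note that the EKS managed add-on may reconcile manual edits away):

# Raise the efs-plugin memory limit; the strategic merge patch matches the container by name.
kubectl -n kube-system patch daemonset efs-csi-node --type=strategic -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"efs-plugin","resources":{"limits":{"memory":"512Mi"}}}]}}}'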

sstarcher commented 1 month ago

I still see this same behavior in 2.0.2

seanzatzdev-amazon commented 2 weeks ago

@sstarcher I've tried the steps others listed above using driver version v2.0.4 and am unable to recreate the issue; memory usage stays relatively flat. Can you describe your setup, how to reproduce the issue, and the problematic behavior you have encountered?

sstarcher commented 1 week ago

Digging in, I realized that we had a different version pinned and our version is not as new as I was expecting. I'll update to 2.0.4 or greater and try again.