booleanbetrayal opened this issue 1 year ago
We have the same issue with the latest version 1.7.1
We can reproduce the issue with the following steps:
Result: the first deployed pod gets stuck in the "Terminating" state, and on the same node the efs-driver memory usage slowly increases until the pod is killed with OOM.
In addition to the logs mentioned above, we also see plenty of `nfs: server 127.0.0.1 not responding, timed out` messages in dmesg on the EC2 instance.
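For anyone trying to confirm the same symptoms, a rough way to watch both signals is sketched below; the kube-system namespace and the app=efs-csi-node label are assumptions based on a default install, so adjust them for your deployment.

```sh
# Watch per-container memory of the EFS CSI node pods (requires metrics-server).
# Namespace and label selector assume a default install; adjust as needed.
kubectl top pods -n kube-system -l app=efs-csi-node --containers

# On the affected EC2 instance, look for the NFS timeouts mentioned above.
sudo dmesg -T | grep "server 127.0.0.1 not responding"
```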
Any updates on this issue? We are facing the same issue.
Issue is still present in 1.7.4
We are facing the same issue. The bug seems to be in the stunnel version packaged with efs-plugin, and we were able to fix it by building our own image. The most interesting part is that this works with the same stunnel version shipped in the official repo.
I don't know if it's possible to add this fix to the original Dockerfile to fix this memory leak.
```dockerfile
FROM public.ecr.aws/eks-distro-build-tooling/eks-distro-minimal-base-python-builder:3.9-al2 as stunnel_installer

ENV STUNNEL_VERSION="5.58"

RUN yum install -y tar gzip-1.5-10.amzn2.0.1 tar-1.26-35.amzn2.0.3 gcc-7.3.1-17.amzn2 make-1:3.82-24.amzn2 openssl-devel-1:1.0.2k-24.amzn2.0.11 && \
    yum -y clean all && rm -rf /var/cache && \
    curl -o stunnel-$STUNNEL_VERSION.tar.gz https://www.stunnel.org/archive/5.x/stunnel-$STUNNEL_VERSION.tar.gz && \
    tar -zxvf stunnel-$STUNNEL_VERSION.tar.gz && \
    cd stunnel-$STUNNEL_VERSION && \
    ./configure --prefix=/newroot/ && \
    make && \
    make install && \
    mv /newroot/usr/bin/stunnel /newroot/usr/bin/stunnel5 && \
    cd - && \
    rm -rf stunnel-$STUNNEL_VERSION.tar.gz stunnel-$STUNNEL_VERSION

FROM amazon/aws-efs-csi-driver:v1.7.5
COPY --from=stunnel_installer /newroot /
```
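In case it helps anyone who wants to try the same workaround, a rough build-and-swap sequence is sketched below; the registry and tag are placeholders, and the efs-csi-node DaemonSet and efs-plugin container names assume a standard kube-system install, so verify them against your cluster first.

```sh
# Build the patched image from the Dockerfile above and push it to your own registry
# (repository and tag below are placeholders).
docker build -t <your-registry>/aws-efs-csi-driver:v1.7.5-stunnel-5.58 .
docker push <your-registry>/aws-efs-csi-driver:v1.7.5-stunnel-5.58

# Point the node DaemonSet's efs-plugin container at the custom image.
kubectl -n kube-system set image daemonset/efs-csi-node \
  efs-plugin=<your-registry>/aws-efs-csi-driver:v1.7.5-stunnel-5.58
```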
This is the memory usage before and after using this custom image
Seeing similar behavior using 1.7.4. We see efs-plugin start around 35 MB and slowly climb over a few days to our limit of 350 MB, at which point it gets OOMKilled.
I still see this same behavior in 2.0.2
@sstarcher I've tried the steps others listed above using driver version v2.0.4 and am unable to recreate the issue; memory usage stays relatively flat. Can you describe your setup, how to reproduce the issue, and the problematic behavior you have encountered?
Digging in I realized that we had a different version pinned and our version is not as new as I was expecting. I'll update to 2.0.4 or greater and try again.
@sstarcher Any update on if you could recreate this?
I have not been able to reproduce
I am seeing similar issues running v2.0.7-eksbuild.1
Is it safe to set a lower limit and let this continue to OOM?
Hey @TNonet, how many pods are you running on your cluster? Also, does the memory usage plateau/cap under 1.5 GB, or does it continue to increase substantially? What throughput mode is your EFS using?
We are running thousands of pods, but only ~30-50 of them across a dozen or fewer nodes use EFS.
I have not seen it go above 1.5 GB and it does seem to plateau (sometimes at less than a GB), but until recently I was not tracking this data, so it is possible it went higher.
We are using EFS with Elastic throughput and General Purpose performance mode (controlled via Terraform).
I see, this may just be due to the proxy process in versions 2.0+ using more memory than efs-utils did, as it multiplexes connections. We advise setting memory limits in accordance with your workloads, so I'd say keep that increased limit in place. If you see the memory usage continue to climb significantly, please update us here.
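If it is useful to anyone applying that advice, one possible way to raise the limit is a strategic merge patch against the node DaemonSet, sketched below; the efs-csi-node/efs-plugin names assume a standard kube-system install and the 1536Mi value is only illustrative, so size it for your own workloads.

```sh
# Raise the memory limit on the efs-plugin container of the node DaemonSet.
# Resource names assume a default install; the limit value is illustrative.
kubectl -n kube-system patch daemonset efs-csi-node --patch '
spec:
  template:
    spec:
      containers:
        - name: efs-plugin
          resources:
            limits:
              memory: 1536Mi
'
```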
@TNonet have you seen any significant increase in the memory usage?
We have had a few pods hit 2-2.5 GB, but nothing beyond that.
What happened?
Possible regression of #474?

EKS Managed Add-On Amazon EFS CSI Driver v1.6.0-eksbuild.1 appears to have a memory leak in the efs-plugin container.

What you expected to happen?
Memory is reclaimed during normal operations.
How to reproduce it (as minimally and precisely as possible)?
Deploy a cluster with the latest EKS Managed Add-On version of the Amazon EFS CSI Driver (defaults) and enable the use of EFS-based PVCs.
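For reference, a minimal setup along those lines might look like the sketch below; the cluster name, file system ID, and add-on version are placeholders, and the StorageClass parameters follow the driver's dynamic provisioning example rather than anything specific to this report.

```sh
# Install the managed add-on (cluster name and version are placeholders).
aws eks create-addon --cluster-name my-cluster \
  --addon-name aws-efs-csi-driver --addon-version v1.6.0-eksbuild.1

# Create an EFS-backed StorageClass and PVC for workloads to mount
# (fs-0123456789abcdef0 is a placeholder file system ID).
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-0123456789abcdef0
  directoryPerms: "700"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi
EOF
```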
Environment
Kubernetes version (use kubectl version): v1.27.4-eks-2d98532
Driver version: 1.6.0-eksbuild.1
Please also attach debug logs to help us better diagnose
(Full logs captured and available through direct request due to sensitive values)
efs-utils has several lines such as the following:

/kind bug