trm109 / amazon-eks-ami-sysctl-tests

Packer configuration for building a custom EKS AMI
https://awslabs.github.io/amazon-eks-ami/
MIT No Attribution

Testing! #6

Open · trm109 opened this issue 3 weeks ago

trm109 commented 3 weeks ago

What happened:

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

trm109 commented 3 weeks ago

/summarize awslabs amazon-eks-ami 1145

github-actions[bot] commented 3 weeks ago

Here's a summary of the GitHub issue reply chain:

Issue: Kubernetes nodes on EKS clusters become unresponsive and lose readiness when applications consume too much memory, causing extended downtime. This doesn't occur on GKE clusters.

Key points:

  1. The problem is traced to insufficient memory reservation for kubelet, set by the EKS bootstrap script.
  2. Current EKS memory reservation (574Mi) is much lower than GKE's (1.8Gi).
  3. Increasing kubelet's reserved memory to 1Gi or 1.8Gi resolves the issue.
  4. The problem affects various instance types, including t3.medium and m5.8xlarge.
  5. It occurs with both containerd and docker-based AMIs.

Proposed solutions and workarounds:

  1. Revise kubeReserved values, particularly for memory.
  2. Consider reverting to GKE values as a temporary fix.
  3. Implement custom node logic that limits the pod count and uses GKE's memory-reservation calculation.
  4. Ensure correct configuration of pod memory limits and requests.
  5. Explore per-pod kube reserved calculation.
  6. Consider implementing a runtime class with pod overheads.

Conclusion: While there's no official fix yet, AWS acknowledges the need to revise kubeReserved values. Users are advised to implement workarounds such as increasing reserved memory, properly configuring pod resources, and considering custom node logic. The issue remains open, with ongoing discussions about the best approach to resolve it permanently.
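
As a concrete illustration of the "increase reserved memory" workaround above, here is a minimal user-data sketch. It assumes the stock /etc/eks/bootstrap.sh entry point shipped with amazon-eks-ami; the cluster name my-cluster is a placeholder, and the tier percentages are GKE's publicly documented reservation formula rather than values taken from the issue itself.

```bash
#!/usr/bin/env bash
# Sketch only: raise kubelet's kube-reserved memory at bootstrap time.
set -euo pipefail

# GKE-style reservation for a node with $1 MiB of memory, using GKE's
# documented tiers: 25% of the first 4 GiB, 20% of the next 4 GiB,
# 10% of the next 8 GiB, 6% of the next 112 GiB, 2% of the remainder.
gke_memory_reservation_mib() {
  local total=$1 reserved=0 chunk i
  local -a caps=(4096 4096 8192 114688)
  local -a pcts=(25 20 10 6)
  for i in "${!caps[@]}"; do
    chunk=$(( total < caps[i] ? total : caps[i] ))
    reserved=$(( reserved + chunk * pcts[i] / 100 ))
    total=$(( total - chunk ))
    if (( total <= 0 )); then
      echo "$reserved"
      return 0
    fi
  done
  echo $(( reserved + total * 2 / 100 ))
}

total_mib=$(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) / 1024 ))
reserve_mib=$(gke_memory_reservation_mib "$total_mib")

# Pass the larger reservation through to kubelet. Issue 1145 reports that
# raising this from the default ~574Mi keeps nodes Ready under memory pressure.
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args "--kube-reserved=memory=${reserve_mib}Mi"
```

On an 8 GiB node this works out to roughly 1.8Gi, matching the GKE figure quoted above; hard-coding `--kube-reserved=memory=1Gi` is the simpler variant mentioned in the summary.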

trm109 commented 3 weeks ago

/summarize awslabs amazon-eks-ami 990

github-actions[bot] commented 3 weeks ago

Here's a summary of the GitHub issue discussion:

Issue: Nodes occasionally fail to boot and get stuck in NotReady state.

Key details:

Likely root cause:

Potential factors:

Proposed solutions:

Next steps:

The issue remains open pending further investigation and fixes.
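
As a generic triage step (not something prescribed in the issue), nodes that are stuck in NotReady can be listed and inspected with standard kubectl commands; `<node-name>` below is a placeholder:

```bash
# List nodes whose STATUS column is not exactly "Ready".
kubectl get nodes --no-headers | awk '$2 != "Ready"'

# Inspect a stuck node; the Conditions and Events sections usually show
# why kubelet never reported readiness.
kubectl describe node <node-name>
```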

trm109 commented 2 weeks ago

/summarize abc=1 xyz 120

trm109 commented 2 weeks ago

/summarize abc=1 xyz 120

trm109 commented 2 weeks ago

/summarize +owner testA +repo testB +issue 10

trm109 commented 2 weeks ago

/summarize awslabs aws-shell 1

trm109 commented 2 weeks ago

/summarize awslabs aws-shell 1

trm109 commented 2 weeks ago

/summarize awslabs aws-shell 1

trm109 commented 2 weeks ago

/summarize awslabs aws-shell 1

trm109 commented 2 weeks ago

/summarize awslabs aws-shell 1

trm109 commented 2 weeks ago

/summarize awslabs aws-shell 1

trm109 commented 2 weeks ago

/summarize awslabs aws-shell 1

trm109 commented 2 weeks ago

/summarize awslabs aws-shell 1

trm109 commented 2 weeks ago

/summarize awslabs aws-shell 1

trm109 commented 2 weeks ago

/summarize awslabs aws-shell 1

trm109 commented 2 weeks ago

/summarize awslabs aws-shell 1

trm109 commented 2 weeks ago

/summarize awslabs aws-shell 1

trm109 commented 2 weeks ago

/summarize awslabs aws-shell 1

trm109 commented 2 weeks ago

/summarize awslabs aws-shell 1

trm109 commented 2 weeks ago

/summarize awslabs aws-shell 1