Closed: toredash closed this issue 3 months ago.
This issue is currently awaiting triage.
If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the `triage/accepted` label and provide further guidance. The `triage/accepted` label can be added by org members by writing `/triage accepted` in a comment.
I forgot to mention that this is a duplicate of https://github.com/aws/karpenter-provider-aws/issues/5706. After a dialogue with AWS Support, it was requested that this issue be filed against kubernetes/cloud-provider-aws.
CCM doesn't have any role to play in the lifecycle of an instance. I don't really see what CCM could do other than add further taints to the node, marking it not ready. I agree with you that the most reasonable way forward is to have Karpenter cordon, drain, and remove instances that are stuck for too long in given states, optionally with a flag for enabling/disabling this behavior to address the concern that vital workloads may already have been deployed to such an instance. However, in your case, it seems like even the Pods running on faulty instances are not behaving properly, so I'm not sure removing those instances really is that dangerous.
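For reference, a minimal sketch of that manual flow with standard kubectl commands (the node name is a placeholder; the NodeClaim name is the one from the report below):

```shell
# Cordon the faulty node so no new Pods are scheduled onto it.
kubectl cordon <node-name>

# Drain it, evicting everything except DaemonSet-managed Pods.
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Deleting the backing NodeClaim causes Karpenter to terminate the instance.
kubectl delete nodeclaim standard-instance-store-x6wxs
```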
I agree @olemarkus, and I had a hunch this would be the response to my query as well. I'm in limbo here; I'll see what I can do to get attention from the Karpenter project directly.
I agree with @olemarkus: this isn't related to CCM; it should be tracked in the referenced Karpenter issue.
Description
Observed Behavior: High-level: EC2 instances stuck in the Pending state are not removed by Karpenter.
We are currently experiencing a higher-than-normal number of EC2 instances that have hardware issues and are not functional. These instances stay in the Pending state forever after Karpenter initially provisions them. Since the state of the EC2 instance never transitions from Pending to Running, we assumed that Karpenter would, after a while (15 minutes), mark the instance as not healthy and replace it.
This is a hard-to-reproduce case, as one would need to get an instance that stays in the Pending state.
Some background information:
When describing the instance, status fields are either pending or attaching. AWS support confirmed that the physical server had issues. Note the `State.Name`, `BlockDeviceMappings[].Ebs.Status`, and `NetworkInterfaces[].Attachment.Status` fields from `aws ec2 describe-instances` below (some data removed):
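A query along these lines pulls just those fields (the instance ID is illustrative):

```shell
# Extract only the three status fields of interest for a suspect instance.
aws ec2 describe-instances \
  --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[].Instances[].{State:State.Name,
    EbsStatus:BlockDeviceMappings[].Ebs.Status,
    EniStatus:NetworkInterfaces[].Attachment.Status}'
```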
The nodeclaim:
Relevant logs for nodeclaim `standard-instance-store-x6wxs`:
The EC2 node in question, as seen in Kubernetes:
Note that we are using Cilium as the CNI. In normal operation, Cilium removes the `node.cilium.io/agent-not-ready` taint from the node once the cilium-agent is running on it. The Cilium operator attempts to attach an additional ENI to the host via `ec2:AttachNetworkInterface`. AWS audit log entry below; notice the `errorMessage`:
The strange thing is that the Pending instance seems to be working, sort of. Pods that use `hostNetwork: true` are able to run on this instance, and they seem to work. Kubelet is reporting that the node is ready. Fetching logs from a pod running on the node fails, though:
Error from server: Get "https://10.209.146.79:10250/containerLogs/kube-system/cilium-operator-5695bfbb6b-gm9ch/cilium-operator": remote error: tls: internal error
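For context, the failing fetch was an ordinary log request; reconstructed from the path in the error above, it corresponds to something like:

```shell
# The kubelet on the faulty node fails the TLS handshake serving this request.
kubectl logs -n kube-system cilium-operator-5695bfbb6b-gm9ch -c cilium-operator
```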
Expected Behavior: I'm not really sure, to be honest. The NodeClaim is stuck in Ready: false because Cilium is not removing the taints, since the operator is not able to attach an ENI to the instance. As the EC2 API reports the instance as Pending, I would expect Karpenter to mark the node as failed/not working and remove it.
So what I think should happen is that Karpenter would mark EC2 instances that stay in the Pending state for more than 15 minutes as not ready and decommission them.
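Purely as an illustration of the detection side (not Karpenter code; GNU date syntax assumed), something like this lists instances that are still pending 15+ minutes after launch:

```shell
# List instance IDs and launch times for instances that are still in the
# pending state more than 15 minutes after they were launched.
aws ec2 describe-instances \
  --filters Name=instance-state-name,Values=pending \
  --query "Reservations[].Instances[?LaunchTime<='$(date -u -d '-15 minutes' +%Y-%m-%dT%H:%M:%S)'].[InstanceId,LaunchTime]" \
  --output text
```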
Reproduction Steps (Please include YAML):
Versions:
Karpenter: v0.34.0
Kubernetes (`kubectl version`): Server Version: v1.27.9-eks-5e0fdde