kubernetes / cloud-provider-aws

Cloud provider for AWS
https://cloud-provider-aws.sigs.k8s.io/
Apache License 2.0

Karpenter does not terminate instances in Pending state #892

Closed: toredash closed this issue 3 months ago

toredash commented 3 months ago

Description

Observed Behavior: High-level: EC2 instances stuck in the Pending state are not removed by Karpenter.

We are currently experiencing a higher-than-normal number of EC2 instances that have hardware issues and are not functional. These instances remain in the Pending state forever after Karpenter initially provisions them. Since the instance state never transitions from Pending to Running, we assumed that Karpenter would, after a while (15 min), mark the instance as unhealthy and replace it.

This is a hard case to reproduce, as one would need to get an instance that stays in the Pending state.

Some background information:

When describing the instance, the status fields are either pending or attaching. AWS support confirmed that the physical server had issues. Note the State.Name, BlockDeviceMappings[].Ebs.Status, and NetworkInterfaces[].Attachment.Status fields from aws ec2 describe-instances (some data removed):

"AmiLaunchIndex": 0,
"ImageId": "ami-0daf4f79825bf900f",
"InstanceId": "i-078e295d1e5549ea3",
"InstanceType": "i3.4xlarge",
"LaunchTime": "2024-02-21T05:58:45+00:00",
"Monitoring": {
    "State": "disabled"
},
"Placement": {
    "AvailabilityZone": "eu-north-1c",
    "GroupName": "",
    "Tenancy": "default"
},
"State": {
    "Code": 0,
    "Name": "pending"
},
"StateTransitionReason": "",
"BlockDeviceMappings": [
    {
        "DeviceName": "/dev/xvda",
        "Ebs": {
            "AttachTime": "2024-02-21T05:58:46+00:00",
            "DeleteOnTermination": true,
            "Status": "attaching",
            "VolumeId": "vol-01c8e9f683dfa7b06"
        }
    }
],
"ClientToken": "fleet-b9a41f87-b59d-4b3e-8612-0ea00715ca68-0",
"EbsOptimized": false,
"EnaSupport": true,
"Hypervisor": "xen",
"InstanceLifecycle": "spot",
"NetworkInterfaces": [
    {
        "Attachment": {
            "AttachTime": "2024-02-21T05:58:45+00:00",
            "AttachmentId": "eni-attach-0cdbb641145ccf6bc",
            "DeleteOnTermination": true,
            "DeviceIndex": 0,
            "Status": "attaching",
            "NetworkCardIndex": 0
        },
    }
],

"SourceDestCheck": true,
"SpotInstanceRequestId": "sir-9xzpngzn",

The nodeclaim:

Name:         standard-instance-store-x6wxs
Namespace:    
Labels:       karpenter.k8s.aws/instance-category=i
              karpenter.k8s.aws/instance-cpu=16
              karpenter.k8s.aws/instance-encryption-in-transit-supported=false
              karpenter.k8s.aws/instance-family=i3
              karpenter.k8s.aws/instance-generation=3
              karpenter.k8s.aws/instance-hypervisor=xen
              karpenter.k8s.aws/instance-local-nvme=3800
              karpenter.k8s.aws/instance-memory=124928
              karpenter.k8s.aws/instance-network-bandwidth=5000
              karpenter.k8s.aws/instance-size=4xlarge
              karpenter.sh/capacity-type=spot
              karpenter.sh/nodepool=standard-instance-store
              kubernetes.io/arch=amd64
              kubernetes.io/os=linux
              node.kubernetes.io/instance-type=i3.4xlarge
              topology.kubernetes.io/region=eu-north-1
              topology.kubernetes.io/zone=eu-north-1c
Annotations:  karpenter.k8s.aws/ec2nodeclass-hash: 14690241518068856330
              karpenter.k8s.aws/tagged: true
              karpenter.sh/managed-by: X
              karpenter.sh/nodepool-hash: 9268174783651286961
API Version:  karpenter.sh/v1beta1
Kind:         NodeClaim
Metadata:
  Creation Timestamp:  2024-02-21T05:57:38Z
  Finalizers:
    karpenter.sh/termination
  Generate Name:  standard-instance-store-
  Generation:     1
  Owner References:
    API Version:           karpenter.sh/v1beta1
    Block Owner Deletion:  true
    Kind:                  NodePool
    Name:                  standard-instance-store
    UID:                   a2aa544f-3e9f-4e08-b15f-ecd17bd8e512
  Resource Version:        954875751
  UID:                     e8e4e85f-8366-4f82-9f52-e5de137ee79f
Spec:
  Kubelet:
    Cluster DNS:
      10.255.0.10
    System Reserved:
      Cpu:                  250m
      Ephemeral - Storage:  6Gi
      Memory:               200Mi
  Node Class Ref:
    Name:  standard-instance-store
  Requirements:
    Key:       karpenter.k8s.aws/instance-local-nvme
    Operator:  Gt
    Values:
      50
    Key:       karpenter.sh/nodepool
    Operator:  In
    Values:
      standard-instance-store
    Key:       node.kubernetes.io/instance-type
    Operator:  In
    Values:
      c5d.12xlarge
      c5d.18xlarge
      c5d.24xlarge
      c5d.4xlarge
      c5d.9xlarge
      c5d.metal
      g4dn.12xlarge
      g4dn.16xlarge
      g4dn.4xlarge
      g4dn.8xlarge
      g4dn.metal
      g5.12xlarge
      g5.16xlarge
      g5.24xlarge
      g5.48xlarge
      g5.4xlarge
      g5.8xlarge
      i3.16xlarge
      i3.4xlarge
      i3.8xlarge
      i3.metal
      i3en.12xlarge
      i3en.24xlarge
      i3en.6xlarge
      i3en.metal
      i4i.12xlarge
      i4i.16xlarge
      i4i.24xlarge
      i4i.32xlarge
      i4i.4xlarge
      i4i.8xlarge
      i4i.metal
      m5d.12xlarge
      m5d.16xlarge
      m5d.24xlarge
      m5d.4xlarge
      m5d.8xlarge
      m5d.metal
      m6idn.12xlarge
      m6idn.16xlarge
      m6idn.24xlarge
      m6idn.32xlarge
      m6idn.4xlarge
      m6idn.8xlarge
      m6idn.metal
      r5d.12xlarge
      r5d.16xlarge
      r5d.24xlarge
      r5d.4xlarge
      r5d.8xlarge
      r5d.metal
      r5dn.12xlarge
      r5dn.16xlarge
      r5dn.24xlarge
      r5dn.4xlarge
      r5dn.8xlarge
      r5dn.metal
      r6idn.12xlarge
      r6idn.16xlarge
      r6idn.24xlarge
      r6idn.32xlarge
      r6idn.4xlarge
      r6idn.8xlarge
      r6idn.metal
      x2idn.16xlarge
      x2idn.24xlarge
      x2iedn.4xlarge
      x2iedn.8xlarge
    Key:       topology.kubernetes.io/zone
    Operator:  In
    Values:
      eu-north-1c
    Key:       karpenter.sh/capacity-type
    Operator:  In
    Values:
      on-demand
      spot
    Key:       karpenter.k8s.aws/instance-cpu
    Operator:  Gt
    Values:
      15
    Key:       kubernetes.io/arch
    Operator:  In
    Values:
      amd64
    Key:       kubernetes.io/os
    Operator:  In
    Values:
      linux
  Resources:
    Requests:
      Cpu:                  1200m
      Ephemeral - Storage:  1140Mi
      Memory:               2262733312
      Pods:                 14
  Startup Taints:
    Effect:  NoExecute
    Key:     node.cilium.io/agent-not-ready
    Value:   true
Status:
  Allocatable:
    Cpu:                  15640m
    Ephemeral - Storage:  3412483807232
    Memory:               112429Mi
    Pods:                 234
  Capacity:
    Cpu:                  16
    Ephemeral - Storage:  3800G
    Memory:               115558Mi
    Pods:                 234
  Conditions:
    Last Transition Time:  2024-02-21T05:59:36Z
    Message:               StartupTaint "node.cilium.io/agent-not-ready=true:NoExecute" still exists
    Reason:                StartupTaintsExist
    Status:                False
    Type:                  Initialized
    Last Transition Time:  2024-02-21T05:58:45Z
    Status:                True
    Type:                  Launched
    Last Transition Time:  2024-02-21T05:59:36Z
    Message:               StartupTaint "node.cilium.io/agent-not-ready=true:NoExecute" still exists
    Reason:                StartupTaintsExist
    Status:                False
    Type:                  Ready
    Last Transition Time:  2024-02-21T05:59:20Z
    Status:                True
    Type:                  Registered
  Image ID:                ami-0daf4f79825bf900f
  Node Name:               ip-10-209-146-79.eu-north-1.compute.internal
  Provider ID:             aws:///eu-north-1c/i-078e295d1e5549ea3
Events:                    <none>

Relevant logs for nodeclaim standard-instance-store-x6wxs:

{
    "level": "INFO",
    "time": "2024-02-21T05:57:38.903Z",
    "logger": "controller.disruption",
    "message": "created nodeclaim",
    "commit": "17d6c05",
    "nodepool": "standard-instance-store",
    "nodeclaim": "standard-instance-store-x6wxs",
    "requests": {
        "cpu": "1200m",
        "ephemeral-storage": "1140Mi",
        "memory": "2262733312",
        "pods": "14"
    },
    "instance-types": "c5d.12xlarge, c5d.18xlarge, c5d.24xlarge, c5d.4xlarge, c5d.9xlarge and 63 other(s)"
}
{
    "level": "ERROR",
    "time": "2024-02-21T05:57:39.964Z",
    "logger": "controller",
    "message": "Reconciler error",
    "commit": "17d6c05",
    "controller": "nodeclaim.lifecycle",
    "controllerGroup": "karpenter.sh",
    "controllerKind": "NodeClaim",
    "NodeClaim": {
        "name": "standard-instance-store-x6wxs"
    },
    "namespace": "",
    "name": "standard-instance-store-x6wxs",
    "reconcileID": "ea2d9f2d-b7a9-4061-b65f-c1721321ee0c",
    "error": "launching nodeclaim, creating instance, getting launch template configs, getting launch templates, no instance types satisfy requirements of amis ami-0f58878e44a8ebf11, ami-007e086d128149684, ami-007e086d128149684"
}
{
    "level": "ERROR",
    "time": "2024-02-21T05:57:40.978Z",
    "logger": "controller",
    "message": "Reconciler error",
    "commit": "17d6c05",
    "controller": "nodeclaim.lifecycle",
    "controllerGroup": "karpenter.sh",
    "controllerKind": "NodeClaim",
    "NodeClaim": {
        "name": "standard-instance-store-x6wxs"
    },
    "namespace": "",
    "name": "standard-instance-store-x6wxs",
    "reconcileID": "b3374597-7339-4cb0-8970-30a58d1629d7",
    "error": "launching nodeclaim, creating instance, getting launch template configs, getting launch templates, no instance types satisfy requirements of amis ami-0f58878e44a8ebf11, ami-007e086d128149684, ami-007e086d128149684"
}
{
    "level": "ERROR",
    "time": "2024-02-21T05:57:42.993Z",
    "logger": "controller",
    "message": "Reconciler error",
    "commit": "17d6c05",
    "controller": "nodeclaim.lifecycle",
    "controllerGroup": "karpenter.sh",
    "controllerKind": "NodeClaim",
    "NodeClaim": {
        "name": "standard-instance-store-x6wxs"
    },
    "namespace": "",
    "name": "standard-instance-store-x6wxs",
    "reconcileID": "17e39ff8-17f3-48f1-986e-04c350a1d027",
    "error": "launching nodeclaim, creating instance, getting launch template configs, getting launch templates, no instance types satisfy requirements of amis ami-0f58878e44a8ebf11, ami-007e086d128149684, ami-007e086d128149684"
}
{
    "level": "ERROR",
    "time": "2024-02-21T05:57:47.008Z",
    "logger": "controller",
    "message": "Reconciler error",
    "commit": "17d6c05",
    "controller": "nodeclaim.lifecycle",
    "controllerGroup": "karpenter.sh",
    "controllerKind": "NodeClaim",
    "NodeClaim": {
        "name": "standard-instance-store-x6wxs"
    },
    "namespace": "",
    "name": "standard-instance-store-x6wxs",
    "reconcileID": "122c3c6d-1173-4dba-b79f-d74fc301864d",
    "error": "launching nodeclaim, creating instance, getting launch template configs, getting launch templates, no instance types satisfy requirements of amis ami-0f58878e44a8ebf11, ami-007e086d128149684, ami-007e086d128149684"
}
{
    "level": "ERROR",
    "time": "2024-02-21T05:57:55.022Z",
    "logger": "controller",
    "message": "Reconciler error",
    "commit": "17d6c05",
    "controller": "nodeclaim.lifecycle",
    "controllerGroup": "karpenter.sh",
    "controllerKind": "NodeClaim",
    "NodeClaim": {
        "name": "standard-instance-store-x6wxs"
    },
    "namespace": "",
    "name": "standard-instance-store-x6wxs",
    "reconcileID": "56b23a71-168e-4329-a4d0-b0dcab98c1d1",
    "error": "launching nodeclaim, creating instance, getting launch template configs, getting launch templates, no instance types satisfy requirements of amis ami-0f58878e44a8ebf11, ami-007e086d128149684, ami-007e086d128149684"
}
{
    "level": "ERROR",
    "time": "2024-02-21T05:58:11.037Z",
    "logger": "controller",
    "message": "Reconciler error",
    "commit": "17d6c05",
    "controller": "nodeclaim.lifecycle",
    "controllerGroup": "karpenter.sh",
    "controllerKind": "NodeClaim",
    "NodeClaim": {
        "name": "standard-instance-store-x6wxs"
    },
    "namespace": "",
    "name": "standard-instance-store-x6wxs",
    "reconcileID": "8d9503d3-1c09-444a-94ac-a12fb6a014a9",
    "error": "launching nodeclaim, creating instance, getting launch template configs, getting launch templates, no instance types satisfy requirements of amis ami-0f58878e44a8ebf11, ami-007e086d128149684, ami-007e086d128149684"
}
{
    "level": "INFO",
    "time": "2024-02-21T05:58:45.711Z",
    "logger": "controller.nodeclaim.lifecycle",
    "message": "launched nodeclaim",
    "commit": "17d6c05",
    "nodeclaim": "standard-instance-store-x6wxs",
    "provider-id": "aws:///eu-north-1c/i-078e295d1e5549ea3",
    "instance-type": "i3.4xlarge",
    "zone": "eu-north-1c",
    "capacity-type": "spot",
    "allocatable": {
        "cpu": "15640m",
        "ephemeral-storage": "3412483807232",
        "memory": "112429Mi",
        "pods": "234"
    }
}
{
    "level": "ERROR",
    "time": "2024-02-21T06:07:44.710Z",
    "logger": "controller.disruption.queue",
    "message": "failed to disrupt nodes, command reached timeout after 10m5.783616799s; waiting for replacement initialization, nodeclaim standard-instance-store-x6wxs not initialized",
    "commit": "17d6c05",
    "command-id": "02d6d2c0-c823-410a-a3cb-1f6479bc2b3c",
    "nodes": "ip-10-209-146-248.eu-north-1.compute.internal"
}

The EC2 node in question in kubernetes:

Name:               ip-10-209-146-79.eu-north-1.compute.internal
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=i3.4xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=eu-north-1
                    failure-domain.beta.kubernetes.io/zone=eu-north-1c
                    k8s.io/cloud-provider-aws=b7aae9ddc981b649535117c46866cfc4
                    karpenter.k8s.aws/instance-category=i
                    karpenter.k8s.aws/instance-cpu=16
                    karpenter.k8s.aws/instance-encryption-in-transit-supported=false
                    karpenter.k8s.aws/instance-family=i3
                    karpenter.k8s.aws/instance-generation=3
                    karpenter.k8s.aws/instance-hypervisor=xen
                    karpenter.k8s.aws/instance-local-nvme=3800
                    karpenter.k8s.aws/instance-memory=124928
                    karpenter.k8s.aws/instance-network-bandwidth=5000
                    karpenter.k8s.aws/instance-size=4xlarge
                    karpenter.sh/capacity-type=spot
                    karpenter.sh/nodepool=standard-instance-store
                    karpenter.sh/registered=true
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-209-146-79.eu-north-1.compute.internal
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=i3.4xlarge
                    topology.kubernetes.io/region=eu-north-1
                    topology.kubernetes.io/zone=eu-north-1c
Annotations:        alpha.kubernetes.io/provided-node-ip: 10.209.146.79
                    karpenter.k8s.aws/ec2nodeclass-hash: 14690241518068856330
                    karpenter.sh/managed-by: X
                    karpenter.sh/nodepool-hash: 9268174783651286961
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 21 Feb 2024 06:59:20 +0100
Taints:             node.cilium.io/agent-not-ready=true:NoExecute
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-209-146-79.eu-north-1.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Wed, 21 Feb 2024 08:48:35 +0100
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Wed, 21 Feb 2024 08:46:57 +0100   Wed, 21 Feb 2024 06:59:20 +0100   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 21 Feb 2024 08:46:57 +0100   Wed, 21 Feb 2024 06:59:20 +0100   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 21 Feb 2024 08:46:57 +0100   Wed, 21 Feb 2024 06:59:20 +0100   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 21 Feb 2024 08:46:57 +0100   Wed, 21 Feb 2024 06:59:36 +0100   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   10.209.146.79
  InternalDNS:  ip-10-209-146-79.eu-north-1.compute.internal
  Hostname:     ip-10-209-146-79.eu-north-1.compute.internal
Capacity:
  cpu:                16
  ephemeral-storage:  3708852832Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             125680016Ki
  pods:               234
Allocatable:
  cpu:                15640m
  ephemeral-storage:  3410562571544
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             122475920Ki
  pods:               234
System Info:
  Machine ID:                 147fb4aeea144e9b81c7d74e1385102f
  System UUID:                ec201972-0cf5-7dcf-239e-bb62c07f1bed
  Boot ID:                    92c15038-7ea5-49be-b6d3-65739f260338
  Kernel Version:             5.10.209-198.812.amzn2.x86_64
  OS Image:                   Amazon Linux 2
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.7.11
  Kubelet Version:            v1.27.9-eks-5e0fdde
  Kube-Proxy Version:         v1.27.9-eks-5e0fdde
ProviderID:                   aws:///eu-north-1c/i-078e295d1e5549ea3
Non-terminated Pods:          (3 in total)
  Namespace                   Name                              CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                              ------------  ----------  ---------------  -------------  ---
  kube-system                 cilium-jlwk9                      100m (0%)     0 (0%)      10Mi (0%)        0 (0%)         109m
  kube-system                 ebs-csi-node-xg9w9                30m (0%)      0 (0%)      120Mi (0%)       768Mi (0%)     109m
  secrets-store-csi-driver    secrets-store-csi-driver-4xc6s    70m (0%)      500m (3%)   140Mi (0%)       400Mi (0%)     109m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                200m (1%)   500m (3%)
  memory             270Mi (0%)  1168Mi (0%)
  ephemeral-storage  420Mi (0%)  0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
Events:              <none>

Note that we are using Cilium as the CNI. Under normal operation, Cilium removes the node.cilium.io/agent-not-ready taint from the node once the cilium-agent is running on it. The Cilium operator attempts to attach an additional ENI to the host via ec2:AttachNetworkInterface. AWS audit log entry below; notice the errorMessage:

{
    "errorCode": "Client.IncorrectState",
    "eventSource": [
      "ec2.amazonaws.com"
    ],
    "errorMessage.keyword": [
      "Instance 'i-078e295d1e5549ea3' is not 'running' or 'stopped'."
    ],
    "eventTime": [
      "2024-02-21T05:59:42.000Z"
    ],
    "errorCode.keyword": [
      "Client.IncorrectState"
    ],
    "requestParameters.instanceId": [
      "i-078e295d1e5549ea3"
    ],
    "errorMessage": [
      "Instance 'i-078e295d1e5549ea3' is not 'running' or 'stopped'."
    ],
}
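That Client.IncorrectState code is the signal a controller would have to key off to recognize this situation: the attach can only succeed once the instance is 'running' or 'stopped', which a forever-Pending instance never reaches. A small classification sketch over an audit record like the one above (our own helper, not Cilium's actual logic):

```python
# Sketch, not Cilium's actual logic: classify a CloudTrail record like the
# one above. Client.IncorrectState on AttachNetworkInterface means the
# instance is neither 'running' nor 'stopped', so the attach is blocked
# until the instance leaves 'pending' -- which this one never does.
def eni_attach_blocked(audit_event: dict) -> bool:
    codes = audit_event.get("errorCode.keyword") or []
    if isinstance(codes, str):
        codes = [codes]
    return "Client.IncorrectState" in codes

print(eni_attach_blocked({"errorCode.keyword": ["Client.IncorrectState"]}))  # True
```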

The strange thing is that the Pending instance seems to be working, sort of. Pods that use hostNetwork: true are able to run on this instance, and they appear to work. Kubelet reports that it is ready. Fetching logs from a pod running on the node fails, though: Error from server: Get "https://10.209.146.79:10250/containerLogs/kube-system/cilium-operator-5695bfbb6b-gm9ch/cilium-operator": remote error: tls: internal error

Expected Behavior: I'm not really sure, to be honest. The NodeClaim is stuck in Ready: false because Cilium is not removing the taint, since the operator is unable to attach an ENI to the instance. As the EC2 API reports the instance as Pending, I would expect Karpenter to mark the node as failed/not working and remove it.

So what I think should happen is that Karpenter marks EC2 nodes that have been in the Pending state for more than 15 minutes as not ready and decommissions them.
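That 15-minute rule could be expressed as a simple predicate over the describe-instances data. A hedged sketch; both the helper and the cutoff are our own illustration, not an existing Karpenter setting:

```python
from datetime import datetime, timedelta, timezone

# Sketch of the proposed rule: an instance counts as stuck if it has sat in
# 'pending' longer than a cutoff. The 15-minute default mirrors the proposal
# above; neither this helper nor such a setting exists in Karpenter today.
def is_stuck_pending(instance: dict, now: datetime,
                     cutoff: timedelta = timedelta(minutes=15)) -> bool:
    if instance["State"]["Name"] != "pending":
        return False
    launched = datetime.fromisoformat(instance["LaunchTime"])
    return now - launched > cutoff

# The instance above launched at 05:58:45 and was still pending well past that:
record = {"State": {"Name": "pending"}, "LaunchTime": "2024-02-21T05:58:45+00:00"}
print(is_stuck_pending(record, datetime(2024, 2, 21, 6, 20, tzinfo=timezone.utc)))  # True
```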

Reproduction Steps (Please include YAML):

Versions:

k8s-ci-robot commented 3 months ago

This issue is currently awaiting triage.

If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
toredash commented 3 months ago

I forgot to mention that this is a duplicate of https://github.com/aws/karpenter-provider-aws/issues/5706. After a dialogue with AWS Support, it was requested that this issue be filed against kubernetes/cloud-provider-aws.

olemarkus commented 3 months ago

CCM doesn't have any role to play in the lifecycle of an instance. I don't really see what CCM could do other than add further taints to the node, marking it not ready. I agree with you that the most reasonable way forward is to have Karpenter cordon, empty, and remove instances that are stuck too long in given states, optionally behind a flag for enabling/disabling this behavior, to address the concern that vital workloads may already have been deployed to such an instance. However, in your case it seems like even the Pods running on the faulty instance are not behaving properly, so I am not sure removing those instances is really that dangerous.
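The flow olemarkus describes (cordon, empty, remove, behind a flag) can be sketched as a tiny decision function. Everything here is hypothetical; no such flag exists in Karpenter at the time of writing:

```python
# Hypothetical sketch of the opt-in cleanup described above: once an
# instance has been judged stuck, the controller would cordon, drain, and
# then terminate it. The flag and step names are ours, not Karpenter's.
def plan_cleanup(stuck: bool, cleanup_enabled: bool) -> list:
    if not (stuck and cleanup_enabled):
        return []
    return ["cordon", "drain", "terminate"]

print(plan_cleanup(stuck=True, cleanup_enabled=True))   # ['cordon', 'drain', 'terminate']
print(plan_cleanup(stuck=True, cleanup_enabled=False))  # []
```

Gating on the flag first means operators who worry about evicting vital workloads can keep today's behavior unchanged.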

toredash commented 3 months ago

I agree @olemarkus, and I had a hunch this would be the response to my query as well. I'm in limbo here; I'll see what I can do to get attention from the Karpenter project directly.

cartermckinnon commented 3 months ago

I agree with @olemarkus; this isn't related to CCM. It should be tracked in the referenced Karpenter issue.