ibalat opened this issue 2 months ago
This issue is currently awaiting triage.
If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Even 6 hours later, the pods are still in "Terminating" status and the node is still "NotReady".
By the way, the instance is an m5.large. I also captured new instance console (stdout) logs:
[ 8080.945657] xfs filesystem being remounted at /var/lib/kubelet/pods/92d74e36-0bbb-40bb-9d92-d9daa4994369/volume-subpaths/config/mysql/1 supports timestamps until 2038 (0x7fffffff)
[ 8080.956982] xfs filesystem being remounted at /var/lib/kubelet/pods/92d74e36-0bbb-40bb-9d92-d9daa4994369/volume-subpaths/template-sql/mysql/2 supports timestamps until 2038 (0x7fffffff)
[ 8080.970168] xfs filesystem being remounted at /var/lib/kubelet/pods/92d74e36-0bbb-40bb-9d92-d9daa4994369/volume-subpaths/template-sql/mysql/3 supports timestamps until 2038 (0x7fffffff)
[ 8080.981712] xfs filesystem being remounted at /var/lib/kubelet/pods/92d74e36-0bbb-40bb-9d92-d9daa4994369/volume-subpaths/etl-sql/mysql/4 supports timestamps until 2038 (0x7fffffff)
[ 8080.993163] xfs filesystem being remounted at /var/lib/kubelet/pods/92d74e36-0bbb-40bb-9d92-d9daa4994369/volume-subpaths/prefera-sql/mysql/5 supports timestamps until 2038 (0x7fffffff)
[ 8112.949302] pci 0000:00:1d.0: [1d0f:8061] type 00 class 0x010802
[ 8112.952794] pci 0000:00:1d.0: reg 0x10: [mem 0x00000000-0x00003fff]
[ 8112.956559] pci 0000:00:1d.0: enabling Extended Tags
[ 8112.960301] pci 0000:00:1d.0: BAR 0: assigned [mem 0xc0114000-0xc0117fff]
[ 8112.964132] nvme nvme3: pci function 0000:00:1d.0
[ 8112.967238] nvme 0000:00:1d.0: enabling device (0000 -> 0002)
[ 8112.972352] PCI Interrupt Link [LNKA] enabled at IRQ 11
[ 8112.980317] nvme nvme3: 2/0/0 default/read/poll queues
[ 8113.229053] pci 0000:00:1c.0: [1d0f:8061] type 00 class 0x010802
[ 8113.232693] pci 0000:00:1c.0: reg 0x10: [mem 0x00000000-0x00003fff]
[ 8113.236424] pci 0000:00:1c.0: enabling Extended Tags
[ 8113.240326] pci 0000:00:1c.0: BAR 0: assigned [mem 0xc0118000-0xc011bfff]
[ 8113.244141] nvme nvme4: pci function 0000:00:1c.0
[ 8113.247190] nvme 0000:00:1c.0: enabling device (0000 -> 0002)
[ 8113.256918] nvme nvme4: 2/0/0 default/read/poll queues
[ 8113.573770] EXT4-fs (nvme3n1): mounted filesystem with ordered data mode. Opts: (null)
[ 8114.159309] IPv6: ADDRCONF(NETDEV_CHANGE): enia89b8c83c9a: link becomes ready
[ 8114.163261] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 8114.319775] xfs filesystem being remounted at /var/lib/kubelet/pods/177f12fc-a42d-464f-bdf0-ad1f53080f8b/volume-subpaths/scripts/kafka/2 supports timestamps until 2038 (0x7fffffff)
[ 8114.734723] EXT4-fs (nvme4n1): mounted filesystem with ordered data mode. Opts: (null)
[ 8114.972359] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 8115.074504] xfs filesystem being remounted at /var/lib/kubelet/pods/ad839521-93a0-4010-8b3f-0980d2375063/volume-subpaths/scripts/kafka/2 supports timestamps until 2038 (0x7fffffff)
[ 8119.176023] pci 0000:00:1b.0: [1d0f:8061] type 00 class 0x010802
[ 8119.179548] pci 0000:00:1b.0: reg 0x10: [mem 0x00000000-0x00003fff]
[ 8119.183285] pci 0000:00:1b.0: enabling Extended Tags
[ 8119.187010] pci 0000:00:1b.0: BAR 0: assigned [mem 0xc011c000-0xc011ffff]
[ 8119.190838] nvme nvme5: pci function 0000:00:1b.0
[ 8119.193879] nvme 0000:00:1b.0: enabling device (0000 -> 0002)
[ 8119.203356] nvme nvme5: 2/0/0 default/read/poll queues
[ 8120.146390] EXT4-fs (nvme5n1): mounted filesystem with ordered data mode. Opts: (null)
[ 8120.658980] IPv6: ADDRCONF(NETDEV_CHANGE): eni61bfec53e4d: link becomes ready
[ 8120.662926] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 8121.038972] xfs filesystem being remounted at /var/lib/kubelet/pods/9d3cd637-f153-4200-a302-04b9e60a273c/volume-subpaths/scripts/kafka/2 supports timestamps until 2038 (0x7fffffff)
[ 8299.855030] systemd-journald[537510]: File /var/log/journal/ec23eae178c2480d1224169d16678fc2/system.journal corrupted or uncleanly shut down, renaming and replacing.
If you're willing to try Karpenter 1.0 (newly released), you might see better behavior or diagnostics. I'd give it a go, honestly.
@sftim thanks for the suggestion, I'll try it. But why doesn't Karpenter or Kubernetes intervene in this situation? 18 hours have passed and the nodes are still NotReady with pods stuck in Terminating. Is there any parameter to force-terminate NotReady nodes? The ttlAfterNotRegistered parameter is deprecated, and my consolidateAfter: 5m config does not cover this situation :/
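For what it's worth, pods stuck in Terminating on a node whose kubelet has stopped responding can be cleared manually; a minimal sketch, with a hypothetical pod name and namespace:

# Force-remove a pod whose deletion the unresponsive kubelet can no longer
# confirm (pod name and namespace below are placeholders).
kubectl delete pod my-stuck-pod -n my-namespace --grace-period=0 --force

Deleting the Node object (or the NodeClaim backing it) has a similar effect for all pods on that node, since the pod garbage collector cleans up pods bound to a node that no longer exists.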
Hi @ibalat. From the information you have shared, it looks like the node registered but never became initialized. Karpenter handles registration failures by waiting 15 minutes for the node to register; if it doesn't, we go ahead and delete the nodeClaim. However, we still have an open issue for nodes that never initialize at all, which should be captured by https://github.com/kubernetes-sigs/karpenter/issues/750, where we hope to start by introducing a static TTL for initialization to kill off nodes that never go Ready on startup. Can you describe the nodeClaim for this node and share it? Can you also share the logs from the time this happened so that we can confirm that's the issue?
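A minimal sketch of the commands for gathering that information (the NodeClaim name, namespace, and label selector below are assumptions; adjust them to your install):

# Example NodeClaim name taken from the manifest shared later in this thread.
NODECLAIM=main-green-7rncx
# Inspect the NodeClaim and its conditions.
kubectl get nodeclaim "$NODECLAIM" -o yaml
kubectl describe nodeclaim "$NODECLAIM"
# Pull Karpenter controller logs from around the incident; the namespace and
# label selector depend on how the chart was installed.
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter --since=24h | grep "$NODECLAIM"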
Hi @jigisha620. Actually, the nodes did initialize: they become "Ready", pods get scheduled, and then after a while (~30-60 minutes) the node goes "NotReady". So they work properly for a while. I tried upgrading to v1.0.0 but the same problem still occurs. I am sharing my NodeClass, NodePool, and NodeClaim configs below. By the way, do you know why the pods keep waiting in the "Terminating" state? Can Kubernetes or Karpenter force-delete them after a while? Is there any config for that?
I also found some new events that may be related to this issue; their repeat counts are very high.
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: main
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  role: "KarpenterNodeRole"
  subnetSelectorTerms:
    %{~ for subnet in eks_dev_v1_subnet_ids ~}
    - id: "${subnet}"
    %{~ endfor ~}
  securityGroupSelectorTerms:
    - name: "*dev-v1-node*"
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: main-green
spec:
  template:
    metadata:
      labels:
        node-group-name: main-green
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: main
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: [ "r5", "m5", "c6i" ]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["2"]
      terminationGracePeriod: 5m
      expireAfter: 720h # 30 * 24h = 720h | periodically recycle nodes due to security concerns
  limits:
    cpu: 1000
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
annotations:
karpenter.k8s.aws/ec2nodeclass-hash: "17843341971500854913"
karpenter.k8s.aws/ec2nodeclass-hash-version: v3
creationTimestamp: "2024-08-15T10:59:10Z"
finalizers:
- karpenter.k8s.aws/termination
generation: 1
name: main
resourceVersion: "525655958"
uid: 742b9052-735a-4078-b2d3-bbfe0cf883e3
spec:
amiSelectorTerms:
- alias: al2023@latest
metadataOptions:
httpEndpoint: enabled
httpProtocolIPv6: disabled
httpPutResponseHopLimit: 1
httpTokens: required
role: KarpenterNodeRole
securityGroupSelectorTerms:
- name: '*dev-v1-node*'
subnetSelectorTerms:
- id: subnet-xx
- id: subnet-xx
- id: subnet-xx
status:
amis:
- id: ami-0d43f736643876936
name: amazon-eks-node-al2023-arm64-standard-1.30-v20240807
requirements:
- key: kubernetes.io/arch
operator: In
values:
- arm64
- key: karpenter.k8s.aws/instance-gpu-count
operator: DoesNotExist
- key: karpenter.k8s.aws/instance-accelerator-count
operator: DoesNotExist
- id: ami-0d694ee9037e1f937
name: amazon-eks-node-al2023-x86_64-standard-1.30-v20240807
requirements:
- key: kubernetes.io/arch
operator: In
values:
- amd64
- key: karpenter.k8s.aws/instance-gpu-count
operator: DoesNotExist
- key: karpenter.k8s.aws/instance-accelerator-count
operator: DoesNotExist
conditions:
- lastTransitionTime: "2024-08-15T10:59:11Z"
message: ""
reason: AMIsReady
status: "True"
type: AMIsReady
- lastTransitionTime: "2024-08-15T10:59:11Z"
message: ""
reason: InstanceProfileReady
status: "True"
type: InstanceProfileReady
- lastTransitionTime: "2024-08-15T10:59:11Z"
message: ""
reason: Ready
status: "True"
type: Ready
- lastTransitionTime: "2024-08-15T10:59:11Z"
message: ""
reason: SecurityGroupsReady
status: "True"
type: SecurityGroupsReady
- lastTransitionTime: "2024-08-15T10:59:11Z"
message: ""
reason: SubnetsReady
status: "True"
type: SubnetsReady
instanceProfile: dev-v1_xx
securityGroups:
- id: sg-xx
name: dev-v1-xx
- id: sg-xx
name: dev-v1-xx
subnets:
- id: subnet-xx
zone: eu-west-1c
zoneID: euw1-az2
- id: subnet-xx
zone: eu-west-1a
zoneID: euw1-az3
- id: subnet-xx
zone: eu-west-1b
zoneID: euw1-az1
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
annotations:
karpenter.sh/nodepool-hash: "14203437024067510703"
karpenter.sh/nodepool-hash-version: v3
creationTimestamp: "2024-08-15T10:55:03Z"
generation: 1
name: main-green
resourceVersion: "525888522"
uid: 5866c52d-bb13-479f-b034-822128ebc8f1
spec:
disruption:
budgets:
- nodes: 10%
consolidateAfter: 5m
consolidationPolicy: WhenEmptyOrUnderutilized
limits:
cpu: 1000
template:
metadata:
labels:
node-group-name: main-green
spec:
expireAfter: 720h
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: main
requirements:
- key: kubernetes.io/arch
operator: In
values:
- amd64
- key: kubernetes.io/os
operator: In
values:
- linux
- key: karpenter.sh/capacity-type
operator: In
values:
- spot
- key: karpenter.k8s.aws/instance-category
operator: In
values:
- c
- m
- r
- key: karpenter.k8s.aws/instance-family
operator: In
values:
- r5
- m5
- c6i
- key: karpenter.k8s.aws/instance-generation
operator: Gt
values:
- "2"
terminationGracePeriod: 5m
status:
conditions:
- lastTransitionTime: "2024-08-15T10:59:11Z"
message: ""
reason: NodeClassReady
status: "True"
type: NodeClassReady
- lastTransitionTime: "2024-08-15T10:59:11Z"
message: ""
reason: Ready
status: "True"
type: Ready
- lastTransitionTime: "2024-08-15T10:55:03Z"
message: ""
reason: ValidationSucceeded
status: "True"
type: ValidationSucceeded
resources:
cpu: "294"
ephemeral-storage: 417873520Ki
hugepages-1Gi: "0"
hugepages-2Mi: "0"
memory: 695806732Ki
nodes: "20"
pods: "2425"
apiVersion: karpenter.sh/v1
kind: NodeClaim
metadata:
annotations:
compatibility.karpenter.k8s.aws/cluster-name-tagged: "true"
compatibility.karpenter.k8s.aws/kubelet-drift-hash: "15379597991425564585"
karpenter.k8s.aws/ec2nodeclass-hash: "17843341971500854913"
karpenter.k8s.aws/ec2nodeclass-hash-version: v3
karpenter.k8s.aws/tagged: "true"
karpenter.sh/nodepool-hash: "14203437024067510703"
karpenter.sh/nodepool-hash-version: v3
creationTimestamp: "2024-08-15T12:05:33Z"
finalizers:
- karpenter.sh/termination
generateName: main-green-
generation: 1
labels:
karpenter.k8s.aws/instance-category: c
karpenter.k8s.aws/instance-cpu: "32"
karpenter.k8s.aws/instance-cpu-manufacturer: intel
karpenter.k8s.aws/instance-ebs-bandwidth: "10000"
karpenter.k8s.aws/instance-encryption-in-transit-supported: "true"
karpenter.k8s.aws/instance-family: c6i
karpenter.k8s.aws/instance-generation: "6"
karpenter.k8s.aws/instance-hypervisor: nitro
karpenter.k8s.aws/instance-memory: "65536"
karpenter.k8s.aws/instance-network-bandwidth: "12500"
karpenter.k8s.aws/instance-size: 8xlarge
karpenter.sh/capacity-type: spot
karpenter.sh/nodepool: main-green
kubernetes.io/arch: amd64
kubernetes.io/os: linux
node-group-name: main-green
node.kubernetes.io/instance-type: c6i.8xlarge
topology.k8s.aws/zone-id: euw1-az1
topology.kubernetes.io/region: eu-west-1
topology.kubernetes.io/zone: eu-west-1b
name: main-green-7rncx
ownerReferences:
- apiVersion: karpenter.sh/v1
blockOwnerDeletion: true
kind: NodePool
name: main-green
uid: 5866c52d-bb13-479f-b034-822128ebc8f1
resourceVersion: "525859504"
uid: bd1aea84-18be-4d42-9c17-3936137c89a5
spec:
expireAfter: 720h
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: main
requirements:
- key: kubernetes.io/arch
operator: In
values:
- amd64
- key: kubernetes.io/os
operator: In
values:
- linux
- key: karpenter.sh/capacity-type
operator: In
values:
- spot
- key: node.kubernetes.io/instance-type
operator: In
values:
- c6i.12xlarge
- c6i.16xlarge
- c6i.24xlarge
- c6i.32xlarge
- c6i.8xlarge
- c6i.metal
- m5.12xlarge
- m5.16xlarge
- m5.24xlarge
- m5.4xlarge
- m5.8xlarge
- m5.metal
- r5.12xlarge
- r5.16xlarge
- r5.24xlarge
- r5.4xlarge
- r5.8xlarge
- r5.metal
- key: node-group-name
operator: In
values:
- main-green
- key: karpenter.k8s.aws/instance-generation
operator: Gt
values:
- "2"
- key: karpenter.sh/nodepool
operator: In
values:
- main-green
- key: karpenter.k8s.aws/instance-category
operator: In
values:
- c
- m
- r
- key: karpenter.k8s.aws/instance-family
operator: In
values:
- c6i
- m5
- r5
resources:
requests:
cpu: 4280m
memory: 36152Mi
pods: "67"
terminationGracePeriod: 5m0s
status:
allocatable:
cpu: 31850m
ephemeral-storage: 17Gi
memory: 57691Mi
pods: "234"
vpc.amazonaws.com/pod-eni: "84"
capacity:
cpu: "32"
ephemeral-storage: 20Gi
memory: 60620Mi
pods: "234"
vpc.amazonaws.com/pod-eni: "84"
conditions:
- lastTransitionTime: "2024-08-15T12:15:35Z"
message: ""
reason: ConsistentStateFound
status: "True"
type: ConsistentStateFound
- lastTransitionTime: "2024-08-15T15:46:53Z"
message: ""
reason: Consolidatable
status: "True"
type: Consolidatable
- lastTransitionTime: "2024-08-15T12:06:14Z"
message: ""
reason: Initialized
status: "True"
type: Initialized
- lastTransitionTime: "2024-08-15T12:05:35Z"
message: ""
reason: Launched
status: "True"
type: Launched
- lastTransitionTime: "2024-08-15T12:06:14Z"
message: ""
reason: Ready
status: "True"
type: Ready
- lastTransitionTime: "2024-08-15T12:06:04Z"
message: ""
reason: Registered
status: "True"
type: Registered
imageID: ami-0d694ee9037e1f937
lastPodEventTime: "2024-08-15T15:41:53Z"
nodeName: ip-10-xx-xx-xx.eu-west-1.compute.internal
providerID: aws:///eu-west-1b/i-xxxxxx
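The conditions above show the claim as Launched, Registered, Initialized, and Ready, which matches the report that the node works for a while before going NotReady. A quick way to keep an eye on this across the pool (a sketch; the printer columns shown by kubectl get depend on the installed Karpenter version):

# List NodeClaims with their readiness, then dump one claim's conditions.
kubectl get nodeclaims
kubectl get nodeclaim main-green-7rncx -o jsonpath='{.status.conditions}'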
I think the snippet you shared with "No allowed disruptions for disruption reason" is not the problem here. The nodes you have were already in a NotReady state, so they will not be considered for disruption. Can you share Karpenter controller logs from the same time?
Sure. Between 05:58:24 and 06:09:12, 3 nodes became NotReady and I watched them live, but there is no related log :( You can see all the logs between those times:
{"level":"INFO","time":"2024-08-16T05:58:24.287Z","logger":"controller","message":"created nodeclaim",
{"level":"INFO","time":"2024-08-16T05:58:26.268Z","logger":"controller","message":"launched nodeclaim",
{"level":"INFO","time":"2024-08-16T05:58:54.219Z","logger":"controller","message":"pod(s) have a preferred Anti-Affinity which can prevent consolidation",
{"level":"INFO","time":"2024-08-16T05:58:54.360Z","logger":"controller","message":"found provisionable pod(s)",
{"level":"INFO","time":"2024-08-16T05:58:54.360Z","logger":"controller","message":"computed new nodeclaim(s) to fit pod(s)",
{"level":"INFO","time":"2024-08-16T05:58:54.360Z","logger":"controller","message":"computed 1 unready node(s) will fit 1 pod(s)",
{"level":"INFO","time":"2024-08-16T05:58:54.376Z","logger":"controller","message":"created nodeclaim",
{"level":"INFO","time":"2024-08-16T05:58:56.599Z","logger":"controller","message":"deleted node",
{"level":"INFO","time":"2024-08-16T05:58:56.870Z","logger":"controller","message":"launched nodeclaim",
{"level":"INFO","time":"2024-08-16T05:58:56.902Z","logger":"controller","message":"deleted nodeclaim",
{"level":"INFO","time":"2024-08-16T05:59:19.838Z","logger":"controller","message":"registered nodeclaim",
{"level":"INFO","time":"2024-08-16T05:59:20.169Z","logger":"controller","message":"registered nodeclaim",
{"level":"INFO","time":"2024-08-16T05:59:24.803Z","logger":"controller","message":"pod(s) have a preferred Anti-Affinity which can prevent consolidation",
{"level":"INFO","time":"2024-08-16T05:59:37.493Z","logger":"controller","message":"initialized nodeclaim",
{"level":"INFO","time":"2024-08-16T05:59:38.378Z","logger":"controller","message":"initialized nodeclaim",
{"level":"INFO","time":"2024-08-16T05:59:49.497Z","logger":"controller","message":"deleted node",
{"level":"INFO","time":"2024-08-16T05:59:49.706Z","logger":"controller","message":"deleted nodeclaim",
{"level":"INFO","time":"2024-08-16T06:08:45.766Z","logger":"controller","message":"found provisionable pod(s)",
{"level":"INFO","time":"2024-08-16T06:08:45.766Z","logger":"controller","message":"computed new nodeclaim(s) to fit pod(s)",
{"level":"INFO","time":"2024-08-16T06:08:45.777Z","logger":"controller","message":"created nodeclaim",
{"level":"INFO","time":"2024-08-16T06:08:48.176Z","logger":"controller","message":"launched nodeclaim",
{"level":"INFO","time":"2024-08-16T06:09:12.703Z","logger":"controller","message":"registered nodeclaim",
New update: the node that cannot be deleted (even though terminationGracePeriod is 5m and far more time has passed) shows some events; maybe they can help.
The node's nodeClaim has the events below.
The pods on the node are stuck in the "Terminating" state and have no events or anything of note in their describe output.
After I deleted the nodeClaim manually, the node was deleted (but well past the grace period).
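For completeness, that manual workaround is just a delete against the NodeClaim; a sketch using the claim name shared earlier:

# Deleting the NodeClaim triggers Karpenter's termination flow via the
# karpenter.sh/termination finalizer: drain (as far as the broken node allows),
# terminate the EC2 instance, then remove the Node object.
kubectl delete nodeclaim main-green-7rncx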
terminationGracePeriod would not take effect if delete has not been called against the nodeClaim. In your case the node went to a NotReady state, but nothing initiated its deletion. I was able to reproduce something similar on my end where my node becomes NotReady due to "Kubelet stopped posting node status"; however, the pods got rescheduled onto a different node. That makes me wonder whether the pods you are running have a pre-stop hook that's preventing them from terminating?
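As background on the eviction side: the DefaultTolerationSeconds admission plugin adds NoExecute tolerations like the ones below to every pod, so the taint manager deletes pods roughly 5 minutes after a node goes NotReady or unreachable; because the kubelet on the broken node can never confirm those deletions, the pods then sit in Terminating until the Node/NodeClaim is removed or they are force-deleted. A sketch of those defaults (values can be overridden per cluster or per pod):

# Default tolerations injected into pods by the DefaultTolerationSeconds
# admission plugin (300s is the upstream default).
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300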
No pre-stop hook, finalizer, or anything else. They just wait, as in the screenshots.
This is not necessarily an issue with Karpenter. To investigate further, we will have to look at the kubelet logs to understand why the pods remained stuck in Terminating. Since you are using an EKS AMI, you can run a script on your worker node under /etc/eks called log-collector-script, which would help us get the kubelet logs. If you have AWS premium support you can open a ticket to investigate those logs, or you can send them over and I can try looking into them.
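A sketch of running that collector, assuming it sits at the usual location on the EKS-optimized AMIs (the exact path can differ by AMI version):

# Run on the affected worker node (e.g. via SSM); writes a support bundle
# with kubelet, containerd, and system logs, typically under /var/log.
sudo bash /etc/eks/log-collector-script/eks-log-collector.sh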
When it happens I can't log in to the EC2 instance; it doesn't respond. But I could get the console (stdout) output, shown below.
[ 423.390531] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod1344160e_dca0_4e9d_be15_ea0b63efb5b2.slice/cri-containerd-496edffa072b6d7835989a0dfbce3c30711a32903c757baf4fcd460c9479f3a8.scope,task=java,pid=22199,uid=1001
[ 423.412634] Out of memory: Killed process 22199 (java) total-vm:3657848kB, anon-rss:338056kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:964kB oom_score_adj:1000
[ 425.563371] oom_reaper: reaped process 22199 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
2024-08-14T13:38:15+00:00
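Not necessarily the root cause, but worth noting: the cgroup path in that oom-kill line (kubepods-besteffort.slice, oom_score_adj:1000) means the killed java container had no CPU or memory requests, so it sits in the BestEffort QoS class and is the first thing the kernel sacrifices under node memory pressure. A sketch of the container resources that would move it into Burstable/Guaranteed instead (sizes are placeholders and must be tuned to the JVM heap):

# Hypothetical container resources; values below are illustrative only.
resources:
  requests:
    cpu: 250m
    memory: 1Gi
  limits:
    memory: 1Gi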
We see this far too often.
@ibalat @suraj2410
What do the disk IOPS, disk idle time, and memory metrics look like for the affected hosts? Could this be the problem described in https://github.com/bottlerocket-os/bottlerocket/issues/4075#issuecomment-2319361813? (applicable to Bottlerocket, but also observed with AL2).
I had removed Karpenter and reinstalled Cluster Autoscaler, but I can test it again this week. After the test, I will share the results with you.
Description
Observed Behavior:
{"level":"INFO","time":"2024-08-14T12:13:23.794Z","logger":"controller","message":"pod xxxx has a preferred Anti-Affinity which can prevent consolidation","commit":"490ef94","controller":"provisioner"}
Expected Behavior:
Reproduction Steps (Please include YAML): I have no idea; it occurs periodically.
Versions:
Kubernetes Version (kubectl version): 1.30