kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.

Karpenter unnecessarily churning pods at steady state, performing incorrect consolidations, likely due to stockouts/race condition #1293

Open bguo-figma opened 3 months ago

bguo-figma commented 3 months ago

Description

Observed Behavior: Our cluster has one Provisioner that attempts to provision two instance types, x2gd.4xlarge and r6g.8xlarge. It prefers (via cost) to provision x2gd.4xlarge, but due to persistent stockouts in us-west-2, roughly half of our 600-node fleet is using x2gds and half is using r6gs. Each node runs a single pod (ignoring daemonsets); all pods are configured identically and come from the same replicaset, and each node only has capacity to fit one such pod based on resource requests.

At steady state (no changing workloads), Karpenter appears to reschedule pods and provision new nodes without good reason, leading to excessive pod churn and more expensive node allocation. More specifically, we see:

disrupting via consolidation replace, terminating 1 candidates ip-10-21-175-111.us-west-2.compute.internal/r6g.8xlarge/on-demand and replacing with on-demand node from types x2gd.4xlarge

Followed by `created machine`, `created launch template`, `removing offering from offerings`, `creating instance, insufficient capacity, with fleet error(s), InsufficientInstanceCapacity: We currently do not have sufficient x2gd.4xlarge capacity in zones with support for 'gp3' volumes. Our system will be working on provisioning additional capacity.`, and `deleted machine`.

All tagged with machine: multiplayer-provisioner-arm-xshpb, instance-types: x2gd.4xlarge

I then see found provisionable pod(s) tagged with figma-apps-multiplayer-production/multiplayer-d69dd575d-ks9br. multiplayer-d69dd575d-ks9br is the pod located on ip-10-21-175-111.us-west-2.compute.internal.

A series of messages next indicates that Karpenter provisioned a new r6g.8xlarge node: `created machine`, `launched machine`, `registered machine`, `deleted launch template`, `initialized machine`, all tagged with machine: multiplayer-provisioner-arm-7tmrl, instance-types: r6g.8xlarge, node: ip-10-21-186-228.us-west-2.compute.internal

Between launched and registered, I see the following log: `disrupting via "consolidation", disrupting candidates, launching replacement, timed out checking node readiness, getting machine, Machine.karpenter.sh "multiplayer-provisioner-arm-xshpb" not found`. I believe this message is a red herring: it looks like Karpenter attempted to health-check a machine that never existed, multiplayer-provisioner-arm-xshpb (the machine was 'created' but failed to provision).

After all of that, I very weirdly now see this message: disrupting via consolidation delete, terminating 1 candidates ip-10-21-148-65.us-west-2.compute.internal/x2gd.4xlarge/on-demand.

Before this point, ip-10-21-148-65.us-west-2.compute.internal has not been mentioned at all. That node has an expiry of 30 days, and was provisioned 3 days ago. This node then proceeds to evict the pod running on it (and that pod schedules onto ip-10-21-66-133.us-west-2.compute.internal) and gets deleted. The original node (ip-10-21-175-111.us-west-2.compute.internal) that I believe initiated the replacement is never mentioned again, and the pod originally on that node (and that was mentioned as a provisionable pod) continues to peacefully run on that node.

I believe this could be caused by a race condition between two cycles (a consolidation delete cycle and a consolidation replace cycle), though documentation suggests that what I see shouldn't be happening. More specifically, the hunch is that

  1. We try to consolidate-replace an r6g.8xlarge with an x2gd.4xlarge
  2. Karpenter attempts to provision an x2gd.4xlarge
  3. Karpenter fails to do so, and provisions an r6g.8xlarge
  4. An x2gd.4xlarge undergoes consolidation-delete, since "A node is eligible for deletion if all of its pods can run on free capacity of other nodes in the cluster." and there happens to be brand new free capacity via r6g.8xlarge. This happens even though Karpenter should prefer to disrupt "Nodes running fewer pods", and the new r6g.8xlarge isn't running any pods at all (since it's brand new).

Step 3 is mildly unexpected (why provision an r6g.8xlarge to replace an r6g.8xlarge?), though not terrible as long as the new node is deleted right after, and Step 4 is very unexpected (why move your workload, for no reason, onto a more expensive node?).

Likely due to the size of the cluster and the 24/7 stockout, we see this unnecessary churn happen consistently at steady state (i.e., over a weekend with no deploys or workload changes).

Expected Behavior:

Karpenter should only attempt to provision instances that are cheaper than the candidate as part of consolidation-replace, and should only disrupt a node if its workload can move to a cheaper node. In other words, Karpenter should only attempt to consolidate pods off of r6g nodes, not off of x2gds, since those are already running on the cheapest available hardware. It should not spuriously spin up nodes and churn pods.
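
To make that expectation concrete, here is a minimal Go sketch of the two guards described above (replace only with something strictly cheaper; delete only when the destination is no more expensive). All type and function names and the prices are illustrative; this is not Karpenter's consolidation code.

```go
package main

import "fmt"

// candidate is an illustrative stand-in for a node that consolidation is
// evaluating; the prices are example figures, not real AWS list prices.
type candidate struct {
	name         string
	instanceType string
	hourlyPrice  float64
}

// cheaperReplacementExists models the first expectation: consolidation
// replace should only proceed when some launchable instance type is strictly
// cheaper than the node being retired.
func cheaperReplacementExists(c candidate, launchablePrices map[string]float64) bool {
	for _, price := range launchablePrices {
		if price < c.hourlyPrice {
			return true
		}
	}
	return false
}

// deleteIsWorthwhile models the second expectation: consolidation delete
// should not move pods from the cheapest viable hardware onto spare capacity
// that costs more.
func deleteIsWorthwhile(c candidate, destination candidate) bool {
	return destination.hourlyPrice <= c.hourlyPrice
}

func main() {
	x2gd := candidate{name: "ip-10-21-148-65", instanceType: "x2gd.4xlarge", hourlyPrice: 1.34}
	r6g := candidate{name: "ip-10-21-186-228", instanceType: "r6g.8xlarge", hourlyPrice: 1.61}

	// With x2gd.4xlarge stocked out, the only launchable type is r6g.8xlarge,
	// which is not cheaper, so no consolidation replace should be attempted.
	fmt.Println("replace x2gd?", cheaperReplacementExists(x2gd, map[string]float64{r6g.instanceType: r6g.hourlyPrice}))

	// Moving the pod from the x2gd onto the freshly launched r6g raises cost,
	// so the consolidation delete seen in the logs should be rejected too.
	fmt.Println("delete x2gd?", deleteIsWorthwhile(x2gd, r6g))
}
```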

Reproduction Steps (Please include YAML): Provisioner:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  annotations:
  labels:
  name: multiplayer-provisioner-arm
spec:
  consolidation:
    enabled: true
  limits:
    resources:
      memory: 75000Gi
  providerRef:
    name: multiplayer-arm-template
  requirements:
  - key: eks.figma.io/service_provisioner
    operator: In
    values:
    - multiplayer-provisioner
  - key: karpenter.k8s.aws/instance-family
    operator: In
    values:
    - x2gd
    - r6g
  - key: karpenter.k8s.aws/instance-memory
    operator: In
    values:
    - "262144"
  - key: karpenter.k8s.aws/instance-hypervisor
    operator: In
    values:
    - nitro
  - key: topology.kubernetes.io/zone
    operator: In
    values:
    - us-west-2a
    - us-west-2b
    - us-west-2c
  - key: kubernetes.io/arch
    operator: In
    values:
    - arm64
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - on-demand
  - key: kubernetes.io/os
    operator: In
    values:
    - linux
  ttlSecondsUntilExpired: 2592000
  weight: 100

Replicaset: Any sort of moderately large replicaset, large enough to force constant stockout of x2gds.

Versions:

tzneal commented 3 months ago

  1. We try to consolidate-replace an r6g.8xlarge with an x2gd.4xlarge
  2. Karpenter attempts to provision an x2gd.4xlarge
  3. Karpenter fails to do so, and provisions an r6g.8xlarge
  4. An x2gd.4xlarge undergoes consolidation-delete, since "A node is eligible for deletion if all of its pods can run on free capacity of other nodes in the cluster." and there happens to be brand new free capacity via r6g.8xlarge. This happens even though Karpenter should prefer to disrupt "Nodes running fewer pods", and the new r6g.8xlarge isn't running any pods at all (since it's brand new).

Can you provide the full set of Karpenter logs for the time period when this occurred?

tzneal commented 3 months ago

/triage needs-information

bguo-figma commented 3 months ago
{"level":"INFO","time":"2024-06-03T09:14:55.888Z","logger":"controller.disruption","message":"disrupting via consolidation replace, terminating 1 candidates ip-10-21-175-111.us-west-2.compute.internal/r6g.8xlarge/on-demand and replacing with on-demand node from types x2gd.4xlarge","commit":"cc85441"}
{"level":"INFO","time":"2024-06-03T09:14:55.910Z","logger":"controller.disruption","message":"created machine","commit":"cc85441","provisioner":"multiplayer-provisioner-arm","machine":"multiplayer-provisioner-arm-xshpb","requests":{"cpu":"12649m","memory":"210505Mi","pods":"7","vpc.amazonaws.com/pod-eni":"2"},"instance-types":"x2gd.4xlarge"}
{"level":"DEBUG","time":"2024-06-03T09:14:56.079Z","logger":"controller.machine.lifecycle","message":"created launch template","commit":"cc85441","machine":"multiplayer-provisioner-arm-xshpb","provisioner":"multiplayer-provisioner-arm","launch-template-name":"karpenter.k8s.aws/3784564906733431822","id":"lt-0392586d3f9dc6ecb"}
{"level":"DEBUG","time":"2024-06-03T09:14:56.956Z","logger":"controller.machine.lifecycle","message":"removing offering from offerings","commit":"cc85441","machine":"multiplayer-provisioner-arm-xshpb","provisioner":"multiplayer-provisioner-arm","reason":"InsufficientInstanceCapacity","instance-type":"x2gd.4xlarge","zone":"us-west-2b","capacity-type":"on-demand","ttl":"3m0s"}
{"level":"ERROR","time":"2024-06-03T09:14:56.957Z","logger":"controller.machine.lifecycle","message":"creating instance, insufficient capacity, with fleet error(s), InsufficientInstanceCapacity: We currently do not have sufficient x2gd.4xlarge capacity in zones with support for 'gp3' volumes. Our system will be working on provisioning additional capacity.","commit":"cc85441","machine":"multiplayer-provisioner-arm-xshpb","provisioner":"multiplayer-provisioner-arm"}
{"level":"INFO","time":"2024-06-03T09:14:56.977Z","logger":"controller.machine.termination","message":"deleted machine","commit":"cc85441","machine":"multiplayer-provisioner-arm-xshpb","provisioner":"multiplayer-provisioner-arm","node":"","provider-id":""}
{"level":"DEBUG","time":"2024-06-03T09:14:57.463Z","logger":"controller.provisioner","message":"3 out of 774 instance types were excluded because they would breach limits","commit":"cc85441","provisioner":"multiplayer-provisioner-arm"}
{"level":"INFO","time":"2024-06-03T09:14:57.489Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"cc85441","pods":"figma-apps-multiplayer-production/multiplayer-d69dd575d-ks9br, figma-apps-dagster-cloud-agent-production/entity-sync-production-898b81-6f8cfd594b-4r2hx","duration":"38.526593ms"}
{"level":"INFO","time":"2024-06-03T09:14:57.489Z","logger":"controller.provisioner","message":"computed new nodeclaim(s) to fit pod(s)","commit":"cc85441","nodeclaims":1,"pods":1}
{"level":"INFO","time":"2024-06-03T09:14:57.489Z","logger":"controller.provisioner","message":"computed 1 unready node(s) will fit 1 pod(s)","commit":"cc85441"}
{"level":"INFO","time":"2024-06-03T09:14:57.498Z","logger":"controller.provisioner","message":"created machine","commit":"cc85441","provisioner":"multiplayer-provisioner-arm","machine":"multiplayer-provisioner-arm-7tmrl","requests":{"cpu":"12649m","memory":"210505Mi","pods":"7","vpc.amazonaws.com/pod-eni":"2"},"instance-types":"r6g.8xlarge"}
{"level":"INFO","time":"2024-06-03T09:14:58.975Z","logger":"controller.machine.lifecycle","message":"launched machine","commit":"cc85441","machine":"multiplayer-provisioner-arm-7tmrl","provisioner":"multiplayer-provisioner-arm","provider-id":"aws:///us-west-2b/i-0e4f00dc2ea67eb05","instance-type":"r6g.8xlarge","zone":"us-west-2b","capacity-type":"on-demand","allocatable":{"cpu":"31850m","ephemeral-storage":"53Gi","memory":"239495Mi","pods":"234","vpc.amazonaws.com/pod-eni":"54"}}
{"level":"ERROR","time":"2024-06-03T09:15:02.007Z","logger":"controller.disruption","message":"disrupting via \"consolidation\", disrupting candidates, launching replacement, timed out checking node readiness, getting machine, Machine.karpenter.sh \"multiplayer-provisioner-arm-xshpb\" not found","commit":"cc85441"}
{"level":"INFO","time":"2024-06-03T09:15:09.669Z","logger":"controller.machine.lifecycle","message":"initialized machine","commit":"cc85441","machine":"dagster-provisioner-b6b2j","provisioner":"dagster-provisioner","provider-id":"aws:///us-west-2b/i-0fe56e9b0d31a4cd0","node":"ip-10-21-182-175.us-west-2.compute.internal"}
{"level":"DEBUG","time":"2024-06-03T09:15:19.677Z","logger":"controller.disruption","message":"discovered subnets","commit":"cc85441","subnets":["subnet-01ffe8ce304002370 (us-west-2b)","subnet-0c0bc0462f162ab5c (us-west-2a)","subnet-0ec8adb3e7790be7c (us-west-2c)"]}
{"level":"DEBUG","time":"2024-06-03T09:15:22.492Z","logger":"controller.machine.lifecycle","message":"registered machine","commit":"cc85441","machine":"multiplayer-provisioner-arm-7tmrl","provisioner":"multiplayer-provisioner-arm","provider-id":"aws:///us-west-2b/i-0e4f00dc2ea67eb05","node":"ip-10-21-186-228.us-west-2.compute.internal"}
{"level":"DEBUG","time":"2024-06-03T09:15:40.189Z","logger":"controller","message":"deleted launch template","commit":"cc85441","id":"lt-02a1e65110c8a9a06","name":"karpenter.k8s.aws/15611180400545724544"}
{"level":"INFO","time":"2024-06-03T09:15:43.129Z","logger":"controller.machine.lifecycle","message":"initialized machine","commit":"cc85441","machine":"multiplayer-provisioner-arm-7tmrl","provisioner":"multiplayer-provisioner-arm","provider-id":"aws:///us-west-2b/i-0e4f00dc2ea67eb05","node":"ip-10-21-186-228.us-west-2.compute.internal"}
{"level":"INFO","time":"2024-06-03T09:16:00.486Z","logger":"controller.disruption","message":"disrupting via consolidation delete, terminating 1 candidates ip-10-21-148-65.us-west-2.compute.internal/x2gd.4xlarge/on-demand","commit":"cc85441"}
{"level":"INFO","time":"2024-06-03T09:16:00.526Z","logger":"controller.node.termination","message":"tainted node","commit":"cc85441","node":"ip-10-21-148-65.us-west-2.compute.internal"}
{"level":"DEBUG","time":"2024-06-03T09:16:19.881Z","logger":"controller.node.termination","message":"discovered subnets","commit":"cc85441","node":"ip-10-21-148-65.us-west-2.compute.internal","id":"i-01cbcc5265365baec","subnets":["subnet-01ffe8ce304002370 (us-west-2b)","subnet-0c0bc0462f162ab5c (us-west-2a)","subnet-0ec8adb3e7790be7c (us-west-2c)"]}
{"level":"DEBUG","time":"2024-06-03T09:16:40.191Z","logger":"controller","message":"deleted launch template","commit":"cc85441","id":"lt-0392586d3f9dc6ecb","name":"karpenter.k8s.aws/3784564906733431822"}
{"level":"DEBUG","time":"2024-06-03T09:17:20.682Z","logger":"controller.node.termination","message":"discovered subnets","commit":"cc85441","node":"ip-10-21-148-65.us-west-2.compute.internal","id":"i-01cbcc5265365baec","subnets":["subnet-0c0bc0462f162ab5c (us-west-2a)","subnet-0ec8adb3e7790be7c (us-west-2c)","subnet-01ffe8ce304002370 (us-west-2b)"]}
{"level":"INFO","time":"2024-06-03T09:17:34.296Z","logger":"controller.node.termination","message":"deleted node","commit":"cc85441","node":"ip-10-21-148-65.us-west-2.compute.internal"}
{"level":"INFO","time":"2024-06-03T09:17:34.625Z","logger":"controller.machine.termination","message":"deleted machine","commit":"cc85441","machine":"multiplayer-provisioner-arm-q45kc","provisioner":"multiplayer-provisioner-arm","node":"ip-10-21-148-65.us-west-2.compute.internal","provider-id":"aws:///us-west-2b/i-01cbcc5265365baec"}
tzneal commented 3 months ago

Looking at the logs, I don't see your step 3 occurring as described.

  1. Karpenter fails to do so, and provisions an r6g.8xlarge
{"level":"INFO","time":"2024-06-03T09:14:57.489Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"cc85441","pods":"figma-apps-multiplayer-production/multiplayer-d69dd575d-ks9br, figma-apps-dagster-cloud-agent-production/entity-sync-production-898b81-6f8
cfd594b-4r2hx","duration":"38.526593ms"}

Karpenter has not deleted the ip-10-21-175-111.us-west-2.compute.internal/r6g.8xlarge instance at that point since the launch failed.

I think what happened is:

  1. We try to consolidate-replace an r6g.8xlarge with an x2gd.4xlarge
  2. Karpenter attempts to provision an x2gd.4xlarge and fails.
  3. Two new pods arrive and are pending; Karpenter thinks one will schedule against the x2gd.4xlarge that failed to launch (a race), and one needs a new node, so it launches the r6g.8xlarge since it knows that x2gd.4xlarge is currently unavailable.
    {"level":"INFO","time":"2024-06-03T09:14:57.489Z","logger":"controller.provisioner","message":"computed new nodeclaim(s) to fit pod(s)","commit":"cc85441","nodeclaims":1,"pods":1}
    {"level":"INFO","time":"2024-06-03T09:14:57.489Z","logger":"controller.provisioner","message":"computed 1 unready node(s) will fit 1 pod(s)","commit":"cc85441"}
    {"level":"INFO","time":"2024-06-03T09:14:57.498Z","logger":"controller.provisioner","message":"created machine","commit":"cc85441","provisioner":"multiplayer-provisioner-arm","machine":"multiplayer-provisioner-arm-7tmrl","requests":{"cpu":"12649m","memory":"210505Mi","pods":"7","vpc.amazonaws.com/pod-eni":"2"},"instance-types":"r6g.8xlarge"}

It looks like this was an unlucky race between a failed node launch and pending pods arriving.
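
For illustration, a small Go sketch of the kind of race described here: if the scheduling simulation still counts the failed x2gd.4xlarge machine as usable in-flight capacity, one pending pod "fits" there and only the other triggers a new node. The types and numbers are made up; this is not Karpenter's scheduler.

```go
package main

import "fmt"

// inFlightNode is an illustrative stand-in for capacity Karpenter has asked
// for but that is not ready yet (a machine that was just created).
type inFlightNode struct {
	name     string
	freePods int
	launched bool // false if the Create call returned InsufficientInstanceCapacity
}

// simulate packs pending pods onto in-flight capacity first and returns how
// many brand-new nodes would be requested for the remainder (one pod per node
// in this workload). If the failed x2gd machine is still counted as in-flight
// when the loop runs, one pod "fits" there and only one new node is requested,
// the r6g.8xlarge seen in the logs, even though that x2gd never materializes.
func simulate(pendingPods int, inFlight []inFlightNode) (packed, newNodes int) {
	for i := range inFlight {
		// The race: the launch failure has not been reflected in local state
		// yet, so the simulation still treats this machine as usable capacity.
		for pendingPods > 0 && inFlight[i].freePods > 0 {
			inFlight[i].freePods--
			pendingPods--
			packed++
		}
	}
	return packed, pendingPods
}

func main() {
	stale := inFlightNode{name: "multiplayer-provisioner-arm-xshpb", freePods: 1, launched: false}
	packed, newNodes := simulate(2, []inFlightNode{stale})
	fmt.Printf("packed onto in-flight capacity: %d, new nodes to launch: %d\n", packed, newNodes)
}
```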

bguo-figma commented 3 months ago

I don't think I agree.

Two new pods arrive and are pending

{"level":"INFO","time":"2024-06-03T09:14:57.489Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"cc85441","pods":"figma-apps-multiplayer-production/multiplayer-d69dd575d-ks9br, figma-apps-dagster-cloud-agent-production/entity-sync-production-898b81-6f8

In this log line, multiplayer-d69dd575d-ks9br was already located on ip-10-21-175-111.us-west-2.compute.internal/r6g.8xlarge (confirmed via observability outside of Karpenter) prior to all these logs, and had been for a day or two. The dagster-cloud-agent pod is unrelated and affects a different provisioner.

In addition, the workload is again a static workload. The replicaset isn't subject to any sort of autoscaling, and the Provisioner we use is dedicated specifically to that workload, so there are no other ReplicaSets in the mix that could be causing issues. The only time there would be a pending pod is if we deleted the node that the pod was currently running on.

This also still doesn't explain why Karpenter deleted an x2gd.4xlarge.

tzneal commented 3 months ago

If you submit a support ticket and include the cluster ARN, I can look at the audit logs for more information on the Pods.

bguo-figma commented 3 months ago

Sounds good, support case submitted

tzneal commented 3 months ago

Alright, I'll close this and work it via the support ticket.

jonathan-innis commented 2 months ago

/reopen

After some discussion, we came to the conclusion that we need to update the behavior of our orchestration queue and how it reacts to self-deletes that happen when an ICE is received during a cloudprovider Create call. What appears to be happening here is that we are self-deleting nodes in the NodeClaim lifecycle controller and seeing that we no longer have a replacement in the orchestration queue [here](https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/disruption/orchestration/queue.go#L243). When this happens, there is a race where the provisioning controller thinks that there is now capacity that needs to be provisioned for the pods that are on the candidate node that is still MarkedForDeletion, causing us to launch a node that we actually won't need when we UnmarkForDeletion one second later.

Effectively, we need to ensure that the UnmarkForDelete that we perform against our candidate nodes happens before we self-delete the new NodeClaim due to an ICE. This will prevent us from re-simulating the pods on the MarkedForDeletion node in the provisioning loop.
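
A rough, self-contained Go sketch of that ordering, with invented types and helper names (not Karpenter's orchestration queue code): unmark the candidates first, then delete the NodeClaim that hit the ICE.

```go
package main

import (
	"context"
	"fmt"
)

// Illustrative mocks; these are not Karpenter's real types.
type nodeClaim struct{ name string }

type clusterState struct{ markedForDeletion map[string]bool }

// UnmarkForDeletion hands candidate nodes back to cluster state so the
// provisioner simulates their pods as already scheduled.
func (c *clusterState) UnmarkForDeletion(nodes ...string) {
	for _, n := range nodes {
		delete(c.markedForDeletion, n)
	}
}

type command struct {
	candidates  []string   // nodes being consolidated away
	replacement *nodeClaim // the replacement that just hit an ICE
}

// onReplacementICE sketches the ordering constraint: unmark the candidates
// before deleting the failed replacement, so a concurrent provisioning loop
// never observes "pods on MarkedForDeletion nodes with no replacement in
// flight" and launches a node that won't be needed.
func onReplacementICE(ctx context.Context, state *clusterState, cmd command,
	deleteNodeClaim func(context.Context, *nodeClaim) error) error {
	state.UnmarkForDeletion(cmd.candidates...)   // step 1: release the candidates
	return deleteNodeClaim(ctx, cmd.replacement) // step 2: then remove the failed NodeClaim
}

func main() {
	state := &clusterState{markedForDeletion: map[string]bool{
		"ip-10-21-175-111.us-west-2.compute.internal": true,
	}}
	cmd := command{
		candidates:  []string{"ip-10-21-175-111.us-west-2.compute.internal"},
		replacement: &nodeClaim{name: "multiplayer-provisioner-arm-xshpb"},
	}
	_ = onReplacementICE(context.Background(), state, cmd,
		func(_ context.Context, nc *nodeClaim) error {
			fmt.Println("deleting failed NodeClaim:", nc.name)
			return nil
		})
	fmt.Println("nodes still marked for deletion:", len(state.markedForDeletion))
}
```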

We came to the conclusion that we will pull the self-delete logic out of the NodeClaim lifecycle controller and create a separate controller to monitor ICE errors. We'll track internal state on whether the NodeClaim is currently safe to self-delete: a NodeClaim will be blocked from self-deleting while the disruption controller is operating against it, and as soon as the disruption controller is done with it, it will unmark it, allowing it to be deleted due to an ICE. Normal provisioning flows will work as they do today, where an ICE will be received and a self-delete will occur.

This solution has the added benefit of better supporting manual NodeClaim creation scenarios, which are a bit awkward today due to this self-delete behavior (this is similar to a controllerless pod, where we know that nothing will re-create it). Instead of these manual NodeClaims being self-deleted, we will keep them around in whatever failure mode they ended in. Only NodePool-owned NodeClaims will get removed so that they can be recreated, since we know that we can always create more.
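
As a sketch of the proposed gating (again with made-up types, not the real controller), the dedicated ICE controller would only self-delete NodeClaims that are NodePool-owned and no longer held by the disruption controller:

```go
package main

import "fmt"

// nodeClaimView is an illustrative snapshot of the state the proposed ICE
// controller would look at; not Karpenter's real API.
type nodeClaimView struct {
	name             string
	ownedByNodePool  bool // manually created NodeClaims have no NodePool owner
	hitICE           bool // the cloudprovider Create call returned an ICE error
	heldByDisruption bool // the disruption controller is still operating on it
}

// shouldSelfDelete returns true only for NodePool-owned NodeClaims that hit an
// ICE and that the disruption controller has already released (after unmarking
// its candidates). Manual NodeClaims are kept around in their failure mode.
func shouldSelfDelete(nc nodeClaimView) bool {
	return nc.hitICE && nc.ownedByNodePool && !nc.heldByDisruption
}

func main() {
	claims := []nodeClaimView{
		// Still held by the disruption controller: wait, do not self-delete yet.
		{name: "multiplayer-provisioner-arm-xshpb", ownedByNodePool: true, hitICE: true, heldByDisruption: true},
		// Released by the disruption controller: safe to delete and recreate.
		{name: "multiplayer-provisioner-arm-xshpb", ownedByNodePool: true, hitICE: true, heldByDisruption: false},
		// Manually created NodeClaim: left in whatever failure mode it ended in.
		{name: "hand-made-nodeclaim", ownedByNodePool: false, hitICE: true, heldByDisruption: false},
	}
	for _, nc := range claims {
		fmt.Printf("%s: self-delete=%v\n", nc.name, shouldSelfDelete(nc))
	}
}
```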

k8s-ci-robot commented 2 months ago

@jonathan-innis: Reopened this issue.

In response to [this](https://github.com/kubernetes-sigs/karpenter/issues/1293#issuecomment-2188181147).