
The `batchIdleDuration` value is ignored #1173

Open MauroSoli opened 7 months ago

MauroSoli commented 7 months ago

Description

Observed Behavior:

Karpenter always waits for the batchMaxDuration value and ignores the batchIdleDuration value when multiple pods are in Pending status.

I changed the values to batchIdleDuration=10s and batchMaxDuration=90s to make the behaviour clearer.

As you can see in the following image, the pods had been in Pending state for more than 60 seconds and Karpenter had not created a new NodeClaim yet.

[screenshot: pods stuck in Pending while no NodeClaim has been created]

Here are the controller logs:

{"level":"DEBUG","time":"2024-04-08T17:36:48.408Z","logger":"controller.nodeclass","message":"discovered security groups","commit":"17dd42b","ec2nodeclass":"default","security-groups":["sg-*******"]}
{"level":"DEBUG","time":"2024-04-08T17:36:48.417Z","logger":"controller.nodeclass","message":"discovered kubernetes version","commit":"17dd42b","ec2nodeclass":"default","version":"1.29"}
{"level":"DEBUG","time":"2024-04-08T17:36:48.421Z","logger":"controller.pricing","message":"updated spot pricing with instance types and offerings","commit":"17dd42b","instance-type-count":762,"offering-count":1927}
{"level":"DEBUG","time":"2024-04-08T17:36:48.605Z","logger":"controller.nodeclass","message":"discovered amis","commit":"17dd42b","ec2nodeclass":"default","ids":"ami-00a591fc812caa612, ami-0fba9088684638922, ami-0fba9088684638922, ami-0ec163aec4a900dc0","count":4}
{"level":"DEBUG","time":"2024-04-08T17:36:49.553Z","logger":"controller.pricing","message":"updated on-demand pricing","commit":"17dd42b","instance-type-count":706}
{"level":"DEBUG","time":"2024-04-08T17:36:50.692Z","logger":"controller.disruption","message":"discovered instance types","commit":"17dd42b","count":704}
{"level":"DEBUG","time":"2024-04-08T17:36:50.850Z","logger":"controller.disruption","message":"discovered offerings for instance types","commit":"17dd42b","instance-type-count":706}
{"level":"DEBUG","time":"2024-04-08T17:36:50.850Z","logger":"controller.disruption","message":"discovered zones","commit":"17dd42b","zones":["eu-central-1c","eu-central-1b","eu-central-1a"]}
{"level":"DEBUG","time":"2024-04-08T17:38:51.395Z","logger":"controller.disruption","message":"discovered subnets","commit":"17dd42b","subnets":["subnet-******** (eu-central-1b)","subnet-******** (eu-central-1c)","subnet-******** (eu-central-1a)"]}
{"level":"DEBUG","time":"2024-04-08T17:45:31.914Z","logger":"controller.provisioner","message":"73 out of 704 instance types were excluded because they would breach limits","commit":"17dd42b","nodepool":"nodepool-eu-central-1a"}
{"level":"INFO","time":"2024-04-08T17:45:31.936Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"17dd42b","pods":"default/pause1, default/pause2, default/pause3, default/pause4, default/pause5 and 5 other(s)","duration":"84.72464ms"}
{"level":"INFO","time":"2024-04-08T17:45:31.936Z","logger":"controller.provisioner","message":"computed new nodeclaim(s) to fit pod(s)","commit":"17dd42b","nodeclaims":1,"pods":10}
{"level":"INFO","time":"2024-04-08T17:45:31.952Z","logger":"controller.provisioner","message":"created nodeclaim","commit":"17dd42b","nodepool":"nodepool-eu-central-1a","nodeclaim":"nodepool-eu-central-1a-8rzln","requests":{"cpu":"210m","memory":"41200Mi","pods":"14"},"instance-types":"c6a.12xlarge, c6a.16xlarge, c6a.24xlarge, c6a.32xlarge, c6a.8xlarge and 55 other(s)"}
{"level":"DEBUG","time":"2024-04-08T17:45:32.461Z","logger":"controller.nodeclaim.lifecycle","message":"created launch template","commit":"17dd42b","nodeclaim":"nodepool-eu-central-1a-8rzln","launch-template-name":"karpenter.k8s.aws/12721094165069015774","id":"lt-0b41adc259308f38f"}
{"level":"DEBUG","time":"2024-04-08T17:45:34.326Z","logger":"controller.disruption","message":"waiting on cluster sync","commit":"17dd42b"}
{"level":"INFO","time":"2024-04-08T17:45:34.478Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"17dd42b","nodeclaim":"nodepool-eu-central-1a-8rzln","provider-id":"aws:///eu-central-1a/i-0a8468578de536001","instance-type":"z1d.3xlarge","zone":"eu-central-1a","capacity-type":"spot","allocatable":{"cpu":"11900m","ephemeral-storage":"224Gi","memory":"88002Mi","pods":"110","vpc.amazonaws.com/pod-eni":"54"}}
{"level":"DEBUG","time":"2024-04-08T17:45:55.545Z","logger":"controller.disruption","message":"discovered subnets","commit":"17dd42b","subnets":["subnet-******** (eu-central-1a)","subnet-******** (eu-central-1b)","subnet-******** (eu-central-1c)"]}
{"level":"INFO","time":"2024-04-08T17:46:11.252Z","logger":"controller.nodeclaim.lifecycle","message":"registered nodeclaim","commit":"17dd42b","nodeclaim":"nodepool-eu-central-1a-8rzln","provider-id":"aws:///eu-central-1a/i-0a8468578de536001","node":"ip-10-1-6-176.eu-central-1.compute.internal"}
{"level":"INFO","time":"2024-04-08T17:46:24.385Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim","commit":"17dd42b","nodeclaim":"nodepool-eu-central-1a-8rzln","provider-id":"aws:///eu-central-1a/i-0a8468578de536001","node":"ip-10-1-6-176.eu-central-1.compute.internal","allocatable":{"cpu":"11900m","ephemeral-storage":"240506825124","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"94711256Ki","pods":"110"}}
{"level":"DEBUG","time":"2024-04-08T17:46:57.299Z","logger":"controller.disruption","message":"discovered subnets","commit":"17dd42b","subnets":["subnet-******** (eu-central-1b)","subnet-******** (eu-central-1c)","subnet-******** (eu-central-1a)"]}

Expected Behavior:

The NodeClaim should be created as soon as the batchIdleDuration has elapsed with no new pending pods scheduled on the cluster.

Versions:

jigisha620 commented 7 months ago

@MauroSoli Just to be sure, would you mind verifying that BATCH_MAX_DURATION and BATCH_IDLE_DURATION are set properly? You can do this by running `kubectl describe pod <karpenter pod>`. Also, there is not much we can see from the logs you have shared. Can you share the logs with the message `found provisionable pod(s)` so we can see when the actual scheduling of the pods happened?

What is the size of the cluster that you are working with? Is it a test cluster that does not have too many pods/deployments? You have mentioned that the pods are in a pending state for more than 60 seconds. Given that the node can take some time to come up, do you expect to see the pods scheduled right after 10 seconds?

MauroSoli commented 7 months ago

> Just to be sure, would you mind verifying that BATCH_MAX_DURATION and BATCH_IDLE_DURATION are set properly? You can do this by running `kubectl describe pod`

$ kubectl describe pod karpenter-678d69d4d5-6rpgw -n karpenter | grep BATCH_
      BATCH_MAX_DURATION:  90s
      BATCH_IDLE_DURATION: 10s

> Also, there is not much we can see from the logs you have shared. Can you share the logs with the message `found provisionable pod(s)` so we can see when the actual scheduling of the pods happened?

The log line you are looking for is already in the logs I shared:

{"level":"INFO","time":"2024-04-08T17:45:31.936Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"17dd42b","pods":"default/pause1, default/pause2, default/pause3, default/pause4, default/pause5 and 5 other(s)","duration":"84.72464ms"}

> What is the size of the cluster that you are working with? Is it a test cluster that does not have too many pods/deployments?

We use Karpenter only to manage some specific workloads, like builds or cron jobs. The cluster-autoscaler manages all other workloads.

> You have mentioned that the pods are in a pending state for more than 60 seconds. Given that the node can take some time to come up, do you expect to see the pods scheduled right after 10 seconds?

That's not what I meant. After 60 seconds, while the pods were still pending, Karpenter had not yet created a new NodeClaim; that only happened after 90 seconds.

jigisha620 commented 7 months ago

Let's say you schedule a single pod and no new pod arrives within 10 seconds; in that case Karpenter will begin scheduling a new NodeClaim after those 10 seconds. However, if another pod comes up before the 10 seconds have passed, the batching window will be extended, up to the maxDuration.

From the logs that you have shared it seems like there were 10 pods, and in that case Karpenter would wait until the batchMaxDuration of 90s before it began scheduling a NodeClaim.

lesinigo commented 7 months ago

I've been able to reproduce this using the latest Karpenter v0.35.4 and the latest EKS v1.29.1-eks-508b6b3, and it seems to me there is some mismatch between what is stated above, what the documentation says, and the actual Karpenter behavior.

If I set IDLE_DURATION and MAX_DURATION high and far enough apart (let's say 10 seconds and 300 seconds) and run a script like `for I in $(seq 10); do kubectl run <single_pod>; sleep 2; done`, the following happens:

  1. new pods are created
  2. the "nothing happens" interval between one pod and the next is always less than 10 seconds
  3. creating all the pods takes longer than IDLE_DURATION overall
  4. after the last pod gets created, Karpenter waits until 300 seconds have passed since the creation of the first pod
  5. Karpenter creates new nodeclaim(s) to fit all the pending pods

In other words: the batch window gets extended all the way to BATCH_MAX_DURATION as soon as the creation of the second pod interrupts the first BATCH_IDLE_DURATION interval.
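Roughly speaking, what I observe behaves like the following toy Go sketch. It is purely illustrative and is not Karpenter's actual batcher code; the function name `observedBatch` and the channel-based simulation are made up, and the timings mirror my reproduction above:

```go
package main

import (
	"fmt"
	"time"
)

// observedBatch is a toy model of the behavior I am seeing: the first
// pending pod starts an idle timer, but as soon as a second pod arrives
// inside that first idle window the batcher stops consulting the idle
// timer and simply waits out maxDuration, measured from the first pod.
func observedBatch(pods <-chan struct{}, idle, max time.Duration) {
	maxTimer := time.NewTimer(max)   // hard cap, measured from the first pod
	idleTimer := time.NewTimer(idle) // only the first idle window matters here
	defer maxTimer.Stop()
	defer idleTimer.Stop()

	select {
	case <-idleTimer.C:
		return // no second pod within the idle window: the batch closes early
	case <-maxTimer.C:
		return
	case <-pods:
		// a second pod interrupted the first idle window; from here on the
		// window only closes once BATCH_MAX_DURATION has elapsed
		<-maxTimer.C
	}
}

func main() {
	pods := make(chan struct{}, 16)
	go func() { // mirrors the `kubectl run ...; sleep 2` loop above
		for i := 0; i < 9; i++ {
			time.Sleep(2 * time.Second)
			pods <- struct{}{}
		}
	}()
	start := time.Now()
	observedBatch(pods, 10*time.Second, 300*time.Second)
	fmt.Println("batch closed after", time.Since(start).Round(time.Second)) // ~5m0s
}
```

With IDLE_DURATION=10s, MAX_DURATION=300s and a new pod every 2 seconds, this simulation only closes the batch after the full 300 seconds, which matches what I see in the cluster.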

Personally, I read the current documentation differently. It says: "_BATCH_IDLE_DURATION The maximum amount of time with no new pending pods that if exceeded ends the current batching window. If pods arrive faster than this time, the batching window will be extended up to the maxDuration. If they arrive slower, the pods will be batched separately._" I understood that as follows:

  1. one or more new pods are created
  2. Karpenter notices them and starts waiting for the IDLE_DURATION
  3. if nothing new happens, the batch window is immediately closed and node claims are computed
  4. otherwise, if new pods are created before BATCH_IDLE_DURATION elapses, go back to step 2 (starting a new IDLE_DURATION wait)
  5. if pods keep getting created before an IDLE_DURATION ends, Karpenter will keep extending the batch window in blocks of IDLE_DURATION, unless the whole window (defined as now minus the creation time of the first pod) exceeds BATCH_MAX_DURATION
  6. if the window extends beyond BATCH_MAX_DURATION, it is immediately closed, node claims are computed, and the whole process starts again from step 1

In other words: BATCH_IDLE_DURATION will always be the maximum "inactivity time" after which Karpenter starts computing node claims. To me, this makes much more sense because it allows shorter latencies between workload demand (pod creation) and its execution (node claims computed, nodes created, pods starting to run), while MAX_DURATION still guarantees an upper bound on that latency by closing the batch window even if new pods keep arriving.
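Purely as an illustration of that reading (again, not a claim about the actual implementation; `expectedBatch` is a made-up name), the documented behavior as I understand it would look more like this sketch, where each new pod only restarts the idle timer:

```go
package main

import (
	"fmt"
	"time"
)

// expectedBatch is a toy model of the behavior I expected from the docs:
// every new pending pod restarts the idle timer, and the window closes when
// either the idle timer fires with no new pods or maxDuration has elapsed
// since the first pod, whichever happens first.
func expectedBatch(pods <-chan struct{}, idle, max time.Duration) {
	maxTimer := time.NewTimer(max)   // hard cap, measured from the first pod
	idleTimer := time.NewTimer(idle) // restarted on every new pod
	defer maxTimer.Stop()
	defer idleTimer.Stop()

	for {
		select {
		case <-idleTimer.C:
			return // no new pod for BATCH_IDLE_DURATION: close the window
		case <-maxTimer.C:
			return // BATCH_MAX_DURATION exceeded: close the window anyway
		case <-pods:
			// a new pod arrived in time: extend the window by restarting only
			// the idle timer; the hard cap keeps counting from the first pod
			if !idleTimer.Stop() {
				<-idleTimer.C
			}
			idleTimer.Reset(idle)
		}
	}
}

func main() {
	pods := make(chan struct{}, 16)
	go func() { // same simulated arrival pattern: a new pod every 2 seconds
		for i := 0; i < 9; i++ {
			time.Sleep(2 * time.Second)
			pods <- struct{}{}
		}
	}()
	start := time.Now()
	expectedBatch(pods, 10*time.Second, 300*time.Second)
	fmt.Println("batch closed after", time.Since(start).Round(time.Second)) // ~28s
}
```

Under the same simulated arrival pattern, this closes the batch roughly one IDLE_DURATION after the last pod, i.e. after about 28 seconds instead of 300.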

To me, the documentation seems correct and describes the behavior I would expect, but Karpenter is misbehaving: it actually honors the IDLE_DURATION only the first time, then skips directly to the MAX_DURATION, causing higher latency between pod creation and nodes starting.

jigisha620 commented 7 months ago

I had probably misunderstood what was being implied earlier. I was able to reproduce this. Looking into creating a fix for this. Thanks.

jonathan-innis commented 7 months ago

/assign @jigisha620

billrayburn commented 5 months ago

/label v1

k8s-ci-robot commented 5 months ago

@billrayburn: The label(s) /label v1 cannot be applied. These labels are supported: api-review, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, team/katacoda, refactor. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to [this](https://github.com/kubernetes-sigs/karpenter/issues/1173#issuecomment-2179429982):

> /label v1

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

MauroSoli commented 1 month ago

/remove-lifecycle stale

k8s-triage-robot commented 4 weeks ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten