Open fullykubed opened 3 weeks ago
I believe I have identified the issue.
We have a few pods in our cluster with a topology spread constraint over `node.kubernetes.io/instance-type`:

    topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: node.kubernetes.io/instance-type
        whenUnsatisfiable: DoNotSchedule

We add this to pods that we allow to schedule on spot instances because we want to limit the disruption caused by a spot scale-in event for a single instance type (something that has impacted us in the past).
However, Karpenter's current topology spread logic selects a single, random domain from the eligible domains for the requirement it adds to the nodeclaim (reference).
As a result, the nodeclaim that gets generated will be locked to a single, random allowable instance type, regardless of whether that instance type is 100x too large for the request.
I am not sure why this is the current Karpenter behavior. It seems intentional, given that it is explicitly called out in the comments, but it also seems like the logic could (and arguably should) allow all eligible domains for maximum flexibility.
At the very least, the current logic makes topology spread constraints somewhat dangerous to use in specific scenarios, which I believe deserves a callout in the documentation.
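
To make the difference between the two strategies concrete, here is a minimal sketch. It is not Karpenter's code: the instance catalog, the function names, and the simplified skew check are all assumptions made for illustration, and it only models a single pending pod.

```python
import random

# Hypothetical catalog (instance type -> vCPU count), for illustration only.
CATALOG = {"m5.large": 2, "m5.4xlarge": 16, "m5.24xlarge": 96}


def eligible_domains(counts, max_skew):
    """Domains where placing one more pod keeps the skew within max_skew."""
    min_count = min(counts.values())
    return sorted(d for d, c in counts.items() if c + 1 - min_count <= max_skew)


def single_domain_requirement(counts, max_skew, rng):
    """Behavior described above: pin the claim to one arbitrary eligible domain."""
    return [rng.choice(eligible_domains(counts, max_skew))]


def all_domains_requirement(counts, max_skew):
    """Suggested alternative: keep every eligible domain on the claim."""
    return eligible_domains(counts, max_skew)


def cheapest_fit(allowed_types, cpu_request):
    """Smallest allowed instance type that fits the request, or None."""
    fits = [t for t in allowed_types if CATALOG[t] >= cpu_request]
    return min(fits, key=CATALOG.get) if fits else None


if __name__ == "__main__":
    counts = {t: 0 for t in CATALOG}  # no matching pods placed yet
    pinned = single_domain_requirement(counts, 1, random.Random())
    print("single domain:", pinned, "->", cheapest_fit(pinned, 1))
    flexible = all_domains_requirement(counts, 1)
    print("all domains:", flexible, "->", cheapest_fit(flexible, 1))
```

With a single pinned domain, the "cheapest fit" is whatever type happened to be picked, even for a 1 vCPU request. With all eligible domains kept, the smallest fitting type can be chosen. The sketch ignores the case where several pods with the same constraint are batched onto one claim, which may be part of why the real logic narrows to one domain.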
Could it be because of https://github.com/kubernetes-sigs/karpenter/issues/1239?
Description
Note that I am cross-posting this from https://github.com/aws/karpenter-provider-aws/issues/7254 as the more I look into the issue, the more it seems to be related to core Karpenter logic rather than something on AWS's end.
Observed Behavior:
Occasionally, Karpenter will provision a node that is far, far above what is being requested.
For example, notice the provisioned node below is 10x larger than what is being requested. Moreover, the generated nodeclaim only has a single entry for `instance-types`. That is despite the NodePool (manifest below) having many, many instance types that would fit the scheduling request (which it normally does).
Expected Behavior:
When a set of pods is pending and needs a new node, the generated node claim includes all applicable `instance-types` and an appropriately sized node is created. This normally works correctly and generates logs as follows:
Reproduction Steps (Please include YAML):
It is unclear to me how to reproduce. I have tried all the obvious things and am not able to reliably re-trigger the behavior (it seems to occur somewhat randomly):
I have also verified that the pods do not have any scheduling constraints that would limit them to a single instance type.
In fact, which particular type is chosen for `instance-types` seems somewhat random. Sometimes it is appropriately sized, sometimes it is 10x too large, sometimes it is 100x too large. The instance families also differ. However, what is consistent is that the node claim is (a) created by the `provisioner` controller and (b) generated with just a single type rather than the full expected set.
After the node is created, Karpenter will then usually disrupt it shortly afterward and replace it with a smaller node. However, PDBs have sometimes prevented this, which is when we noticed that this behavior was occurring.
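
For anyone trying to spot affected claims, here is a rough sketch of a check over the JSON from `kubectl get nodeclaims -o json`. It assumes each claim lists its requirements under `spec.requirements` as `key` / `operator` / `values` entries; the helper names are made up for this example.

```python
INSTANCE_TYPE_KEY = "node.kubernetes.io/instance-type"


def pinned_instance_type(nodeclaim):
    """Return the instance type if the claim allows exactly one, else None."""
    # Assumed layout: spec.requirements is a list of {key, operator, values}.
    for req in nodeclaim.get("spec", {}).get("requirements", []):
        if (
            req.get("key") == INSTANCE_TYPE_KEY
            and req.get("operator") == "In"
            and len(req.get("values", [])) == 1
        ):
            return req["values"][0]
    return None


def find_pinned(nodeclaim_list):
    """Map nodeclaim name -> instance type for every single-type claim."""
    pinned = {}
    for item in nodeclaim_list.get("items", []):
        instance_type = pinned_instance_type(item)
        if instance_type is not None:
            pinned[item["metadata"]["name"]] = instance_type
    return pinned
```

To use it, parse the `kubectl` output with `json.load` and pass the resulting dictionary to `find_pinned`. Any name it returns is a claim locked to one instance type, which can then be compared against the size of the pending pods.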
Additionally, all of the NodePools where we have observed this behavior allow spot instances, but I do not know if that is relevant (all of our NodePools are spot-enabled).
Finally, we only started noticing this issue after upgrading to Karpenter v1, or at least it seems far more prevalent now.
Versions:
Chart Version: 1.0.1
Kubernetes Version (`kubectl version`): v1.29.8-eks-a737599
Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment