kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

Karpenter should calculate DaemonSet overhead correctly when using `karpenter.sh/initialized` and `karpenter.sh/registered` #919

Open jonathan-innis opened 8 months ago

jonathan-innis commented 8 months ago

Description

Observed Behavior:

This issue was first reported in https://github.com/aws/karpenter-provider-aws/issues/5406 (see that issue for more context). Karpenter was not calculating DaemonSet overhead correctly when a user specified node selection/node affinity logic for DaemonSets that referenced `karpenter.sh/initialized` and `karpenter.sh/registered`. The `karpenter.sh/registered` label is added to the node once it has properly joined the cluster, and the `karpenter.sh/initialized` label is added once Karpenter deems the node fully initialized.

Currently, there is no logic in the DaemonSet overhead calculation that recognizes that these labels will eventually be added to the node and should therefore be considered when there is node selection logic against them. The lack of this logic can lead to us undershooting the resources that DaemonSets and applications need in order to schedule.
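For illustration, a DaemonSet whose pod template carries node selection logic like the following hypothetical fragment is affected:

```yaml
# Hypothetical DaemonSet pod-template fragment. Because karpenter.sh/initialized
# is not yet present on the node representation Karpenter schedules against,
# this DaemonSet's requests are currently left out of the overhead calculation.
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: karpenter.sh/initialized
                operator: Exists
```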

Expected Behavior:

Karpenter should assume that DaemonSets that select on `karpenter.sh/registered` and `karpenter.sh/initialized` will schedule against Karpenter nodes.

jonathan-innis commented 8 months ago

We should also check that instance type labels work correctly here as well, e.g. if someone sets an affinity on their DaemonSet like:

```yaml
matchExpressions:
- key: node.kubernetes.io/instance-type
  operator: Exists
```

Do we account for this correctly, or do we not know this label exists since we are just using the NodePool representation to track DaemonSet overhead?
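For context, such an expression would sit inside the DaemonSet's node affinity; an illustrative snippet:

```yaml
# Illustrative only: every node eventually carries the well-known
# node.kubernetes.io/instance-type label, but the NodePool representation
# used for overhead tracking may not enumerate it explicitly.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node.kubernetes.io/instance-type
          operator: Exists
```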

sadath-12 commented 7 months ago

/assign

sadath-12 commented 7 months ago

@jonathan-innis Should we really handle this? Those labels are meant for internal dev purposes, and DaemonSets are meant to be scheduled on every node, so there might not be a case where DaemonSets should be scheduled on Karpenter-managed nodes but not on other nodes, right?

jonathan-innis commented 7 months ago

> Those labels are meant for internal dev purposes, and DaemonSets are meant to be scheduled on every node

You can create node selector requirements or node affinities to run DaemonSets only on certain nodes. The use-case we saw was someone running an EKS cluster that had mostly Karpenter-managed capacity but ran a small amount of capacity (including Karpenter itself) on Fargate nodes. They didn't want the DaemonSet pods to run on the Fargate nodes, so they created selection logic to only schedule the DaemonSet pods against the Karpenter-provisioned nodes.
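A minimal sketch of that selection logic, assuming the `karpenter.sh/registered` label carries the value `"true"` once a node joins:

```yaml
# Sketch of a DaemonSet pod-template fragment: schedule only onto
# Karpenter-launched nodes, never onto the Fargate capacity.
spec:
  template:
    spec:
      nodeSelector:
        karpenter.sh/registered: "true"
```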

I think we should try to be correct here to the best of our ability. Since we know that these labels are eventually going to appear on the node, we should schedule DaemonSet pods assuming that will be the case.

sadath-12 commented 7 months ago

Then in that case, should we use a different label on DaemonSets (or any other pods) that need to be scheduled on Karpenter-managed nodes, instead of conditioning on labels that are meant for dev purposes? Otherwise any user can use any label, and we might need to handle every case. Instead, we could take a preventive measure and document how to achieve exactly what the user was intending? Correct me if I'm wrong, thanks 😊

jonathan-innis commented 7 months ago

> Instead, we could take a preventive measure and document how to achieve exactly what the user was intending

Agreed, I think it makes sense not to recommend that users go this route to select against Karpenter-managed nodes.
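As a sketch of that guidance (the `example.com/karpenter-managed` label name is hypothetical): a NodePool can stamp a user-owned label onto every node it launches via `spec.template.metadata.labels`, and DaemonSets can then select on that label instead of the internal ones:

```yaml
apiVersion: karpenter.sh/v1beta1  # apiVersion may differ across Karpenter versions
kind: NodePool
metadata:
  name: default
spec:
  template:
    metadata:
      labels:
        example.com/karpenter-managed: "true"  # hypothetical user-owned label
    # ...other template fields (requirements, nodeClassRef, etc.) omitted
```

The DaemonSet would then use a plain `nodeSelector` on `example.com/karpenter-managed: "true"`, which avoids depending on labels that Karpenter manages internally.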

> conditioning on labels that are meant for dev purposes

I'd argue that they aren't just for a dev purpose, and that we should be correct if there isn't a large technical burden to being correct here. In this case, we just need to expect that these labels will be added to any node launched by a NodePool when we calculate the DaemonSet overhead.
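Concretely, once registration and initialization complete, a NodePool-launched node carries labels like the following (values illustrative), so the overhead calculation can treat them as present when matching DaemonSet selectors:

```yaml
# Labels eventually present on a Karpenter-launched node (illustrative values):
metadata:
  labels:
    karpenter.sh/nodepool: default
    karpenter.sh/registered: "true"
    karpenter.sh/initialized: "true"
    node.kubernetes.io/instance-type: m5.large
```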

jonathan-innis commented 3 months ago

/unassign @sadath-12

k8s-triage-robot commented 3 weeks ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale