kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

Improve error messages if resource limit is reached #686

Open · runningman84 opened this issue 1 year ago

runningman84 commented 1 year ago

Description

What problem are you trying to solve? The error message does not show the real cause:

karpenter/karpenter-9d8575fb-hmrf7[controller]: 2023-07-24T09:45:37.841Z    ERROR   controller.provisioner  Could not schedule pod, incompatible with provisioner "arm", did not tolerate arch=arm64:NoSchedule; incompatible with provisioner "x86", no instance type satisfied resources {"cpu":"16","memory":"32Gi","pods":"1"} and requirements karpenter.k8s.aws/instance-category In [c m r], karpenter.k8s.aws/instance-generation Exists >3, karpenter.k8s.aws/instance-hypervisor In [nitro], kubernetes.io/os In [linux], karpenter.sh/capacity-type In [on-demand], kubernetes.io/arch In [amd64], karpenter.sh/provisioner-name In [x86]    {"commit": "dc3af1a", "pod": "canda/canda-avstock-7bbc89bb7b-8t9lk"}

In this case the real cause was the provisioner's CPU resource limit (spec.limits.resources.cpu):

  limits:
    resources:
      cpu: "200" 

How important is this feature to you? We've had a lot of these issues lately, and a better error message like this would help:

Could not schedule pod, incompatible with provisioner "arm" ..., incompatible with provisioner "x86" due to cpu limit (196 out of 200)

sidewinder12s commented 1 year ago

100% agree on this. The more complicated the provisioner configuration and the more provisioners you have configured, the more indecipherable these error messages become for end users, and they're nearly as indecipherable for admins.

  Warning  FailedScheduling  24m                  karpenter          Failed to schedule pod, incompatible with provisioner "buildfarm-gpu", daemonset overhead={"cpu":"1510m","ephemeral-storage":"1074Mi","memory":"1659Mi","pods":"11"}, no instance type satisfied resources {"cpu":"61510m","ephemeral-storage":"1074Mi","memory":"99963Mi","pods":"12"} and requirements dedicated-node In [buildfarm], node-subtype In [gpu], node-type In [buildfarm], pricing-model In [on-demand], karpenter.k8s.aws/instance-category In [g], karpenter.k8s.aws/instance-encryption-in-transit-supported In [true], karpenter.k8s.aws/instance-generation In [4 5 6], karpenter.sh/capacity-type In [on-demand], karpenter.sh/provisioner-name In [buildfarm-gpu], kubernetes.io/arch In [amd64], kubernetes.io/os In [linux], node.kubernetes.io/instance-type In [m5dn.16xlarge r5dn.16xlarge], node.kubernetes.io/lifecycle In [on-demand], topology.kubernetes.io/zone In [us-east-1b] (no instance type met all requirements); incompatible with provisioner "default", daemonset overhead={"cpu":"1510m","ephemeral-storage":"1074Mi","memory":"1659Mi","pods":"11"}, incompatible requirements, label "dedicated-node" does not have known values; incompatible with provisioner "system", daemonset overhead={"cpu":"1510m","ephemeral-storage":"1074Mi","memory":"1659Mi","pods":"11"}, did not tolerate node-type=system:NoSchedule; incompatible with provisioner "buildfarm", daemonset overhead={"cpu":"1510m","ephemeral-storage":"1074Mi","memory":"1659Mi","pods":"11"}, no instance type satisfied resources {"cpu":"61510m","ephemeral-storage":"1074Mi","memory":"99963Mi","pods":"12"} and requirements dedicated-node In [buildfarm], node-subtype In [cpu], node-type In [buildfarm], pricing-model In [on-demand], karpenter.k8s.aws/instance-category In [c m r], karpenter.k8s.aws/instance-encryption-in-transit-supported In [true], karpenter.k8s.aws/instance-generation Exists >4, karpenter.sh/capacity-type In [on-demand], karpenter.sh/provisioner-name In [buildfarm], kubernetes.io/arch In [amd64], kubernetes.io/os In [linux], node.kubernetes.io/instance-type In [m5dn.16xlarge r5dn.16xlarge], node.kubernetes.io/lifecycle In [on-demand], topology.kubernetes.io/zone In [us-east-1b] (no instance type met the scheduling requirements or had enough resources)

jonathan-innis commented 10 months ago

The more complicated the provisioner configuration and the more provisioners you have configured, the more indecipherable these error messages become for end users, and they're nearly as indecipherable for admins

There's definitely some work that we should do here. It gets a tad complicated since it's hard for us to know exactly which NodePool you intended to schedule to in the first place, so we print out all of the incompatibilities for completeness.

@sidewinder12s @runningman84 Did y'all have any thoughts around how we could make this error message shorter and more targeted to help you discover the exact issue?

runningman84 commented 10 months ago

I still think that my original suggestion seems good: Could not schedule pod, incompatible with provisioner "arm" ..., incompatible with provisioner "x86" due to cpu limit (196 out of 200). In case a given provisioner does not fit due to limits, just state that and ignore all other conditions for that provisioner…
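
A rough Go sketch of that idea, purely illustrative and not Karpenter's actual scheduler code: check the provisioner's limit first, and when it is the blocker, return only the targeted limit message instead of the full requirement dump. The helper name incompatibilityMessage and its signature are made up for this sketch.

    // Illustrative only: prefer a short "due to <resource> limit" message when a
    // provisioner limit is the real blocker; otherwise fall back to the detailed
    // incompatibility text.
    package main

    import (
        "fmt"

        "k8s.io/apimachinery/pkg/api/resource"
    )

    // incompatibilityMessage is a hypothetical helper, not part of Karpenter.
    func incompatibilityMessage(provisioner, res string, used, requested, limit resource.Quantity, detailed string) string {
        total := used.DeepCopy()
        total.Add(requested)
        if total.Cmp(limit) > 0 {
            // Short-circuit: the limit alone explains the failure, so skip the
            // per-requirement details for this provisioner.
            return fmt.Sprintf("incompatible with provisioner %q due to %s limit (%s out of %s)",
                provisioner, res, used.String(), limit.String())
        }
        return fmt.Sprintf("incompatible with provisioner %q, %s", provisioner, detailed)
    }

    func main() {
        msg := incompatibilityMessage("x86", "cpu",
            resource.MustParse("196"), resource.MustParse("16"), resource.MustParse("200"),
            "no instance type satisfied resources and requirements ...")
        fmt.Println(msg) // incompatible with provisioner "x86" due to cpu limit (196 out of 200)
    }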

sidewinder12s commented 10 months ago

Ya, either what @runningman84 said, to keep the message consistent with how those messages are written out, or something even more explicit like: compatible with provisioner X but limit is reached. Though that might cause some confusion if you have overlapping provisioners.

sadath-12 commented 9 months ago

/assign

k8s-triage-robot commented 6 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

sidewinder12s commented 6 months ago

/remove-lifecycle stale

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

sidewinder12s commented 3 months ago

/remove-lifecycle stale

k8s-triage-robot commented 3 days ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

sidewinder12s commented 3 days ago

/remove-lifecycle stale