Open · runningman84 opened this issue 1 year ago
100% agree on this. The more complicated the provisioner configuration and the greater the number of provisioners you have configured, the more indecipherable these error messages become for end users, and close to it for admins.
```
Warning  FailedScheduling  24m  karpenter  Failed to schedule pod, incompatible with provisioner "buildfarm-gpu", daemonset overhead={"cpu":"1510m","ephemeral-storage":"1074Mi","memory":"1659Mi","pods":"11"}, no instance type satisfied resources {"cpu":"61510m","ephemeral-storage":"1074Mi","memory":"99963Mi","pods":"12"} and requirements dedicated-node In [buildfarm], node-subtype In [gpu], node-type In [buildfarm], pricing-model In [on-demand], karpenter.k8s.aws/instance-category In [g], karpenter.k8s.aws/instance-encryption-in-transit-supported In [true], karpenter.k8s.aws/instance-generation In [4 5 6], karpenter.sh/capacity-type In [on-demand], karpenter.sh/provisioner-name In [buildfarm-gpu], kubernetes.io/arch In [amd64], kubernetes.io/os In [linux], node.kubernetes.io/instance-type In [m5dn.16xlarge r5dn.16xlarge], node.kubernetes.io/lifecycle In [on-demand], topology.kubernetes.io/zone In [us-east-1b] (no instance type met all requirements); incompatible with provisioner "default", daemonset overhead={"cpu":"1510m","ephemeral-storage":"1074Mi","memory":"1659Mi","pods":"11"}, incompatible requirements, label "dedicated-node" does not have known values; incompatible with provisioner "system", daemonset overhead={"cpu":"1510m","ephemeral-storage":"1074Mi","memory":"1659Mi","pods":"11"}, did not tolerate node-type=system:NoSchedule; incompatible with provisioner "buildfarm", daemonset overhead={"cpu":"1510m","ephemeral-storage":"1074Mi","memory":"1659Mi","pods":"11"}, no instance type satisfied resources {"cpu":"61510m","ephemeral-storage":"1074Mi","memory":"99963Mi","pods":"12"} and requirements dedicated-node In [buildfarm], node-subtype In [cpu], node-type In [buildfarm], pricing-model In [on-demand], karpenter.k8s.aws/instance-category In [c m r], karpenter.k8s.aws/instance-encryption-in-transit-supported In [true], karpenter.k8s.aws/instance-generation Exists >4, karpenter.sh/capacity-type In [on-demand], karpenter.sh/provisioner-name In [buildfarm], kubernetes.io/arch In [amd64], kubernetes.io/os In [linux], node.kubernetes.io/instance-type In [m5dn.16xlarge r5dn.16xlarge], node.kubernetes.io/lifecycle In [on-demand], topology.kubernetes.io/zone In [us-east-1b] (no instance type met the scheduling requirements or had enough resources)
```
> The more complicated the provisioner configuration and the greater the number of provisioners you have configured, the more indecipherable these error messages become for end users, and close to it for admins.
There's definitely some work we should do here. It gets a tad complicated since it's hard for us to know exactly which NodePool you intended to schedule to in the first place; so we print out all the incompatibilities for completeness.
@sidewinder12s @runningman84 Did y'all have any thoughts around how we could make this error message shorter and more targeted to help you discover the exact issue?
I still think that my original suggestion seems good: `Could not schedule pod, incompatible with provisioner "arm" ..., incompatible with provisioner "x86" due to cpu limit (196 out of 200)`. In case a given provisioner does not fit due to its limits, just state that and ignore all other conditions for that provisioner…
Yeah, either what @runningman84 said, to keep the message consistent with how those messages are currently written out, or something even more explicit like: compatible with provisioner X but limit is reached. Though that might cause some confusion if you have overlapping provisioners.
/assign
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After a period of inactivity, `lifecycle/stale` is applied
- After further inactivity once `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After further inactivity once `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
/remove-lifecycle stale
Description
**What problem are you trying to solve?**
The error message does not show the real cause. In this case it was a cpu resource limit:

**How important is this feature to you?**
We had a lot of these issues lately, and a better error message like this would help:
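For context, the limit being hit here is a provisioner-level resource cap: once the aggregate resources of nodes launched by a provisioner reach `spec.limits.resources`, Karpenter stops provisioning new capacity for it, and pods then fail to schedule. A fragment along these lines (v1alpha5 API; names and values are illustrative, chosen to match the 196-of-200 example above):

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: buildfarm
spec:
  limits:
    resources:
      cpu: "200"  # once launched nodes total 200 vCPUs, this
                  # provisioner launches no further capacity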