Open sidewinder12s opened 1 year ago
Note: provisioners are being changed to "nodepools" - see: https://karpenter.sh/docs/upgrading/v1beta1-migration/#provisioner---nodepool
@sidewinder12s it seems that we could create a metric on limits being reached whenever this limit function returns an error: https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/apis/v1beta1/nodepool.go
Of course, that would mean at least one node launch would have to fail before any metric existed for you to alert on - is this what you were thinking of?
Additionally, would it be helpful to have more specific metrics on the exact limit that was reached? e.g. a count of memory-limit exceedances for a node pool, alongside an all-encompassing "limit exceeded" count for that node pool.
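To illustrate the idea, here is a minimal, self-contained Go sketch of that pattern: a limit check that, on failure, bumps both a per-resource counter and an aggregate per-nodepool counter. The names (`checkLimits`, `Limits`, `limitExceededCount`) are hypothetical, and a real implementation would use a `prometheus.CounterVec` rather than a map - this is not Karpenter's actual code.

```go
package main

import "fmt"

// Limits maps a resource name (e.g. "cpu", "memory") to its allowed quantity.
// This is a stand-in for the nodepool's spec.limits field.
type Limits map[string]int64

// limitExceededCount simulates a counter vector labeled by nodepool and
// resource; a real implementation would use a prometheus.CounterVec.
var limitExceededCount = map[string]int{}

// checkLimits returns an error when projected usage exceeds a nodepool's
// limit for some resource, incrementing both a per-resource counter and an
// all-encompassing "limit exceeded" counter for that nodepool.
func checkLimits(nodepool string, limits Limits, usage map[string]int64) error {
	for resource, limit := range limits {
		if usage[resource] > limit {
			limitExceededCount[nodepool+"/"+resource]++ // specific limit, e.g. memory
			limitExceededCount[nodepool]++              // aggregate for the nodepool
			return fmt.Errorf("%s limit exceeded for nodepool %s", resource, nodepool)
		}
	}
	return nil
}

func main() {
	limits := Limits{"cpu": 100}
	err := checkLimits("default", limits, map[string]int64{"cpu": 120})
	fmt.Println(err, limitExceededCount["default/cpu"], limitExceededCount["default"])
}
```

An alerting system can then fire on any increase of the aggregate counter, while the per-resource counters answer "which limit?" - at the cost of higher cardinality, as noted below.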
I'd love to pop a status condition for this, and then generate metrics and events from the status condition.
@ellistarn sounds interesting - presumably you could use this new framework to handle the related issue (https://github.com/kubernetes-sigs/karpenter/issues/705) too? Looking it over briefly, it seems it could work well here.
Yes. I'd like to see what about karpenter's operations can't be modeled in this way. I'm hopeful that we can cover almost everything with this mechanism.
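A sketch of what "metrics and events from status conditions" could look like, assuming a minimal `Condition` struct modeled loosely on `metav1.Condition`; the `conditionGauges` helper is hypothetical, not an existing Karpenter API:

```go
package main

import "fmt"

// Condition is a minimal stand-in for a status condition on a Kubernetes
// object; field names mirror metav1.Condition, but this is only a sketch.
type Condition struct {
	Type   string // e.g. "LimitExceeded"
	Status string // "True" or "False"
	Reason string // e.g. "MemoryLimitExceeded"
}

// conditionGauges turns each status condition into a 0/1 gauge sample keyed
// by "<nodepool>/<conditionType>" - the shape a CR-to-metrics exporter
// (kube-state-metrics style) would emit, so alerts and events can both be
// derived from the same condition.
func conditionGauges(nodepool string, conds []Condition) map[string]float64 {
	out := map[string]float64{}
	for _, c := range conds {
		v := 0.0
		if c.Status == "True" {
			v = 1.0
		}
		out[nodepool+"/"+c.Type] = v
	}
	return out
}

func main() {
	g := conditionGauges("default", []Condition{
		{Type: "LimitExceeded", Status: "True", Reason: "MemoryLimitExceeded"},
	})
	fmt.Println(g["default/LimitExceeded"])
}
```

The appeal of this design is that the status condition becomes the single source of truth: the metric, the Kubernetes event, and `kubectl describe` output all derive from the same place.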
Yes, I was hoping for an easier-to-consume metric produced by Karpenter that says "I am being blocked from launching more capacity."
A lot of the current suggestions for alerting on a condition like this (an unschedulable pod, for example) rely on other systems to produce the metrics.
I think having it broken out by limit breached could be useful, though at scale the potential cardinality of those metrics might be of concern.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
We have dozens of clusters and thousands of nodes. Such an easy metric "on reaching limits", would make life so much easier. <3
/remove-lifecycle stale
Description
What problem are you trying to solve?
I need a clear way to alert when provisioners hit their limits and will no longer provision capacity. We've tried setting some metric alerts based on % of the limit provisioned but this has turned out to be tricky.
This problem is even worse if you have a diverse set of workloads of mixed sizes. In some of our clusters we slowly creep up to 90+ percent utilization with lots of smaller workloads, so the alerts work as intended; other times large pods come in that can't provision capacity and are blocked.
I think it'd be easier/simpler to operate if there were another metric per provisioner that just said "I am currently limited", in addition to the metrics that currently exist.
Another option might be to alert on pods failing to schedule, but that would likely produce its own issues given the multitude of ways pods can fail to schedule.
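For reference, the percent-of-limit approach described above can be expressed as a Prometheus alerting rule roughly like the one below. The metric and label names (`karpenter_nodepool_usage`, `karpenter_nodepool_limit`, `nodepool`, `resource_type`) are assumptions that vary across Karpenter versions, so verify them against your deployment's `/metrics` endpoint before using this:

```yaml
groups:
  - name: karpenter-limits
    rules:
      - alert: NodePoolNearLimit
        # Assumed metric/label names; check your Karpenter version's metrics.
        expr: karpenter_nodepool_usage / karpenter_nodepool_limit > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "NodePool {{ $labels.nodepool }} is above 90% of its {{ $labels.resource_type }} limit"
```

As the description explains, a fixed threshold like 90% misses the case where one large pod is blocked well below the threshold, which is exactly why a direct "I am currently limited" signal would be simpler to operate.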
How important is this feature to you?
Fairly important operational consideration.
Related issue around how hard it is to tell if you've hit the provisioner limit: https://github.com/aws/karpenter-core/issues/686