kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

Improved Provisioner Limit Metric/Alerting #676

Open · sidewinder12s opened 1 year ago

sidewinder12s commented 1 year ago

Description

What problem are you trying to solve?

I need a clear way to alert when provisioners hit their limits and will no longer provision capacity. We've tried setting metric alerts based on the percentage of the limit provisioned, but this has turned out to be tricky.

  1. Tried setting an alert for 100% utilization. This did not work, as we'd almost never hit exactly 100%: instance sizes never match up exactly with the limit.
  2. Tried setting an alert for 90% utilization. This has also failed in a couple of cases where new workloads came in requiring significant capacity; with the provisioner already at 80%, it could not provision more.

The latter problem is even worse if you have a diverse set of workloads of mixed sizes. In some of our clusters we slowly creep up to 90+ percent utilization with lots of smaller workloads, so the alerts work as intended; other times large pods come in that capacity can't be provisioned for, and they're blocked.

I think it'd be easier/simpler to operate if there was another metric per provisioner that simply said "I am currently limited", in addition to the metric that currently exists.
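
For illustration, something along these lines is the kind of thing I have in mind (the metric name, labels, and wiring below are hypothetical, not existing Karpenter code):

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical gauge: 1 when a provisioner/NodePool can no longer provision
// because one of its limits has been reached, 0 otherwise.
var nodePoolLimited = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Namespace: "karpenter",
		Subsystem: "nodepool",
		Name:      "limited",
		Help:      "1 if the NodePool cannot provision additional capacity because a limit has been reached.",
	},
	[]string{"nodepool"},
)

func init() {
	prometheus.MustRegister(nodePoolLimited)
}

// SetLimited would be called from the provisioning loop after each limit check.
func SetLimited(nodePool string, limited bool) {
	value := 0.0
	if limited {
		value = 1.0
	}
	nodePoolLimited.WithLabelValues(nodePool).Set(value)
}
```

An alert could then fire whenever the gauge sits at 1 for a few minutes, rather than guessing at a utilization threshold.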

Another option might be to alert on pods failing to schedule, but that would likely produce its own issues given the multitude of ways pods can fail to schedule.

How important is this feature to you?

Fairly important operational consideration.

Related issue around how hard it is to tell if you've hit the provisioner limit: https://github.com/aws/karpenter-core/issues/686

gfcroft commented 10 months ago

Note: provisioners are being changed to "nodepools" - see: https://karpenter.sh/docs/upgrading/v1beta1-migration/#provisioner---nodepool

@sidewinder12s it seems that we could create a metric on limits being reached whenever this limit function returns an error: https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/apis/v1beta1/nodepool.go

Of course, that would mean a node launch would have to fail at least once before any metric was produced that you could alert on - is this what you were thinking of?

Additionally, would it be particularly helpful to have more specific metrics on the exact limit that was reached? e.g. a count of memory-limit exceedances for a node pool, alongside an all-encompassing "limit exceeded" count for that node pool.
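
As a sketch only (the metric name, labels, and call site here are assumptions on my part, not existing code), that could look roughly like:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical counter: incremented each time the limit check blocks provisioning.
// The resource_type label carries the specific limit that was hit (cpu, memory, ...);
// summing across it yields the all-encompassing "limit exceeded" count per node pool.
var limitExceeded = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "karpenter",
		Subsystem: "nodepool",
		Name:      "limit_exceeded_total",
		Help:      "Number of times provisioning was blocked because a NodePool limit was exceeded.",
	},
	[]string{"nodepool", "resource_type"},
)

func init() {
	prometheus.MustRegister(limitExceeded)
}

// RecordLimitExceeded would be called wherever the limit check linked above returns an error.
func RecordLimitExceeded(nodePool, resourceType string) {
	limitExceeded.WithLabelValues(nodePool, resourceType).Inc()
}
```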

ellistarn commented 10 months ago

I'd love to pop a status condition for this, and then generate metrics and events from the status condition.

Prototype: https://github.com/awslabs/operatorpkg/pull/7
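
For the general shape, a minimal sketch using the standard metav1.Condition helpers - the "Limited" condition type and reason are placeholders, and the real mechanism would come from the prototype above:

```go
package nodepool

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// markLimited records a hypothetical "Limited" condition on a NodePool's status.
// Metrics and events could then be generated generically from any such condition.
func markLimited(conditions *[]metav1.Condition, observedGeneration int64, resource string) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:               "Limited", // placeholder condition type
		Status:             metav1.ConditionTrue,
		Reason:             "LimitExceeded", // placeholder reason
		Message:            fmt.Sprintf("cannot provision: %s limit reached", resource),
		ObservedGeneration: observedGeneration,
	})
}
```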

gfcroft commented 10 months ago

@ellistarn sounds interesting - presumably you could use this new framework to handle the related issue (https://github.com/kubernetes-sigs/karpenter/issues/705) too? Looking over it briefly, it seems it could work well here.

ellistarn commented 10 months ago

Yes. I'd like to see what about karpenter's operations can't be modeled in this way. I'm hopeful that we can cover almost everything with this mechanism.

sidewinder12s commented 10 months ago

> Note: provisioners are being changed to "nodepools" - see: https://karpenter.sh/docs/upgrading/v1beta1-migration/#provisioner---nodepool
>
> @sidewinder12s it seems that we could create a metric on limits being reached whenever this limit function returns an error: https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/apis/v1beta1/nodepool.go
>
> Of course, that would mean a node launch would have to fail at least once before any metric was produced that you could alert on - is this what you were thinking of?
>
> Additionally, would it be particularly helpful to have more specific metrics on the exact limit that was reached? e.g. a count of memory-limit exceedances for a node pool, alongside an all-encompassing "limit exceeded" count for that node pool.

Yes, I was hoping for an easier-to-consume metric produced by Karpenter saying "I am blocked from launching more capacity".

A lot of the current suggestions for alerting on a condition like this (unschedulable pods, for example) rely on other systems to produce the metrics, which:

  1. Does not work on EKS, since EKS does not expose scheduler metrics
  2. Is ambiguous, since Karpenter is not the only cause of unschedulable errors

I think having it broken out by which limit was breached could be useful, though at scale the cardinality of those metrics might be a concern.

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  - After 90d of inactivity, lifecycle/stale is applied
  - After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  - After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  - Mark this issue as fresh with /remove-lifecycle stale
  - Close this issue with /close
  - Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

den-is commented 6 months ago

We have dozens of clusters and thousands of nodes. Such a simple "limit reached" metric would make life so much easier. <3

sidewinder12s commented 6 months ago

/remove-lifecycle stale

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  - After 90d of inactivity, lifecycle/stale is applied
  - After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  - After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  - Mark this issue as fresh with /remove-lifecycle stale
  - Close this issue with /close
  - Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

sidewinder12s commented 3 months ago

/remove-lifecycle stale

k8s-triage-robot commented 4 weeks ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  - After 90d of inactivity, lifecycle/stale is applied
  - After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  - After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  - Mark this issue as fresh with /remove-lifecycle stale
  - Close this issue with /close
  - Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

sidewinder12s commented 4 weeks ago

/remove-lifecycle stale