kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

Karpenter Operational Monitoring #1692

Open · martinsmatthews opened this issue 5 hours ago

martinsmatthews commented 5 hours ago

Description

What problem are you trying to solve?

We've been running Karpenter in production for the last year, and while it is very stable and saving us a lot of money, it is hard to distinguish between expected behaviour, degradation, and outright failure with the metrics Karpenter currently exposes.

Some example use cases:

We use weighted NodePools for our gameservers because we prefer C7i to C6i: C7i instances can be run "hotter", i.e. at higher CPU utilization, without framerate degradation, so we can run fewer of them and save money even though they are on paper more expensive. Assigning NodePool weights (sketched below) is the only option we are aware of for getting Karpenter to pick those nominally more expensive instance types. When Karpenter tries to spin up a C7i (or any other instance type) and there is no availability, it destroys the NodeClaim, increments the karpenter_nodeclaims_terminated metric with the insufficient_capacity label, and then creates a new NodeClaim for the lower-weighted instance type; an instance spins up and all is well.
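For reference, a minimal sketch of the weighting approach (not our exact config), assuming the karpenter.sh/v1beta1 NodePool API and the AWS provider's karpenter.k8s.aws/instance-family label; fields not relevant to weighting (nodeClassRef, limits, etc.) are omitted:

```yaml
# Higher-weight NodePools are considered first; Karpenter falls back to the
# lower-weight pool when the preferred one cannot be satisfied.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gameservers-c7i
spec:
  weight: 100                      # preferred, despite the higher on-paper price
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["c7i"]
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gameservers-c6i
spec:
  weight: 50                       # fallback when C7i capacity is unavailable
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["c6i"]
```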

When we can no longer spin up nodes in a region because we have hit our CPU or EBS quota, Karpenter gets a quota-limit error when it tries to launch an instance. It then destroys the NodeClaim, increments karpenter_nodeclaims_terminated with the insufficient_capacity label, creates a new NodeClaim, hits the same error, and repeats.

We have also seen, e.g. in apse2 az-2a, very low availability of generation 6 and 7 C and M class instances. This manifests very similarly to the above, with a lot of NodeClaims terminated for insufficient_capacity, but it can be hidden because we also see insufficient_capacity in the other zones of the same cluster while falling back to whatever instance types are available in that zone.

We see similar issues with liveness. Some of our clusters run in Local Zones that can take a while to spin up nodes, especially SAE1, where the standard EKS bootstrap can take 10+ minutes due to the round trip to us-east-1 and back. There we frequently see NodeClaims terminated for liveness issues, and then the next NodeClaim comes up fine, if a little slowly. Especially during node scale-up as we move into peak usage, this is hard to distinguish from a total failure to create EC2 instances, e.g. an AWS outage of the EC2 API, which has hit us in the past.

What we would like

There may already be a way to do this with the current metrics, but we haven't found one so far. What we would like are very clear metrics, or a very clear method using the existing metrics, that indicate when there is an actual problem creating nodes and whether it is a degradation or a total failure.

So falling back through weighted NodePools would not be an error, but finding no appropriate NodePool after trying all weighted/matching ones would be. A single liveness failure would not be an error; multiple would. A lack of availability of a certain instance type would not be an error; hitting a quota limit would be, etc.
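As a rough illustration of the distinction we are after, here is a Prometheus alerting-rule sketch built only on karpenter_nodeclaims_terminated. It assumes the termination reason is exposed as a reason label with values like insufficient_capacity and liveness (adjust metric and label names to your Karpenter version), and the thresholds are placeholders:

```yaml
groups:
  - name: karpenter-nodeclaim-health
    rules:
      - alert: KarpenterSustainedInsufficientCapacity
        # Occasional insufficient_capacity terminations are expected while
        # falling back through weighted NodePools; a sustained rate across
        # every NodePool suggests fallback is exhausted or a quota has been hit.
        expr: |
          sum(rate(karpenter_nodeclaims_terminated{reason="insufficient_capacity"}[15m])) > 0.2
        for: 15m
        labels:
          severity: warning
      - alert: KarpenterRepeatedLivenessFailures
        # A single liveness termination is normal in slow-bootstrap Local Zones;
        # repeated terminations point at a real provisioning problem.
        expr: |
          sum(rate(karpenter_nodeclaims_terminated{reason="liveness"}[30m])) > 0.1
        for: 30m
        labels:
          severity: critical
```

The problem, as described above, is that rate thresholds like these cannot cleanly separate "working fallback" from "quota exhaustion", which is exactly the gap we would like first-class metrics to close.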

For example, we could have a metric indicating that Karpenter needs to add a node for provisioning (i.e. deployments are scaling up and more nodes are required), possibly with some sort of UUID, and then tie all the NodeClaims created to fulfill that provisioning need back to that UUID, so we can clearly see that a given cluster/region/zone is taking X attempts to fulfill a need for a new node.
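To make the proposal concrete, a purely hypothetical sketch of how such a correlation could be consumed; none of these label names (provisioning_id in particular) exist in Karpenter today, and the metric name is only a stand-in:

```yaml
# Hypothetical: if each provisioning need carried a provisioning_id that was
# propagated to the NodeClaims created for it, a recording rule could expose
# "attempts per provisioning need" directly.
groups:
  - name: karpenter-provisioning-attempts
    rules:
      - record: karpenter:nodeclaim_attempts_per_provisioning_need:count
        expr: |
          count by (provisioning_id) (karpenter_nodeclaims_created{provisioning_id!=""})
```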

Some documentation improvements would also be really useful, such as a section on how to monitor Karpenter in production; there are a number of threads in the Karpenter Slack asking exactly this question. It would also be great to have more detailed explanations of the metrics. For example, from the current docs:

karpenter_provisioner_scheduling_duration_seconds Duration of scheduling process in seconds.

This doesn't explain any more than the metric name already does, so it doesn't really help a Karpenter user understand how to use this metric for monitoring.
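For instance, if karpenter_provisioner_scheduling_duration_seconds is a Prometheus histogram (the usual convention for *_duration_seconds metrics, though the docs do not say), a documented example along these lines would already help:

```yaml
groups:
  - name: karpenter-scheduling
    rules:
      - record: karpenter:scheduling_duration_seconds:p95
        # 95th percentile scheduling latency over the last 5 minutes.
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(karpenter_provisioner_scheduling_duration_seconds_bucket[5m])))
```

Even one worked example like this per metric, plus a note on what a healthy range looks like, would go a long way toward helping users monitor Karpenter in production.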

I'm happy to discuss this in Slack and help out with designing a potential solution/options etc.

How important is this feature to you?

k8s-ci-robot commented 5 hours ago

This issue is currently awaiting triage.

If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
martinsmatthews commented 5 hours ago

@jmdeal we discussed this a month or so back - sorry, it took me a while to get this written down - had to get hit by an issue in APSE2 to really narrow my focus on what we wanted/what we saw as the current gaps!