Closed garvinp-stripe closed 6 days ago
Logging:
Could not schedule pod, incompatible with ...
This log structure grows with the number of node pools. Because everything is packed into a single message, it gets out of hand pretty quickly. It would be good to restructure this so it is easier to read.
level: DEBUG
logger: controller.disruption
message: discovered subnets
subnets:...
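To illustrate one possible restructuring of the per-nodepool "incompatible with ..." message: rather than a single message that grows with the number of node pools, the failures could be grouped by reason and emitted as one summary. A minimal sketch in Go; the types and names here are illustrative, not Karpenter's actual code:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// incompatibility is a hypothetical record of why one node pool was rejected.
type incompatibility struct {
	NodePool string
	Reason   string
}

// aggregate collapses per-nodepool scheduling failures into a single summary
// string, so one log line replaces N lines that grow with the number of pools.
func aggregate(incs []incompatibility) string {
	byReason := map[string][]string{}
	for _, inc := range incs {
		byReason[inc.Reason] = append(byReason[inc.Reason], inc.NodePool)
	}
	reasons := make([]string, 0, len(byReason))
	for r := range byReason {
		reasons = append(reasons, r)
	}
	sort.Strings(reasons) // deterministic output ordering
	parts := make([]string, 0, len(reasons))
	for _, r := range reasons {
		pools := byReason[r]
		sort.Strings(pools)
		parts = append(parts, fmt.Sprintf("%s: %d pool(s) (%s)", r, len(pools), strings.Join(pools, ", ")))
	}
	return strings.Join(parts, "; ")
}

func main() {
	fmt.Println(aggregate([]incompatibility{
		{NodePool: "pool-a", Reason: "incompatible taints"},
		{NodePool: "pool-b", Reason: "incompatible taints"},
		{NodePool: "pool-c", Reason: "insufficient capacity"},
	}))
}
```

With this shape, a cluster with fifty node pools and three distinct rejection reasons produces three summary clauses instead of fifty lines.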
unable to resolve instance types, resolving node class,.....
These messages make sense individually, but they add up to a lot of noise. If we could aggregate them, or run preflight checks in a singleton, that might help (this likely requires larger structural changes).
Metrics:
Adding another example:
level: ERROR
logger: controller.nodeclaim.consistency
message: check failed, can't drain node, PDB {PDB} is blocking evictions
I understand that there are many controllers doing different things, but when it comes to debugging why something is blocking draining, it's pretty confusing to find those logs in a consistency controller, which also doesn't have consistent fields such as controller and controllerKind.
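As an illustration of what consistent fields could look like, here is a sketch using Go's standard log/slog package. The field set (controller, controllerKind, reason, pdb) and all values are hypothetical, not Karpenter's actual logger; the point is only that every controller would emit the same identifying keys:

```go
package main

import (
	"bytes"
	"fmt"
	"log/slog"
)

// logLine emits one structured error with a consistent set of identifying
// fields and returns the rendered line. The timestamp attribute is dropped
// so the output is deterministic for this demo.
func logLine() string {
	var buf bytes.Buffer
	h := slog.NewTextHandler(&buf, &slog.HandlerOptions{
		ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
			if a.Key == slog.TimeKey {
				return slog.Attr{} // discard the time attribute
			}
			return a
		},
	})
	// Hypothetical field set: if every controller emitted the same keys,
	// "can't drain" errors would be findable regardless of which
	// controller happened to log them.
	slog.New(h).Error("check failed, can't drain node",
		"controller", "nodeclaim.consistency",
		"controllerKind", "NodeClaim",
		"reason", "PDBBlockingEviction",
		"pdb", "default/my-pdb",
	)
	return buf.String()
}

func main() {
	fmt.Print(logLine())
}
```

With a stable controller/controllerKind pair on every line, a log query for the drain failure no longer depends on knowing which controller performs the check.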
level: INFO
logger: controller.disruption
message: disrupting via consolidation delete, terminating 1 candidates {instance}
time: 2024-02-27T21:42:23.440Z
This is not actually correct: this wasn't due to a consolidation delete. Karpenter didn't initiate this node's deletion; the node failed its own health check and was deleted by another controller. This log is super confusing because it makes it appear that the host was consolidated when in reality it wasn't.
We're facing the same issue: finding the blocking PDB in logs and events is not straightforward. It would be good if this were available as a metric.
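If this were exposed as a metric, a per-node gauge labeled with the blocking PDB would make the condition queryable and alertable. Below is a sketch of the Prometheus text-format sample such a gauge could produce; the metric name karpenter_node_eviction_blocked and its labels are hypothetical (Karpenter does not expose this today, which is exactly the request here):

```go
package main

import "fmt"

// blockedEvictionMetric renders a Prometheus-style sample for a hypothetical
// gauge: 1 while a PDB is blocking eviction on the node, 0 otherwise.
// The metric name and labels are illustrative, not an existing Karpenter metric.
func blockedEvictionMetric(node, pdb string, blocked bool) string {
	v := 0
	if blocked {
		v = 1
	}
	return fmt.Sprintf("karpenter_node_eviction_blocked{node=%q,pdb=%q} %d", node, pdb, v)
}

func main() {
	fmt.Println(blockedEvictionMetric("node-1", "default/my-pdb", true))
}
```

An alert on this hypothetical gauge (for example, value == 1 for fifteen minutes) would surface stuck drains and name the responsible PDB without any log spelunking.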
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- lifecycle/stale is applied
- After lifecycle/stale was applied, lifecycle/rotten is applied
- After lifecycle/rotten was applied, the issue is closed
You can:
- /remove-lifecycle stale
- /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- lifecycle/stale is applied
- After lifecycle/stale was applied, lifecycle/rotten is applied
- After lifecycle/rotten was applied, the issue is closed
You can:
- /remove-lifecycle rotten
- /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- lifecycle/stale is applied
- After lifecycle/stale was applied, lifecycle/rotten is applied
- After lifecycle/rotten was applied, the issue is closed
You can:
- /reopen
- /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Description
What problem are you trying to solve? I wanted to create an issue to track and collect Karpenter logging and metrics pain points from the community. I think it would be good to see where Karpenter's metrics and logging fall short.
How important is this feature to you?