kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

Karpenter logging and metric suggestion #1042

Closed garvinp-stripe closed 6 days ago

garvinp-stripe commented 6 months ago

Description

What problem are you trying to solve? I wanted to create an issue to track and collect Karpenter logging and metrics pain points from the community. I think it would be good to see where Karpenter's metrics and logging fall short.

How important is this feature to you?

garvinp-stripe commented 6 months ago

Logging:

Metrics:

garvinp-stripe commented 6 months ago

Adding another example:

    level: ERROR
    logger: controller.nodeclaim.consistency
    message: check failed, can't drain node, PDB {PDB} is blocking evictions

I understand that there are many controllers doing different things, but when it comes to debugging why something is blocking draining, it's pretty confusing to find the logs in a consistency controller, which also doesn't have consistent fields such as controller and controllerKind.
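To make the "consistent fields" point concrete, here is a minimal sketch of a logger that stamps every record with the same `controller` and `controllerKind` fields. It uses Go's standard `log/slog` package as a stand-in, not Karpenter's actual logging setup. The helper names (`newControllerLogger`, `drainBlockedRecord`), the `pdb` field, and the example values are assumptions made up for illustration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log/slog"
)

// newControllerLogger returns a logger that attaches the same identifying
// fields to every record it emits. Hypothetical helper, not Karpenter code.
func newControllerLogger(h slog.Handler, controller, controllerKind string) *slog.Logger {
	return slog.New(h).With("controller", controller, "controllerKind", controllerKind)
}

// drainBlockedRecord logs one "drain blocked" error as JSON into a buffer and
// returns the decoded record so the resulting fields can be inspected.
func drainBlockedRecord(controller, controllerKind, pdb string) (map[string]any, error) {
	var buf bytes.Buffer
	logger := newControllerLogger(slog.NewJSONHandler(&buf, nil), controller, controllerKind)

	// The blocking PDB is a structured field rather than text inside the message.
	logger.Error("can't drain node, PDB is blocking evictions", "pdb", pdb)

	rec := map[string]any{}
	if err := json.Unmarshal(buf.Bytes(), &rec); err != nil {
		return nil, err
	}
	return rec, nil
}

func main() {
	// "default/example-pdb" is a placeholder value, not taken from a real cluster.
	rec, err := drainBlockedRecord("nodeclaim.consistency", "NodeClaim", "default/example-pdb")
	if err != nil {
		panic(err)
	}
	fmt.Println(rec["controller"], rec["controllerKind"], rec["pdb"])
}
```

With fields like these on every record, a log query could filter on `controllerKind` or `pdb` no matter which controller emitted the line.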

garvinp-stripe commented 5 months ago

    level: INFO
    logger: controller.disruption
    message: disrupting via consolidation delete, terminating 1 candidates {instance}
    time: 2024-02-27T21:42:23.440Z

This is not actually correct: this wasn't a consolidation delete. Karpenter didn't initiate this node's deletion; the node failed its own health check and was deleted by another controller. The log is very confusing because it makes it appear that the host was consolidated when in reality it wasn't.

rkilingr commented 5 months ago

> Adding another example:
>
>     level: ERROR
>     logger: controller.nodeclaim.consistency
>     message: check failed, can't drain node, PDB {PDB} is blocking evictions
>
> I understand that there are many controllers doing different things, but when it comes to debugging why something is blocking draining, it's pretty confusing to find the logs in a consistency controller, which also doesn't have consistent fields such as controller and controllerKind.

We're facing the same issue: finding the blocking PDB in logs and events is not straightforward. It would be good if this were available as a metric.
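As a rough sketch of what such a metric could track, the snippet below counts blocked evictions per PDB, keyed by namespace and name. It is a plain in-memory counter standing in for a labeled Prometheus counter. The type and method names are hypothetical, and nothing here reflects how Karpenter actually records metrics.

```go
package main

import (
	"fmt"
	"sync"
)

// pdbKey identifies a PodDisruptionBudget by namespace and name.
type pdbKey struct {
	Namespace string
	Name      string
}

// blockedEvictions counts how many times each PDB blocked an eviction.
// Hypothetical stand-in for a counter with "namespace" and "name" labels.
type blockedEvictions struct {
	mu     sync.Mutex
	counts map[pdbKey]int
}

func newBlockedEvictions() *blockedEvictions {
	return &blockedEvictions{counts: make(map[pdbKey]int)}
}

// Inc records one blocked eviction for the given PDB.
func (b *blockedEvictions) Inc(namespace, name string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.counts[pdbKey{Namespace: namespace, Name: name}]++
}

// Get returns the number of blocked evictions recorded for the given PDB.
func (b *blockedEvictions) Get(namespace, name string) int {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.counts[pdbKey{Namespace: namespace, Name: name}]
}

func main() {
	m := newBlockedEvictions()

	// Placeholder PDB names; a drain loop would call Inc each time an
	// eviction is rejected because of a PDB.
	m.Inc("default", "example-pdb")
	m.Inc("default", "example-pdb")
	m.Inc("kube-system", "other-pdb")

	fmt.Println(m.Get("default", "example-pdb"), m.Get("kube-system", "other-pdb"))
}
```

A per-PDB counter like this would let an alert or dashboard show which budget is holding up a drain without searching logs or events.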

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 6 days ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

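This bot triages issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with Issue Triage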

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 6 days ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/karpenter/issues/1042#issuecomment-2308767640):

>The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
>This bot triages issues according to the following rules:
>- After 90d of inactivity, `lifecycle/stale` is applied
>- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
>- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
>You can:
>- Reopen this issue with `/reopen`
>- Mark this issue as fresh with `/remove-lifecycle rotten`
>- Offer to help out with [Issue Triage][1]
>
>Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
>/close not-planned
>
>[1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.