Closed garvinp-stripe closed 6 days ago
Logging:
Could not schedule pod, incompatible with ...
This log structure grows with the number of node pools. Because everything is packed into a single message, it gets out of hand pretty quickly. It would be good to restructure this so it is easier to read.
level: DEBUG
logger: controller.disruption
message: discovered subnets
subnets:...
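To illustrate one possible restructuring of the per-nodepool "incompatible with ..." message: rather than a single message that grows with the number of node pools, the failures could be grouped by reason and emitted as one summary. A minimal sketch in Go; the types and names here are illustrative, not Karpenter's actual code:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// incompatibility is a hypothetical record of why one node pool was rejected.
type incompatibility struct {
	NodePool string
	Reason   string
}

// aggregate collapses per-nodepool scheduling failures into a single summary
// string, so one log line replaces N lines that grow with the number of pools.
func aggregate(incs []incompatibility) string {
	byReason := map[string][]string{}
	for _, inc := range incs {
		byReason[inc.Reason] = append(byReason[inc.Reason], inc.NodePool)
	}
	reasons := make([]string, 0, len(byReason))
	for r := range byReason {
		reasons = append(reasons, r)
	}
	sort.Strings(reasons) // deterministic output ordering
	parts := make([]string, 0, len(reasons))
	for _, r := range reasons {
		pools := byReason[r]
		sort.Strings(pools)
		parts = append(parts, fmt.Sprintf("%s: %d pool(s) (%s)", r, len(pools), strings.Join(pools, ", ")))
	}
	return strings.Join(parts, "; ")
}

func main() {
	fmt.Println(aggregate([]incompatibility{
		{NodePool: "pool-a", Reason: "incompatible taints"},
		{NodePool: "pool-b", Reason: "incompatible taints"},
		{NodePool: "pool-c", Reason: "insufficient capacity"},
	}))
}
```

With this shape, a cluster with fifty node pools and three distinct rejection reasons produces three summary clauses instead of fifty lines.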
unable to resolve instance types, resolving node class,.....
These messages make sense individually, but they add up to a lot of noise. If we could aggregate them, or run preflight checks in a singleton, that might help (this likely requires larger structural changes).
Metrics:
Adding another example:
level: ERROR
logger: controller.nodeclaim.consistency
message: check failed, can't drain node, PDB {PDB} is blocking evictions
I understand that there are many controllers doing different things, but when it comes to debugging why something is blocking draining, it's pretty confusing to find those logs in a consistency controller, which also doesn't have consistent fields such as controller and controllerKind.
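As an illustration of what consistent fields could look like, here is a sketch using Go's standard log/slog package. The field set (controller, controllerKind, reason, pdb) and all values are hypothetical, not Karpenter's actual logger; the point is only that every controller would emit the same identifying keys:

```go
package main

import (
	"bytes"
	"fmt"
	"log/slog"
)

// logLine emits one structured error with a consistent set of identifying
// fields and returns the rendered line. The timestamp attribute is dropped
// so the output is deterministic for this demo.
func logLine() string {
	var buf bytes.Buffer
	h := slog.NewTextHandler(&buf, &slog.HandlerOptions{
		ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
			if a.Key == slog.TimeKey {
				return slog.Attr{} // discard the time attribute
			}
			return a
		},
	})
	// Hypothetical field set: if every controller emitted the same keys,
	// "can't drain" errors would be findable regardless of which
	// controller happened to log them.
	slog.New(h).Error("check failed, can't drain node",
		"controller", "nodeclaim.consistency",
		"controllerKind", "NodeClaim",
		"reason", "PDBBlockingEviction",
		"pdb", "default/my-pdb",
	)
	return buf.String()
}

func main() {
	fmt.Print(logLine())
}
```

With a stable controller/controllerKind pair on every line, a log query for the drain failure no longer depends on knowing which controller performs the check.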
level: INFO
logger: controller.disruption
message: disrupting via consolidation delete, terminating 1 candidates {instance}
time: 2024-02-27T21:42:23.440Z
This is not actually correct: this wasn't due to a consolidation delete. Karpenter didn't initiate this node's deletion; the node failed its own health check and was deleted by another controller. This log is super confusing because it makes it appear that the host was consolidated when in reality it wasn't.
We're facing the same issue: finding the blocking PDB in logs and events is not straightforward. It would be good if this were available as a metric.
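If this were exposed as a metric, a per-node gauge labeled with the blocking PDB would make the condition queryable and alertable. Below is a sketch of the Prometheus text-format sample such a gauge could produce; the metric name karpenter_node_eviction_blocked and its labels are hypothetical (Karpenter does not expose this today, which is exactly the request here):

```go
package main

import "fmt"

// blockedEvictionMetric renders a Prometheus-style sample for a hypothetical
// gauge: 1 while a PDB is blocking eviction on the node, 0 otherwise.
// The metric name and labels are illustrative, not an existing Karpenter metric.
func blockedEvictionMetric(node, pdb string, blocked bool) string {
	v := 0
	if blocked {
		v = 1
	}
	return fmt.Sprintf("karpenter_node_eviction_blocked{node=%q,pdb=%q} %d", node, pdb, v)
}

func main() {
	fmt.Println(blockedEvictionMetric("node-1", "default/my-pdb", true))
}
```

An alert on this hypothetical gauge (for example, value == 1 for fifteen minutes) would surface stuck drains and name the responsible PDB without any log spelunking.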
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- lifecycle/stale is applied
- After lifecycle/stale was applied, lifecycle/rotten is applied
- After lifecycle/rotten was applied, the issue is closed
You can:
- /remove-lifecycle stale
- /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- lifecycle/stale is applied
- After lifecycle/stale was applied, lifecycle/rotten is applied
- After lifecycle/rotten was applied, the issue is closed
You can:
- /remove-lifecycle rotten
- /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- lifecycle/stale is applied
- After lifecycle/stale was applied, lifecycle/rotten is applied
- After lifecycle/rotten was applied, the issue is closed
You can:
- /reopen
- /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Description
What problem are you trying to solve? I wanted to create an issue to track and collect Karpenter logging and metrics pain points from the community. I think it would be good to see where Karpenter's metrics and logging fall short.
How important is this feature to you?