kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0
614 stars 203 forks source link

Clearer documentation around terminationGracePeriod #1646

Closed andrewhibbert closed 3 weeks ago

andrewhibbert commented 2 months ago

Description

Observed Behavior:

https://github.com/kubernetes-sigs/karpenter/pull/916#issuecomment-2218750570

The TGP timeout applies to all forms of node disruption. It does not allow disruption to be initiated for consolidation if the node is blocked by a do-not-disrupt pod or a pod with a blocked PDB.

So it only applies for Empty and Drift?

Expected Behavior:

Clearer documentation

Reproduction Steps (Please include YAML):

Versions:

k8s-ci-robot commented 2 months ago

This issue is currently awaiting triage.

If Karpenter contributors determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
engedaam commented 1 month ago

That comment is only saying the nodes will not be disrupted until all pods that contain do-not-disrupt annotation are not on a disreputable node. What better documentation would you like to see? https://karpenter.sh/docs/concepts/disruption/#terminationgraceperiod

engedaam commented 1 month ago

/kind support /remove-kind bug

andrewhibbert commented 1 month ago

I was expecting the documentation to show the behaviour in regards to different forms of disruption and how do-not-disrupt annotations or a pod with a blocked PDB now affect each.

The github comment https://github.com/kubernetes-sigs/karpenter/pull/916#issuecomment-2218750570, seems to suggest that a node will not be drained if pods are blocked using do-not-disrupt or PDB, whereas Drift and Eviction would

I think it would be useful to guarantee that all our nodes will roll on a Drift despite blocking do-not-disrupt / PDBs but feel they should be respected for general consolidations or at least on a longer timeframe. So trying to figure out what it does before applying changes

andrewhibbert commented 1 month ago

From what I've seen the above is the case, I've seen tests saying "does not consolidate nodes with karpenter.sh/do-not-disrupt on pods when the NodePool's TerminationGracePeriod is not nil" and "does not consolidate nodes with pods with blocking PDBs when the NodePool's TerminationGracePeriod is not nil" and "should drift nodes that have pods with the karpenter.sh/do-not-disrupt annotation when the NodePool's TerminationGracePeriod is not nil" and also "should consider candidates that have do-not-disrupt pods scheduled with a terminationGracePeriod set for eventual disruption" and "should consider candidates that have PDB-blocked pods scheduled with a terminationGracePeriod set for eventual disruption". I believe that only drift is an eventual disruption https://github.com/kubernetes-sigs/karpenter/blob/140d35b0a4bef7e8914d3cec7efd9aa10945ad52/pkg/controllers/disruption/controller.go#L76

Tests I've seen:

does not consolidate nodes with karpenter.sh/do-not-disrupt on pods when the NodePool's TerminationGracePeriod is not nil

https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/disruption/consolidation_test.go#L2527

does not consolidate nodes with pods with blocking PDBs when the NodePool's TerminationGracePeriod is not nil

https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/disruption/consolidation_test.go#L2575

should drift nodes that have pods with the karpenter.sh/do-not-disrupt annotation when the NodePool's TerminationGracePeriod is not nil

https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/disruption/eventual_test.go#L588

should not consider candidates that have do-not-disrupt pods scheduled and no terminationGracePeriod

https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/disruption/suite_test.go#L817

should consider candidates that have do-not-disrupt pods scheduled with a terminationGracePeriod set for eventual disruption

https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/disruption/suite_test.go#L922

should consider candidates that have PDB-blocked pods scheduled with a terminationGracePeriod set for eventual disruption

https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/disruption/suite_test.go#L951

should not consider candidates that have do-not-disrupt pods scheduled with a terminationGracePeriod set for graceful disruption

https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/disruption/suite_test.go#L983

should not consider candidates that have PDB-blocked pods scheduled with a terminationGracePeriod set for graceful disruption

https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/disruption/suite_test.go#L1012

should not consider candidates that have do-not-disrupt pods scheduled without a terminationGracePeriod set for eventual disruption

https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/disruption/suite_test.go#L1048

should not consider candidates that have PDB-blocked pods scheduled without a terminationGracePeriod set for eventual disruption

https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/disruption/suite_test.go#L1076

should not consider candidates that have do-not-disrupt pods without a terminationGracePeriod set for graceful disruption

https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/disruption/suite_test.go#L1331

should not consider candidates that have fully blocking PDBs without a terminationGracePeriod set for graceful disruption

https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/disruption/suite_test.go#L1362

github-actions[bot] commented 1 month ago

This issue has been inactive for 14 days. StaleBot will close this stale issue after 14 more days of inactivity.