keikoproj / lifecycle-manager

Graceful AWS scaling event on Kubernetes using lifecycle hooks

cordon() doesn't return any error even when the cordon operation failed #164

Closed ZihanJiang96 closed 9 months ago

ZihanJiang96 commented 10 months ago

Is this a BUG REPORT or FEATURE REQUEST?:

BUG REPORT

What happened:

  1. lifecycle-manager calls the kubectl library's RunCordonOrUncordon() func (here is the link to the kubectl code).
  2. The node cordon fails because the api-server request times out:
    time="2023-11-14T04:40:15Z" level=error msg="failed to drain node ip-10-184-115-74.us-west-2.compute.internal  
    error: error cordoning node: cordon error: 
    Patch \"https://172.20.0.1:443/api/v1/nodes/ip-10-184-115-74.us-west-2.compute.internal\": dial tcp 172.20.0.1:443: i/o timeout "

    Internally, the kubectl library has already set the node object's Spec.Unschedulable to true, even though the patch failed.

  3. lifecycle-manager retries the cordon call, but with the same in-memory node object pointer, which already has node.Spec.Unschedulable = true.
  4. On this retry, the kubectl library sees Unschedulable already set to true, concludes there is nothing to do, and returns nil directly (see the sketch after this list).
  5. lifecycle-manager believes the cordon succeeded and starts draining the node, even though the node is actually still schedulable.
  6. The old pods get evicted and lifecycle-manager tells the ASG to terminate the node. At the same time, new pods get scheduled onto this node.
  7. The node is terminated, and the new pods are gone.
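
A minimal Go sketch of the failure mode in steps 2-4 (this is not the actual kubectl or lifecycle-manager source; the `cordon` helper below only imitates the behaviour described above): because the first attempt mutates the in-memory node object, the retry short-circuits and reports success even though the api-server was never updated.

```go
package main

import (
	"errors"
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// cordon imitates the relevant behaviour of kubectl's cordon logic: it flips
// Spec.Unschedulable on the passed-in object before the API call, and returns
// nil immediately if the object already looks cordoned.
func cordon(node *corev1.Node, patchSucceeds bool) error {
	if node.Spec.Unschedulable {
		return nil // already cordoned, as far as the in-memory object says
	}
	node.Spec.Unschedulable = true // in-memory mutation happens regardless of the patch result
	if !patchSucceeds {
		return errors.New("cordon error: dial tcp 172.20.0.1:443: i/o timeout")
	}
	return nil
}

func main() {
	node := &corev1.Node{}

	// Attempt 1: the PATCH to the api-server times out and an error is returned,
	// but node.Spec.Unschedulable has already been set to true in memory.
	fmt.Println("attempt 1:", cordon(node, false))

	// Attempt 2: the retry reuses the same pointer, sees Unschedulable == true,
	// and returns nil without ever reaching the api-server. The caller now
	// believes the node is cordoned, although the real node is still schedulable.
	fmt.Println("attempt 2:", cordon(node, false))
}
```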

What you expected to happen:

lifecycle-manager should make sure the cordon operation succeeded before draining the node.
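
One way to enforce this (a hedged sketch, not code from this repo; ensureCordoned and its signature are hypothetical): re-read the node from the api-server after cordoning and refuse to drain unless the server-side object is actually unschedulable.

```go
package drain

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ensureCordoned re-reads the node from the api-server and returns an error if
// the server-side object is still schedulable, so the caller never starts a
// drain based only on the state of a local, possibly stale node object.
func ensureCordoned(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return fmt.Errorf("failed to re-read node %s: %w", nodeName, err)
	}
	if !node.Spec.Unschedulable {
		return fmt.Errorf("node %s is still schedulable, refusing to drain", nodeName)
	}
	return nil
}
```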

How to fix this issue:

Create a copy of the node object and pass the copy to the drainNodeUtil() func.
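
A hedged sketch of that fix (the drainNodeUtil signature shown here is a stand-in, not the actual one from this repo): deep-copy the node before each attempt so a failed cordon cannot leak Spec.Unschedulable = true into the object that later retries see.

```go
package drain

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
)

// drainNodeUtil is a stand-in for lifecycle-manager's internal drain helper;
// the only assumption here is that it receives a *corev1.Node it may mutate.
func drainNodeUtil(client kubernetes.Interface, node *corev1.Node) error {
	// ... cordon the node, then evict its pods ...
	return nil
}

// drainWithRetries hands each attempt its own copy of the node, so a cordon
// failure in one attempt cannot make later attempts think the node is
// already unschedulable.
func drainWithRetries(client kubernetes.Interface, node *corev1.Node, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = drainNodeUtil(client, node.DeepCopy()); err == nil {
			return nil
		}
	}
	return err
}
```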

karimlakhani commented 9 months ago

Nice find!