keikoproj / lifecycle-manager

Graceful AWS scaling event on Kubernetes using lifecycle hooks

cordon() doesn't return any error even when the cordon operation failed #164

Closed ZihanJiang96 closed 9 months ago

ZihanJiang96 commented 10 months ago

Is this a BUG REPORT or FEATURE REQUEST?:

BUG REPORT

What happened:

  1. lifecycle-manager calls the kubectl library's RunCordonOrUncordon() func (here is the link to the kubectl code).
  2. The node cordon fails because the api-server request times out:
    time="2023-11-14T04:40:15Z" level=error msg="failed to drain node ip-10-184-115-74.us-west-2.compute.internal  
    error: error cordoning node: cordon error: 
    Patch \"https://172.20.0.1:443/api/v1/nodes/ip-10-184-115-74.us-west-2.compute.internal\": dial tcp 172.20.0.1:443: i/o timeout "

    Internally, the kubectl library has already set the node object's Spec.Unschedulable to true, even though the patch failed.

  3. lifecycle-manager retries the cordon call, but with the same in-memory node object pointer, which already has node.Spec.Unschedulable = true.
  4. On this retry, the kubectl library sees Unschedulable already set to true, concludes there is nothing to do, and returns nil directly (see the sketch after this list).
  5. lifecycle-manager believes the cordon succeeded and starts draining the node, even though the node is actually still schedulable.
  6. The old pods get evicted and lifecycle-manager tells the ASG to terminate the node. At the same time, new pods get scheduled onto this node.
  7. The node is terminated, and the new pods are gone.
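
A minimal Go sketch of the failure mode in steps 2-4 (this is not the actual kubectl or lifecycle-manager source; the `cordon` helper below only imitates the behaviour described above): because the first attempt mutates the in-memory node object, the retry short-circuits and reports success even though the api-server was never updated.

```go
package main

import (
	"errors"
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// cordon imitates the relevant behaviour of kubectl's cordon logic: it flips
// Spec.Unschedulable on the passed-in object before the API call, and returns
// nil immediately if the object already looks cordoned.
func cordon(node *corev1.Node, patchSucceeds bool) error {
	if node.Spec.Unschedulable {
		return nil // already cordoned, as far as the in-memory object says
	}
	node.Spec.Unschedulable = true // in-memory mutation happens regardless of the patch result
	if !patchSucceeds {
		return errors.New("cordon error: dial tcp 172.20.0.1:443: i/o timeout")
	}
	return nil
}

func main() {
	node := &corev1.Node{}

	// Attempt 1: the PATCH to the api-server times out and an error is returned,
	// but node.Spec.Unschedulable has already been set to true in memory.
	fmt.Println("attempt 1:", cordon(node, false))

	// Attempt 2: the retry reuses the same pointer, sees Unschedulable == true,
	// and returns nil without ever reaching the api-server. The caller now
	// believes the node is cordoned, although the real node is still schedulable.
	fmt.Println("attempt 2:", cordon(node, false))
}
```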

What you expected to happen:

lifecycle-manager should make sure the cordon operation succeeded before draining the node.
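
One way to enforce this (a hedged sketch, not code from this repo; ensureCordoned and its signature are hypothetical): re-read the node from the api-server after cordoning and refuse to drain unless the server-side object is actually unschedulable.

```go
package drain

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ensureCordoned re-reads the node from the api-server and returns an error if
// the server-side object is still schedulable, so the caller never starts a
// drain based only on the state of a local, possibly stale node object.
func ensureCordoned(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return fmt.Errorf("failed to re-read node %s: %w", nodeName, err)
	}
	if !node.Spec.Unschedulable {
		return fmt.Errorf("node %s is still schedulable, refusing to drain", nodeName)
	}
	return nil
}
```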

How to fix this issue:

Create a copy of the node object and pass the copy to the drainNodeUtil() func.
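
A hedged sketch of that fix (the drainNodeUtil signature shown here is a stand-in, not the actual one from this repo): deep-copy the node before each attempt so a failed cordon cannot leak Spec.Unschedulable = true into the object that later retries see.

```go
package drain

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
)

// drainNodeUtil is a stand-in for lifecycle-manager's internal drain helper;
// the only assumption here is that it receives a *corev1.Node it may mutate.
func drainNodeUtil(client kubernetes.Interface, node *corev1.Node) error {
	// ... cordon the node, then evict its pods ...
	return nil
}

// drainWithRetries hands each attempt its own copy of the node, so a cordon
// failure in one attempt cannot make later attempts think the node is
// already unschedulable.
func drainWithRetries(client kubernetes.Interface, node *corev1.Node, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = drainNodeUtil(client, node.DeepCopy()); err == nil {
			return nil
		}
	}
	return err
}
```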

karimlakhani commented 9 months ago

Nice find!