Open jess-belliveau opened 1 year ago
Ah, I should have mentioned - the problem we are facing is the pods are still in terminating state when the underlying node is terminated. We are trying to configure the RU to have a wait to allow the pods to gracefully terminate as part of the drain.
@jess-belliveau configuring drainTimeout
is exactly what the name suggests. If the node drain doesn't complete within the drainTimeout
value, then the rolling upgrade (RU) is marked as failed. So, if the drain command is completed within a second, the node is terminated right after.
If I understand it correctly, you are trying to delay the node termination. You should consider using the postDrain in spec where you can specify the wait before the termination is initiated.
@shreyas-badiger , thanks for response.
If you look at my spec at the start, I actually have set a postDrain.waitSeconds:
postDrain:
waitSeconds: 300
I hadn't even realised this field doesn't appear to work either. I'm not seeing a 300 second pause anywhere.
@jess-belliveau I think the implementation for the postDrain.waitSeconds
is missing.
If you have some bandwidth, can you contribute?
If not, I think you can use the postDrain script.
https://github.com/keikoproj/upgrade-manager/blob/79b38c0290c72c6aa1e1d1a89a9d0b325ee2473b/controllers/script_runner.go#L109
Thanks @shreyas-badiger - I might be able to loop back in the future and see what contributions I can make.
For the time being, we are having promising results with;
"postDrain":
"script": |
count=10; while [ $count -gt 0 ]; do count=`kubectl get pods -A --field-selector spec.nodeName=$INSTANCE_NAME -o jsonpath='{range .items[?(.metadata.ownerReferences[*].kind!="DaemonSet")]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}' | wc -l`; echo "$count pods draining"; sleep 10; done
the only caveat being, we have had to add some binaries to the rolling-upgrade-controller
image - kubectl, wc and sleep.
Is this a BUG REPORT or FEATURE REQUEST?: BUG REPORT
What happened: I am setting strategy.drainTimeout to 1000 seconds but I see the node immediately terminated after the node drain is issued.
What you expected to happen: I expect upgrade-manager to wait 1000 seconds after the drain is issued before terminating the instance.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?: Am I interpreting the spec correctly?
Environment:
Other debugging information (if applicable):