keikoproj / upgrade-manager

Reliable, extensible rolling-upgrades of Autoscaling groups in Kubernetes
Apache License 2.0

strategy.drainTimeout not working as intended? #346

Open jess-belliveau opened 1 year ago

jess-belliveau commented 1 year ago

Is this a BUG REPORT or FEATURE REQUEST?: BUG REPORT

What happened: I am setting strategy.drainTimeout to 1000 seconds, but the node is terminated immediately after the node drain is issued.

What you expected to happen: I expect upgrade-manager to wait 1000 seconds after the drain is issued before terminating the instance.

How to reproduce it (as minimally and precisely as possible):

➜ cat ru-drain.yml
apiVersion: upgrademgr.keikoproj.io/v1alpha1
kind: RollingUpgrade
metadata:
  annotations:
    app.kubernetes.io/managed-by: instance-manager
    instancemgr.keikoproj.io/upgrade-scope: <snip>-instance-manager-platform-apm-us-west-2a
  name: platform-apm-us-west-2a-20220715002858-19
  namespace: instance-manager
spec:
  asgName: <snip>-instance-manager-platform-apm-us-west-2a
  forceRefresh: true
  nodeIntervalSeconds: 10
  postDrain:
    waitSeconds: 300
  postDrainDelaySeconds: 45
  strategy:
    drainTimeout: 1000      # <- this is the field I'm setting
    maxUnavailable: 1
    mode: eager

Anything else we need to know?: Am I interpreting the spec correctly?

Environment:

Other debugging information (if applicable):

jess-belliveau commented 1 year ago

Ah, I should have mentioned - the problem we are facing is that the pods are still in a Terminating state when the underlying node is terminated. We are trying to configure the RU to wait so that the pods can terminate gracefully as part of the drain.

shreyas-badiger commented 1 year ago

@jess-belliveau drainTimeout does exactly what the name suggests: if the node drain doesn't complete within the drainTimeout value, the rolling upgrade (RU) is marked as failed. So if the drain command completes within a second, the node is terminated right after.

If I understand correctly, you are trying to delay the node termination. You should consider using the postDrain field in the spec, where you can specify a wait before termination is initiated.

jess-belliveau commented 1 year ago

@shreyas-badiger, thanks for the response.

If you look at my spec at the start, I have actually set postDrain.waitSeconds:

  postDrain:
    waitSeconds: 300

I hadn't even realised that this field doesn't appear to work either - I'm not seeing a 300-second pause anywhere.

shreyas-badiger commented 1 year ago

@jess-belliveau I think the implementation for postDrain.waitSeconds is missing. If you have some bandwidth, could you contribute? If not, I think you can use the postDrain script: https://github.com/keikoproj/upgrade-manager/blob/79b38c0290c72c6aa1e1d1a89a9d0b325ee2473b/controllers/script_runner.go#L109
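
For anyone landing here later: the postDrain script hook is declared directly in the RollingUpgrade spec. A minimal sketch, assuming the field names from the spec at the top of this issue and the $INSTANCE_NAME variable used in the working config posted below:

  postDrain:
    script: |
      # Runs after the node drain completes and before the instance is
      # terminated; $INSTANCE_NAME holds the name of the drained node.
      echo "drained node: $INSTANCE_NAME"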

jess-belliveau commented 1 year ago

Thanks @shreyas-badiger - I might be able to loop back in the future and see what contributions I can make.

For the time being, we are having promising results with the following:

"postDrain":
  "script": |
    count=10; while [ $count -gt 0 ]; do count=`kubectl get pods -A --field-selector spec.nodeName=$INSTANCE_NAME -o jsonpath='{range .items[?(.metadata.ownerReferences[*].kind!="DaemonSet")]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}' | wc -l`; echo "$count pods draining"; sleep 10; done

The only caveat is that we had to add some binaries to the rolling-upgrade-controller image: kubectl, wc and sleep.
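
For readability, the same wait loop can be split across lines, and a bounded retry count avoids spinning forever if a pod never finishes terminating. This is only a sketch of the config above, under the same assumptions (kubectl and wc available in the controller image, $INSTANCE_NAME set to the drained node's name); the 60-iteration cap is an arbitrary choice, not something upgrade-manager enforces:

"postDrain":
  "script": |
    # Wait (up to ~10 minutes) for all non-DaemonSet pods on the drained
    # node to finish terminating before the instance is terminated.
    retries=60
    while [ "$retries" -gt 0 ]; do
      count=$(kubectl get pods -A --field-selector spec.nodeName=$INSTANCE_NAME \
        -o jsonpath='{range .items[?(.metadata.ownerReferences[*].kind!="DaemonSet")]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}' \
        | wc -l)
      if [ "$count" -eq 0 ]; then
        break
      fi
      echo "$count pods still draining on $INSTANCE_NAME"
      retries=$((retries - 1))
      sleep 10
    done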