keikoproj / upgrade-manager

Reliable, extensible rolling-upgrades of Autoscaling groups in Kubernetes
Apache License 2.0

strategy.drainTimeout not working as intended? #346

Open jess-belliveau opened 1 year ago

jess-belliveau commented 1 year ago

Is this a BUG REPORT or FEATURE REQUEST?: BUG REPORT

What happened: I am setting strategy.drainTimeout to 1000 seconds, but the node is terminated immediately after the node drain is issued.

What you expected to happen: I expect upgrade-manager to wait 1000 seconds after the drain is issued before terminating the instance.

How to reproduce it (as minimally and precisely as possible):

➜ cat ru-drain.yml
apiVersion: upgrademgr.keikoproj.io/v1alpha1
kind: RollingUpgrade
metadata:
  annotations:
    app.kubernetes.io/managed-by: instance-manager
    instancemgr.keikoproj.io/upgrade-scope: <snip>-instance-manager-platform-apm-us-west-2a
  name: platform-apm-us-west-2a-20220715002858-19
  namespace: instance-manager
spec:
  asgName: <snip>-instance-manager-platform-apm-us-west-2a
  forceRefresh: true
  nodeIntervalSeconds: 10
  postDrain:
    waitSeconds: 300
  postDrainDelaySeconds: 45
  strategy:
    drainTimeout: 1000      # <- this is the field I'm setting
    maxUnavailable: 1
    mode: eager

Anything else we need to know?: Am I interpreting the spec correctly?

Environment:

Other debugging information (if applicable):

jess-belliveau commented 1 year ago

Ah, I should have mentioned - the problem we are facing is that the pods are still in a Terminating state when the underlying node is terminated. We are trying to configure the RU to wait so that the pods can terminate gracefully as part of the drain.

shreyas-badiger commented 1 year ago

@jess-belliveau drainTimeout does exactly what the name suggests: if the node drain doesn't complete within the drainTimeout value, the rolling upgrade (RU) is marked as failed. So if the drain command completes within a second, the node is terminated right after.

If I understand correctly, you are trying to delay the node termination. You should consider using the postDrain field in the spec, where you can specify a wait before termination is initiated.

jess-belliveau commented 1 year ago

@shreyas-badiger, thanks for the response.

If you look at my spec at the start, I have actually set postDrain.waitSeconds:

  postDrain:
    waitSeconds: 300

I hadn't even realised that this field doesn't appear to work either - I'm not seeing a 300-second pause anywhere.

shreyas-badiger commented 1 year ago

@jess-belliveau I think the implementation for postDrain.waitSeconds is missing. If you have some bandwidth, could you contribute? If not, I think you can use the postDrain script: https://github.com/keikoproj/upgrade-manager/blob/79b38c0290c72c6aa1e1d1a89a9d0b325ee2473b/controllers/script_runner.go#L109
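
For anyone landing here later: the postDrain script hook is declared directly in the RollingUpgrade spec. A minimal sketch, assuming the field names from the spec at the top of this issue and the $INSTANCE_NAME variable used in the working config posted below:

  postDrain:
    script: |
      # Runs after the node drain completes and before the instance is
      # terminated; $INSTANCE_NAME holds the name of the drained node.
      echo "drained node: $INSTANCE_NAME"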

jess-belliveau commented 1 year ago

Thanks @shreyas-badiger - I might be able to loop back in the future and see what contributions I can make.

For the time being, we are having promising results with the following:

"postDrain":
  "script": |
    count=10; while [ $count -gt 0 ]; do count=`kubectl get pods -A --field-selector spec.nodeName=$INSTANCE_NAME -o jsonpath='{range .items[?(.metadata.ownerReferences[*].kind!="DaemonSet")]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}' | wc -l`; echo "$count pods draining"; sleep 10; done

The only caveat is that we had to add some binaries to the rolling-upgrade-controller image: kubectl, wc and sleep.
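
For readability, the same wait loop can be split across lines, and a bounded retry count avoids spinning forever if a pod never finishes terminating. This is only a sketch of the config above, under the same assumptions (kubectl and wc available in the controller image, $INSTANCE_NAME set to the drained node's name); the 60-iteration cap is an arbitrary choice, not something upgrade-manager enforces:

"postDrain":
  "script": |
    # Wait (up to ~10 minutes) for all non-DaemonSet pods on the drained
    # node to finish terminating before the instance is terminated.
    retries=60
    while [ "$retries" -gt 0 ]; do
      count=$(kubectl get pods -A --field-selector spec.nodeName=$INSTANCE_NAME \
        -o jsonpath='{range .items[?(.metadata.ownerReferences[*].kind!="DaemonSet")]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}' \
        | wc -l)
      if [ "$count" -eq 0 ]; then
        break
      fi
      echo "$count pods still draining on $INSTANCE_NAME"
      retries=$((retries - 1))
      sleep 10
    done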