Lifecycle manager doesn't retry node drain after timeout.

Is this a BUG REPORT or FEATURE REQUEST?: BUG REPORT

What happened:

Upon receiving the node termination event, lifecycle-manager attempts to drain the node. The default timeout for the drain command is 300s. In certain cases, such as

when there are pods with long startup times (OR)
a large number of pods with a low allowed-disruptions count (in a PDB),

the node drain might not succeed in a single attempt, even if drainTimeout is set to a higher value.

Once the drain command times out, the lifecycle-manager will mark the event as ABANDON and proceed toward node termination.

This is risky: pods that were not evicted in time are terminated along with the node, which can disrupt the services hosted on it.

What you expected to happen:

The lifecycle-manager must retry the node-drain operation (defined here) even when there are timeout errors.

How to reproduce it (as minimally and precisely as possible):

Create a Kubernetes dev cluster.
Install lifecycle-manager in the cluster.
Create a sample deployment in one of the instance groups.
Create a blocking PDB (with maxUnavailable set to 0) so that we can induce drain failures.
Terminate one of the ASG instances, either from the console or through the CLI.
Review the lifecycle-manager logs. There will be no retries.

Possible fix: Retry the node drain even when the drain times out. We should also make the number of retry attempts configurable.

// runCommandWithContext runs the command and retries it on any error, up to 3 attempts.
// Assumes retry is github.com/avast/retry-go and log is a logrus-style logger.
func runCommandWithContext(call string, args []string, timeoutSeconds, retryInterval int64) error {
    return retry.Do(
        func() error {
            // Create the timeout context per attempt. A context shared across
            // attempts stays expired after the first timeout, so every later
            // attempt would fail immediately.
            ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutSeconds)*time.Second)
            defer cancel()
            _, err := exec.CommandContext(ctx, call, args...).CombinedOutput()
            return err
        },
        retry.RetryIf(func(err error) bool {
            // Retry on every error, including timeouts.
            if err != nil {
                log.Infoln("retrying drain")
                return true
            }
            return false
        }),
        retry.Attempts(3),
        retry.Delay(time.Duration(retryInterval)*time.Second),
    )
}

And the retries go through:

$ cat lifecycle_logs | grep "retrying"
time="2023-04-27T19:39:56Z" level=info msg="retrying drain"
time="2023-04-27T19:40:26Z" level=info msg="retrying drain"
time="2023-04-27T19:41:26Z" level=info msg="retrying drain"

keikoproj/lifecycle-manager, issue #90