keikoproj / lifecycle-manager

Graceful AWS scaling event on Kubernetes using lifecycle hooks
Apache License 2.0
93 stars 28 forks source link

Lifecycle manager doesn't retry node drain after timeout. #90

Closed shreyas-badiger closed 1 year ago

shreyas-badiger commented 1 year ago

Is this a BUG REPORT or FEATURE REQUEST?: BUG REPORT What happened: Upon receiving the node termination event, lifecycle-manager will attempt to drain the node. The default timeout for drain command is 300s. In certain cases

In such cases, the node-drain might not succeed in one single attempt. (Even if we keep the drainTimeout to a higher value.)

Once the drain command times out, the lifecycle-manager will mark the event as ABANDON and proceed toward node termination.

This is very risky and could cause disruption for services hosted on the node.

What you expected to happen:

The lifecycle-manager must retry the node-drain operation (defined here) even when there are timeout errors.

How to reproduce it (as minimally and precisely as possible):

Possible fix: Retry the node-drain even when there is a drain timeout. Additionally, we should also make the number of retry attempts configurable.

func runCommandWithContext(call string, args []string, timeoutSeconds, retryInterval int64) error {
    // Create a new context and add a timeout to it
    ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutSeconds)*time.Second)
    defer cancel()
    err := retry.Do(
        func() error {
            cmd := exec.CommandContext(ctx, call, args...)
            _, err := cmd.CombinedOutput()
            if err != nil {
                return err
            }
            return nil
        },
        retry.RetryIf(func(err error) bool {
            if err != nil {
                log.Infoln("retrying drain")
                return true
            }
            return false
        }),
        retry.Attempts(3),
        retry.Delay(time.Duration(retryInterval)*time.Second),
    )
    if err != nil {
        return err
    }

    return nil
}

And the retries go through:

╰─ cat lifecycle_logs | grep "retrying"                                                                                                                                                                        ─╯
time="2023-04-27T19:39:56Z" level=info msg="retrying drain"
time="2023-04-27T19:40:26Z" level=info msg="retrying drain"
time="2023-04-27T19:41:26Z" level=info msg="retrying drain"