kubernetes / kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management
https://kops.sigs.k8s.io/
Apache License 2.0

Mistake in validation of Node Termination Handler #16587

Open flipsed opened 5 months ago

flipsed commented 5 months ago

/kind bug

1. What kops version are you running? The command kops version will display this information.

1.28

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

1.28

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

kops1.28.4 replace --force -f /path/to/kops.yaml

5. What happened after the commands executed?

Error: error replacing cluster: spec.cloudProvider.aws.nodeTerminationHandler.enableScheduledEventDraining: Forbidden: scheduled event draining cannot be disabled in Queue Processor mode

6. What did you expect to happen?

I would expect to be able to have enableScheduledEventDraining disabled in the config while in SQS mode. The kops validation runs the following code, which is problematic:

func validateNodeTerminationHandler(cluster *kops.Cluster, spec *kops.NodeTerminationHandlerSpec, fldPath *field.Path) (allErrs field.ErrorList) {
    if spec.IsQueueMode() {
        if spec.EnableSpotInterruptionDraining != nil && !*spec.EnableSpotInterruptionDraining {
            allErrs = append(allErrs, field.Forbidden(fldPath.Child("enableSpotInterruptionDraining"), "spot interruption draining cannot be disabled in Queue Processor mode"))
        }
        if spec.EnableScheduledEventDraining != nil && !*spec.EnableScheduledEventDraining {
            allErrs = append(allErrs, field.Forbidden(fldPath.Child("enableScheduledEventDraining"), "scheduled event draining cannot be disabled in Queue Processor mode"))
        }
        if !fi.ValueOf(spec.EnableRebalanceDraining) && fi.ValueOf(spec.EnableRebalanceMonitoring) {
            allErrs = append(allErrs, field.Forbidden(fldPath.Child("enableRebalanceMonitoring"), "rebalance events can only drain in Queue Processor mode"))
        }
    }
    return allErrs
}
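
If the reasoning above holds, the fix could be as small as dropping the scheduled-event check from the queue-mode branch. Here is a minimal, self-contained sketch of the relaxed validation (the types are hypothetical stand-ins for illustration; the real ones live in k8s.io/kops/pkg/apis/kops):

```go
package main

import "fmt"

// NodeTerminationHandlerSpec is a hypothetical stand-in for the kops type,
// reduced to the fields relevant to this issue.
type NodeTerminationHandlerSpec struct {
	EnableSQSTerminationDraining   *bool
	EnableScheduledEventDraining   *bool
	EnableSpotInterruptionDraining *bool
}

// IsQueueMode mirrors the real helper: Queue Processor (SQS) mode is on
// when enableSQSTerminationDraining is explicitly true.
func (s *NodeTerminationHandlerSpec) IsQueueMode() bool {
	return s.EnableSQSTerminationDraining != nil && *s.EnableSQSTerminationDraining
}

// validate sketches the proposed relaxation: in Queue Processor mode an
// explicit enableScheduledEventDraining=false is no longer rejected,
// because NTH only consults that flag in IMDS mode. Spot interruption
// draining is still required, as in the current validator.
func validate(spec *NodeTerminationHandlerSpec) []string {
	var errs []string
	if spec.IsQueueMode() {
		if spec.EnableSpotInterruptionDraining != nil && !*spec.EnableSpotInterruptionDraining {
			errs = append(errs, "spot interruption draining cannot be disabled in Queue Processor mode")
		}
		// enableScheduledEventDraining is intentionally not checked here:
		// the NTH source only uses it when IMDS is enabled.
	}
	return errs
}

func main() {
	tv, fv := true, false
	spec := &NodeTerminationHandlerSpec{
		EnableSQSTerminationDraining:   &tv,
		EnableScheduledEventDraining:   &fv, // explicitly disabled, as in the manifest below
		EnableSpotInterruptionDraining: &tv,
	}
	fmt.Println(validate(spec)) // no errors
}
```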

Based on the AWS Node Termination Handler documentation, enableScheduledEventDraining is only applicable in IMDS mode. While performing kops and Kubernetes upgrades of our cluster, we ran into the error above.

Looking at the AWS Node Termination Handler source code, we can see that scheduled event draining is only used when !imdsDisabled (i.e., when IMDS is enabled):

    if !imdsDisabled && nthConfig.EnableScheduledEventDraining {
        //will retry 4 times with an interval of 2 seconds.
        pollCtx, cancelPollCtx := context.WithTimeout(context.Background(), 8*time.Second)
        err = wait.PollUntilContextCancel(pollCtx, 2*time.Second, true, func(context.Context) (done bool, err error) {
            err = handleRebootUncordon(nthConfig.NodeName, interruptionEventStore, *node)
            if err != nil {
                log.Warn().Err(err).Msgf("Unable to complete the uncordon after reboot workflow on startup, retrying")
            }
            return false, nil
        })
        if err != nil {
            log.Warn().Err(err).Msgf("All retries failed, unable to complete the uncordon after reboot workflow")
        }
        cancelPollCtx()
    }
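
The guard in the snippet above reduces to a two-variable condition; a trivial sketch (the helper name is mine, not NTH's) makes the point that in SQS mode, where IMDS monitoring is disabled, the flag's value is never consulted:

```go
package main

import "fmt"

// shouldUncordonAfterReboot mirrors the guard from the NTH snippet above:
// the scheduled-event (uncordon-after-reboot) path runs only when IMDS
// monitoring is enabled AND the flag is set.
func shouldUncordonAfterReboot(imdsDisabled, enableScheduledEventDraining bool) bool {
	return !imdsDisabled && enableScheduledEventDraining
}

func main() {
	// Queue Processor (SQS) mode: IMDS monitoring is disabled, so the
	// flag's value is irrelevant.
	fmt.Println(shouldUncordonAfterReboot(true, true))  // false
	fmt.Println(shouldUncordonAfterReboot(true, false)) // false
	// Only IMDS mode actually consults the flag.
	fmt.Println(shouldUncordonAfterReboot(false, true)) // true
}
```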

We should be able to disable scheduled event draining while in SQS mode, since it has no effect there. @johngmyers, maybe I'm missing something here?

7. Please provide your cluster manifest. This is the relevant part:

  nodeTerminationHandler:
    enabled: true
    enableSQSTerminationDraining: true
    managedASGTag: "aws-node-termination-handler/managed"
    cpuRequest: 200m
    prometheusEnable: true
    enableRebalanceMonitoring: false
    enableRebalanceDraining: false
    enableSpotInterruptionDraining: true
    enableScheduledEventDraining: false
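
Note that, per the validation code quoted above, the error fires only when the field is explicitly set to false (the `!= nil && !*...` check). As a possible workaround until the validation is relaxed, omitting the field entirely should pass validation while leaving NTH behavior in SQS mode unchanged:

```yaml
  nodeTerminationHandler:
    enabled: true
    enableSQSTerminationDraining: true
    managedASGTag: "aws-node-termination-handler/managed"
    cpuRequest: 200m
    prometheusEnable: true
    enableRebalanceMonitoring: false
    enableRebalanceDraining: false
    enableSpotInterruptionDraining: true
    # enableScheduledEventDraining omitted: the validator only rejects
    # an explicit "false" in Queue Processor mode
```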

8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else do we need to know?

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale