kubernetes-sigs / cluster-api-provider-aws

Kubernetes Cluster API Provider AWS provides consistent deployment and day 2 operations of "self-managed" and EKS Kubernetes clusters on AWS.
http://cluster-api-aws.sigs.k8s.io/
Apache License 2.0

Panic Error: Invalid Memory Address or Nil Pointer Dereference during Taint Deletion in AWSManagedMachinePool #5021

Open nueavv opened 2 weeks ago

nueavv commented 2 weeks ago

/kind bug

What steps did you take and what happened: While managing an EKS cluster using AWSManagedMachinePool, a panic occurs when deleting a taint that was previously added. Below are the related log messages.

Initial Error Log
E0613 05:58:09.596065 1 controller.go:329] "Reconciler error" err="panic: runtime error: invalid memory address or nil pointer dereference [recovered]" controller="awsmanagedmachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSManagedMachinePool" AWSManagedMachinePool="capi-managed-cluster/test-cluster" namespace="capi-managed-cluster" name="test-cluster" reconcileID="12345678-1234-1234-1234-123456789012"
I0613 05:58:09.917851 1 awsmanagedmachinepool_controller.go:200] "Reconciling AWSManagedMachinePool"
I0613 05:58:09.918011 1 launchtemplate.go:73] "checking for existing launch template"
E0613 05:58:10.373797 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 464 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1234560?, 0x1234560})
/go/pkg/mod/k8s.io/apimachinery@v0.29.3/pkg/util/runtime/runtime.go:75 +0x85
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:108 +0xb2
panic({0x1234560?, 0x1234560})
/usr/local/go/src/runtime/panic.go:914 +0x21f
sigs.k8s.io/cluster-api-provider-aws/v2/pkg/cloud/converters.TaintsFromSDK({0xc001604d60, 0x1, 0x25})
/workspace/pkg/cloud/converters/eks.go:115 +0x258
sigs.k8s.io/cluster-api-provider-aws/v2/pkg/cloud/services/eks.(*NodegroupService).createTaintsUpdate(0xc002522d20, {0x0, 0x0, 0x0}, 0xc00075c5a0)
/workspace/pkg/cloud/services/eks/nodegroup.go:420 +0x21d
sigs.k8s.io/cluster-api-provider-aws/v2/pkg/cloud/services/eks.(*NodegroupService).reconcileNodegroupConfig(0xc002522d20, 0xc00075c5a0)
/workspace/pkg/cloud/services/eks/nodegroup.go:469 +0x36e
sigs.k8s.io/cluster-api-provider-aws/v2/pkg/cloud/services/eks.(*NodegroupService).reconcileNodegroup(0xc002522d20, {0x1234560, 0xc002528570})
/workspace/pkg/cloud/services/eks/nodegroup.go:571 +0x72c
sigs.k8s.io/cluster-api-provider-aws/v2/pkg/cloud/services/eks.(*NodegroupService).ReconcilePool(0xc002522d20, {0x1234560, 0xc002528570})
/workspace/pkg/cloud/services/eks/eks.go:109 +0x11a
sigs.k8s.io/cluster-api-provider-aws/v2/exp/controllers.(*AWSManagedMachinePoolReconciler).reconcileNormal(0xc001541980, {0x1234560, 0xc002528570}, 0xc0010baf00, {0x1234560, 0xc002138100})
/workspace/exp/controllers/awsmanagedmachinepool_controller.go:239 +0x4da
sigs.k8s.io/cluster-api-provider-aws/v2/exp/controllers.(*AWSManagedMachinePoolReconciler).Reconcile(0xc001541980, {0x1234560, 0xc002528570}, {{{0xc000913320?, 0x0?}, {0xc000913380?, 0xc000d75d08?}}})
/workspace/exp/controllers/awsmanagedmachinepool_controller.go:192 +0x8b7
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x1234560?, {0x1234560?, 0xc002528570?}, {{{0xc000913320?, 0xb?}, {0xc000913380?, 0x0?}}})
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:119 +0xb7
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0014f4500, {0x1234560, 0xc0014264b0}, {0x1234560?, 0xc002360ea0?})
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:316 +0x3cc
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0014f4500, {0x1234560, 0xc0014264b0})
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:266 +0x1af
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:227 +0x79
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 321
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:223 +0x565
Updated Error Log
E0613 07:06:08.244153 1 controller.go:329] "Reconciler error" err=<
failed to reconcile machine pool for AWSManagedMachinePool capi-managed-cluster/test-cluster: failed to reconcile nodegroup config: creating taints update payload: converting taints: taint has nil fields: {
Effect: "PREFER_NO_SCHEDULE",
Key: "test"
}

What did you expect to happen: Taints should be deleted without causing a panic error.

Anything else you would like to add:

Current TaintsFromSDK function implementation:

import (
    "fmt"
    "log"

    "github.com/aws/aws-sdk-go/service/eks"

    expinfrav1 "sigs.k8s.io/cluster-api-provider-aws/v2/exp/api/v1beta2"
)

// TaintsFromSDK is used to convert an array of AWS SDK taints to CAPA Taints.
func TaintsFromSDK(taints []*eks.Taint) (expinfrav1.Taints, error) {
    converted := expinfrav1.Taints{}

    // If taints array is empty, log the event
    if len(taints) == 0 {
        log.Println("Taints array is empty")
        return converted, nil
    }

    for _, taint := range taints {
        if taint.Effect == nil || taint.Key == nil {
            return nil, fmt.Errorf("taint has nil fields: %+v", taint)
        }

        // Handle nil Value field by setting it to an empty string
        value := ""
        if taint.Value != nil {
            value = *taint.Value
        }

        convertedEffect, err := TaintEffectFromSDK(*taint.Effect)
        if err != nil {
            return nil, fmt.Errorf("converting taint effect %s: %w", *taint.Effect, err)
        }
        converted = append(converted, expinfrav1.Taint{
            Effect: convertedEffect,
            Key:    *taint.Key,
            Value:  value,
        })
    }

    return converted, nil
}
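The nil handling above can be exercised in isolation. Below is a minimal, self-contained sketch of the same logic, using simplified stand-in types (`sdkTaint`, `capaTaint`) instead of the real `eks.Taint` and `expinfrav1.Taint`, since the actual packages are not vendored here:

```go
package main

import "fmt"

// sdkTaint is a simplified stand-in for the AWS SDK's eks.Taint,
// whose fields are all pointers and may be nil.
type sdkTaint struct {
	Effect *string
	Key    *string
	Value  *string
}

// capaTaint is a simplified stand-in for expinfrav1.Taint.
type capaTaint struct {
	Effect, Key, Value string
}

// taintsFromSDK mirrors the issue's proposed nil handling:
// Effect and Key must be set; a nil Value is treated as "".
func taintsFromSDK(taints []*sdkTaint) ([]capaTaint, error) {
	converted := []capaTaint{}
	for _, t := range taints {
		if t == nil || t.Effect == nil || t.Key == nil {
			return nil, fmt.Errorf("taint has nil fields: %+v", t)
		}
		value := ""
		if t.Value != nil {
			value = *t.Value
		}
		converted = append(converted, capaTaint{
			Effect: *t.Effect,
			Key:    *t.Key,
			Value:  value,
		})
	}
	return converted, nil
}

func main() {
	effect, key := "PREFER_NO_SCHEDULE", "test"

	// Value deliberately left nil: conversion succeeds with Value "".
	ok, err := taintsFromSDK([]*sdkTaint{{Effect: &effect, Key: &key}})
	fmt.Println(ok, err)

	// Key left nil: conversion returns an error instead of panicking.
	_, err = taintsFromSDK([]*sdkTaint{{Effect: &effect}})
	fmt.Println(err != nil)
}
```

With this guard in place, a nil `Effect` or `Key` surfaces as the "taint has nil fields" error from the updated log above rather than a nil pointer dereference.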

Suggestions:

- Resolve the panic that occurs during taint deletion.
- When the taint array is empty, ensure that existing taints are deleted.
- Review the logic handling empty taint arrays and nil fields.
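The second suggestion, deleting existing taints when the desired set is empty, amounts to diffing the current nodegroup taints against the spec. A rough sketch of that diff (simplified value types; `taintsUpdate` only mirrors the shape of EKS's UpdateTaintsPayload with its addOrUpdateTaints/removeTaints fields, it is not the real SDK type):

```go
package main

import "fmt"

// taint is a simplified value type standing in for an EKS taint.
type taint struct {
	Effect, Key, Value string
}

// taintsUpdate mirrors the shape of the EKS UpdateTaintsPayload:
// taints to add or update, and taints to remove.
type taintsUpdate struct {
	AddOrUpdate []taint
	Remove      []taint
}

// diffTaints computes the update needed to move from the nodegroup's
// current taints to the desired set. An empty desired set must still
// produce a Remove entry for every existing taint -- the case the
// issue reports as mishandled.
func diffTaints(current, desired []taint) taintsUpdate {
	desiredByKey := map[string]taint{}
	for _, t := range desired {
		desiredByKey[t.Key] = t
	}
	currentByKey := map[string]taint{}
	for _, t := range current {
		currentByKey[t.Key] = t
	}

	u := taintsUpdate{}
	for _, t := range desired {
		// New key, or same key with changed effect/value: add or update.
		if cur, ok := currentByKey[t.Key]; !ok || cur != t {
			u.AddOrUpdate = append(u.AddOrUpdate, t)
		}
	}
	for _, t := range current {
		// Present on the nodegroup but absent from the spec: remove.
		if _, ok := desiredByKey[t.Key]; !ok {
			u.Remove = append(u.Remove, t)
		}
	}
	return u
}

func main() {
	current := []taint{{Effect: "PREFER_NO_SCHEDULE", Key: "test", Value: ""}}
	// Desired set is empty: the update must remove the existing taint.
	fmt.Printf("%+v\n", diffTaints(current, nil))
}
```

An update built this way is non-empty whenever a taint is dropped from the spec, so the empty-array case is not silently skipped.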

Environment:

Cluster-api-provider-aws version: 2.5.0
Kubernetes version (use kubectl version): 1.25.5 (local kubernetes)
OS (e.g. from /etc/os-release): Ubuntu 20.04

k8s-ci-robot commented 2 weeks ago

This issue is currently awaiting triage.

If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
nueavv commented 2 weeks ago

The issue appears to be caused by the taint's value field being an empty string. The taint configuration is as follows:

(screenshot: taint configuration)

This empty string in the value field might be leading to the observed behavior.