kubernetes-sigs / cluster-api-provider-aws

Kubernetes Cluster API Provider AWS provides consistent deployment and day 2 operations of "self-managed" and EKS Kubernetes clusters on AWS.
http://cluster-api-aws.sigs.k8s.io/
Apache License 2.0

Panic Error: Invalid Memory Address or Nil Pointer Dereference during Taint Deletion in AWSManagedMachinePool #5021

Closed: nueavv closed this issue 2 weeks ago

nueavv commented 5 months ago

/kind bug

What steps did you take and what happened: While managing an EKS cluster with AWSManagedMachinePool, the controller panics when a taint is deleted after having been added. The related log messages are below.

Initial Error Log
E0613 05:58:09.596065 1 controller.go:329] "Reconciler error" err="panic: runtime error: invalid memory address or nil pointer dereference [recovered]" controller="awsmanagedmachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSManagedMachinePool" AWSManagedMachinePool="capi-managed-cluster/test-cluster" namespace="capi-managed-cluster" name="test-cluster" reconcileID="12345678-1234-1234-1234-123456789012"
I0613 05:58:09.917851 1 awsmanagedmachinepool_controller.go:200] "Reconciling AWSManagedMachinePool"
I0613 05:58:09.918011 1 launchtemplate.go:73] "checking for existing launch template"
E0613 05:58:10.373797 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 464 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1234560?, 0x1234560})
/go/pkg/mod/k8s.io/apimachinery@v0.29.3/pkg/util/runtime/runtime.go:75 +0x85
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:108 +0xb2
panic({0x1234560?, 0x1234560})
/usr/local/go/src/runtime/panic.go:914 +0x21f
sigs.k8s.io/cluster-api-provider-aws/v2/pkg/cloud/converters.TaintsFromSDK({0xc001604d60, 0x1, 0x25})
/workspace/pkg/cloud/converters/eks.go:115 +0x258
sigs.k8s.io/cluster-api-provider-aws/v2/pkg/cloud/services/eks.(*NodegroupService).createTaintsUpdate(0xc002522d20, {0x0, 0x0, 0x0}, 0xc00075c5a0)
/workspace/pkg/cloud/services/eks/nodegroup.go:420 +0x21d
sigs.k8s.io/cluster-api-provider-aws/v2/pkg/cloud/services/eks.(*NodegroupService).reconcileNodegroupConfig(0xc002522d20, 0xc00075c5a0)
/workspace/pkg/cloud/services/eks/nodegroup.go:469 +0x36e
sigs.k8s.io/cluster-api-provider-aws/v2/pkg/cloud/services/eks.(*NodegroupService).reconcileNodegroup(0xc002522d20, {0x1234560, 0xc002528570})
/workspace/pkg/cloud/services/eks/nodegroup.go:571 +0x72c
sigs.k8s.io/cluster-api-provider-aws/v2/pkg/cloud/services/eks.(*NodegroupService).ReconcilePool(0xc002522d20, {0x1234560, 0xc002528570})
/workspace/pkg/cloud/services/eks/eks.go:109 +0x11a
sigs.k8s.io/cluster-api-provider-aws/v2/exp/controllers.(*AWSManagedMachinePoolReconciler).reconcileNormal(0xc001541980, {0x1234560, 0xc002528570}, 0xc0010baf00, {0x1234560, 0xc002138100})
/workspace/exp/controllers/awsmanagedmachinepool_controller.go:239 +0x4da
sigs.k8s.io/cluster-api-provider-aws/v2/exp/controllers.(*AWSManagedMachinePoolReconciler).Reconcile(0xc001541980, {0x1234560, 0xc002528570}, {{{0xc000913320?, 0x0?}, {0xc000913380?, 0xc000d75d08?}}})
/workspace/exp/controllers/awsmanagedmachinepool_controller.go:192 +0x8b7
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x1234560?, {0x1234560?, 0xc002528570?}, {{{0xc000913320?, 0xb?}, {0xc000913380?, 0x0?}}})
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:119 +0xb7
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0014f4500, {0x1234560, 0xc0014264b0}, {0x1234560?, 0xc002360ea0?})
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:316 +0x3cc
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0014f4500, {0x1234560, 0xc0014264b0})
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:266 +0x1af
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:227 +0x79
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 321
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:223 +0x565
Updated Error Log
E0613 07:06:08.244153 1 controller.go:329] "Reconciler error" err=<
failed to reconcile machine pool for AWSManagedMachinePool capi-managed-cluster/test-cluster: failed to reconcile nodegroup config: creating taints update payload: converting taints: taint has nil fields: {
Effect: "PREFER_NO_SCHEDULE",
Key: "test"
}

What did you expect to happen: Taints should be deleted without the controller panicking.

Anything else you would like to add:

Current TaintsFromSDK function implementation:

import (
    "fmt"
    "log"

    // The import paths below are assumptions for the types used in this snippet:
    // the aws-sdk-go v1 EKS client and the CAPA experimental API group.
    "github.com/aws/aws-sdk-go/service/eks"
    expinfrav1 "sigs.k8s.io/cluster-api-provider-aws/v2/exp/api/v1beta2"
)

// TaintsFromSDK is used to convert an array of AWS SDK taints to CAPA Taints.
func TaintsFromSDK(taints []*eks.Taint) (expinfrav1.Taints, error) {
    converted := expinfrav1.Taints{}

    // If taints array is empty, log the event
    if len(taints) == 0 {
        log.Println("Taints array is empty")
        return converted, nil
    }

    for _, taint := range taints {
        if taint.Effect == nil || taint.Key == nil {
            return nil, fmt.Errorf("taint has nil fields: %+v", taint)
        }

        // Handle nil Value field by setting it to an empty string
        value := ""
        if taint.Value != nil {
            value = *taint.Value
        }

        convertedEffect, err := TaintEffectFromSDK(*taint.Effect)
        if err != nil {
            return nil, fmt.Errorf("converting taint effect %s: %w", *taint.Effect, err)
        }
        converted = append(converted, expinfrav1.Taint{
            Effect: convertedEffect,
            Key:    *taint.Key,
            Value:  value,
        })
    }

    return converted, nil
}

Suggestions:

- Resolve the panic that occurs during taint deletion.
- When the taint array becomes empty, ensure the existing taints on the nodegroup are actually removed (a rough sketch of this is below).
- Review the logic that handles empty taint arrays and nil fields.
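As a rough sketch of the second suggestion (this is not CAPA's actual createTaintsUpdate; buildTaintsUpdate and its key-based diff are assumptions), an empty desired taint list could be turned into an UpdateTaintsPayload whose RemoveTaints covers everything currently on the nodegroup, using the aws-sdk-go v1 eks types:

import (
    "github.com/aws/aws-sdk-go/service/eks"
)

// buildTaintsUpdate is a hypothetical helper (not CAPA's createTaintsUpdate):
// it diffs the desired taints from the spec against the taints currently on
// the nodegroup and builds an update payload that removes anything no longer
// desired, so an empty desired list deletes all existing taints.
func buildTaintsUpdate(desired, current []*eks.Taint) *eks.UpdateTaintsPayload {
    keyOf := func(t *eks.Taint) string {
        if t == nil || t.Key == nil {
            return ""
        }
        return *t.Key
    }

    desiredKeys := map[string]bool{}
    for _, t := range desired {
        desiredKeys[keyOf(t)] = true
    }

    payload := &eks.UpdateTaintsPayload{}
    // Everything still in the spec is added or updated.
    payload.AddOrUpdateTaints = append(payload.AddOrUpdateTaints, desired...)
    // Anything on the nodegroup that is no longer in the spec is removed.
    for _, t := range current {
        if !desiredKeys[keyOf(t)] {
            payload.RemoveTaints = append(payload.RemoveTaints, t)
        }
    }
    if len(payload.AddOrUpdateTaints) == 0 && len(payload.RemoveTaints) == 0 {
        return nil // nothing to update
    }
    return payload
}

Diffing by key alone is a simplification; the point is only that an empty spec should end up in RemoveTaints instead of being dereferenced during conversion.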

Environment:

Cluster-api-provider-aws version: 2.5.0
Kubernetes version (use kubectl version): 1.25.5 (local Kubernetes)
OS (e.g. from /etc/os-release): Ubuntu 20.04

k8s-ci-robot commented 5 months ago

This issue is currently awaiting triage.

If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
nueavv commented 5 months ago

The issue appears to be caused by the value field in the taint having an empty string. The taint configuration is as follows:

(screenshot of the taint configuration, omitted)

This empty string in the value field might be leading to the observed behavior.
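If the empty value is indeed the trigger, DescribeNodegroup presumably returns the taint with its Value pointer unset, which would explain the nil dereference inside TaintsFromSDK. A minimal sketch of that taint shape and a nil-safe read, assuming the aws-sdk-go v1 aws and eks packages (exampleTaint and valueOrEmpty are illustrative names, not CAPA code):

import (
    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/service/eks"
)

// exampleTaint mirrors the taint shape from the logs above: Key and Effect are
// set, Value is nil (as it can be when the value was an empty string).
var exampleTaint = &eks.Taint{
    Key:    aws.String("test"),
    Effect: aws.String("PREFER_NO_SCHEDULE"),
    // Value intentionally left nil
}

// valueOrEmpty reads the Value without risking a nil dereference:
// aws.StringValue returns "" when the pointer is nil, whereas *taint.Value
// would panic exactly as seen in the initial error log.
func valueOrEmpty(taint *eks.Taint) string {
    return aws.StringValue(taint.Value)
}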

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 2 weeks ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 2 weeks ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/5021#issuecomment-2466676658):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.