kubernetes-sigs / cluster-api-provider-aws

Kubernetes Cluster API Provider AWS provides consistent deployment and day 2 operations of "self-managed" and EKS Kubernetes clusters on AWS.
http://cluster-api-aws.sigs.k8s.io/
Apache License 2.0

EKS node group intermittently fails to delete on EKS cluster deletion #2665

Open abhinavnagaraj opened 2 years ago

abhinavnagaraj commented 2 years ago

/kind bug

What steps did you take and what happened: Launch an EKS cluster with the bastion enabled and one managed node group of size 5, then delete the `Cluster`. The node group deletion fails with a "Delete failed" error in the AWS console. The node group's health issue reports `Ec2SecurityGroupDeletionFailure`: "DependencyViolation - resource has a dependent object". The affected security group is `eks-remoteAccess-*`.

This is the error in the CAPA logs: `"msg"="Reconciler error" "error"="failed to reconcile machine pool deletion for AWSManagedMachinePool ns/workerpool-abc: failed to delete nodegroup: failed waiting for EKS nodegroup workerpool-abc to delete: ResourceNotReady: failed waiting for successful resource state"`.

The worker nodes/instances are deleted, but the bastion node is still running. The security groups `eks-cluster-sg-*`, `*-node-eks-additional`, and `*-bastion` are not deleted. Some network interfaces are still in use, and one network interface is in the 'Available' state.

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: AWSManagedControlPlane
metadata:
  name: capi-eks-quickstart-control-plane
  namespace: default
spec:
  eksClusterName: capi-eks-quickstart
  region: us-east-1
  version: v1.19.0
  bastion:
    allowedCIDRBlocks:
    - 0.0.0.0/0
    enabled: true
---
apiVersion: exp.cluster.x-k8s.io/v1alpha3
kind: MachinePool
metadata:
  name: capi-eks-quickstart-pool-0
  namespace: default
spec:
  clusterName: capi-eks-quickstart
  replicas: 5
  template:
    spec:
      bootstrap:
        dataSecretName: ""
      clusterName: capi-eks-quickstart
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
        kind: AWSManagedMachinePool
        name: capi-eks-quickstart-pool-0
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: AWSManagedMachinePool
metadata:
  name: capi-eks-quickstart-pool-0
  namespace: default
spec:
  eksNodegroupName: node-group-0
---
```

What did you expect to happen: Expected the cluster to be deleted and all the associated resources to be cleaned up in AWS.

Anything else you would like to add: Manually deleting the ENI in the 'Available' state allowed the deletion to progress and eventually succeed. This is an intermittent issue, possibly related to a long-standing open EKS issue with dangling ENIs. Bumping the version of vpc-cni did not make a difference.
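The manual workaround above can be sketched as a small script. This is only an illustration, not CAPA code: the record shape mirrors the EC2 `DescribeNetworkInterfaces` response, the security-group name prefixes are assumptions taken from this report, and in practice the records would come from boto3's `ec2.describe_network_interfaces()` with each returned ID passed to `ec2.delete_network_interface()`.

```python
# Hedged sketch of the manual workaround: pick out leftover ENIs that are
# candidates for deletion, i.e. ENIs in the 'available' state that belong to
# one of the cluster security groups blocking deletion.
# The prefixes below are assumptions based on the names seen in this report.

def dangling_enis(interfaces, sg_prefixes=("eks-remoteAccess-", "eks-cluster-sg-")):
    """Return IDs of ENIs in the 'available' state attached to a security
    group whose name starts with one of sg_prefixes."""
    candidates = []
    for eni in interfaces:
        if eni.get("Status") != "available":
            continue  # in-use ENIs should be released by whatever owns them
        names = [g["GroupName"] for g in eni.get("Groups", [])]
        if any(n.startswith(p) for n in names for p in sg_prefixes):
            candidates.append(eni["NetworkInterfaceId"])
    return candidates

# Fabricated sample data mirroring the reported state (IDs are made up):
sample = [
    {"NetworkInterfaceId": "eni-aaa", "Status": "in-use",
     "Groups": [{"GroupName": "eks-cluster-sg-capi-eks-quickstart-123"}]},
    {"NetworkInterfaceId": "eni-bbb", "Status": "available",
     "Groups": [{"GroupName": "eks-remoteAccess-node-group-0"}]},
]
print(dangling_enis(sample))  # only the 'available' ENI is a candidate
```

Only the ENI that is both 'Available' and attached to a matching security group is selected, which matches the observation that deleting exactly that ENI unblocked the security group deletion.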

Environment:

richardcase commented 2 years ago

/area provider/eks
/priority important-soon
/help
/milestone v0.7.x

k8s-ci-robot commented 2 years ago

@richardcase: This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/2665):

> /area provider/eks
> /priority important-soon
> /help
> /milestone v0.7.x

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Mark this issue or PR as rotten with `/lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

richardcase commented 2 years ago

/lifecycle frozen

richardcase commented 1 year ago

/remove-lifecycle frozen

k8s-triage-robot commented 1 year ago


/lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with `/remove-lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten