kubernetes-sigs / cluster-api-provider-aws

Kubernetes Cluster API Provider AWS provides consistent deployment and day 2 operations of "self-managed" and EKS Kubernetes clusters on AWS.
http://cluster-api-aws.sigs.k8s.io/
Apache License 2.0

Reconciler tries to delete security groups in use during cluster deletion #4985

Open jfcavalcante opened 6 months ago

jfcavalcante commented 6 months ago

/kind bug

What steps did you take and what happened:

After deleting a newly provisioned cluster, I could see that the deletion process wasn't running smoothly. Even while the cluster is in the deleting state, CAPA tries to delete security groups that are still in use.

It looks like the reconciler does not filter out in-use security groups before trying to delete them, resulting in this error, which can be confusing for a new ClusterAPI user.

E0518 20:50:20.738949       1 controller.go:329] "Reconciler error" err=<
    [error deleting security groups: [failed to delete security group "sg-092fbe4cd72832838" with name "capi-quickstart-apiserver-lb": DependencyViolation: resource sg-092fbe4cd72832838 has a dependent object
        status code: 400, request id: cd701b81-82c1-443b-8dd9-68202c055253, failed to delete security group "sg-0cc1b0c811e30c540" with name "capi-quickstart-controlplane": DependencyViolation: resource sg-0cc1b0c811e30c540 has a dependent object
        status code: 400, request id: ee4da697-4b02-47f8-9498-5f75cc952d66], error deleting network: failed to delete vpc "vpc-0bf1981f720b6c560": DependencyViolation: The vpc 'vpc-0bf1981f720b6c560' has dependencies and cannot be deleted.
        status code: 400, request id: b89ff4bd-718d-4d4e-928e-dd299e5b4b01]
 > controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="default/capi-quickstart" namespace="default" name="capi-quickstart" reconcileID="07010877-5547-4634-9c61-99386120deed"
I0518 20:50:20.739420       1 awscluster_controller.go:208] "Reconciling AWSCluster delete" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="default/capi-quickstart" namespace="default" name="capi-quickstart" reconcileID="851dc5ae-6cc8-46ae-b436-0cce6c06bf57" cluster="default/capi-quickstart"

What did you expect to happen:

I expected the controller to check whether the resources related to a cluster can actually be deleted before attempting to delete them.
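That pre-delete check could be as simple as skipping security groups that still have dependents. A minimal sketch, assuming the set of in-use group IDs has already been collected (e.g. via EC2 `DescribeNetworkInterfaces` filtered by `group-id`); the function name and shapes are illustrative, not CAPA's actual code:

```go
package main

import "fmt"

// deletableGroups returns only the security group IDs that are not currently
// referenced by another resource, so the reconciler would attempt deletion
// only on those and requeue the rest. Purely illustrative helper.
func deletableGroups(all []string, inUse map[string]bool) []string {
	var ok []string
	for _, id := range all {
		if !inUse[id] {
			ok = append(ok, id)
		}
	}
	return ok
}

func main() {
	all := []string{"sg-092fbe4cd72832838", "sg-0cc1b0c811e30c540", "sg-unused"}
	inUse := map[string]bool{
		"sg-092fbe4cd72832838": true, // still attached to the LB
		"sg-0cc1b0c811e30c540": true, // still attached to control-plane ENIs
	}
	fmt.Println(deletableGroups(all, inUse)) // only the unreferenced group
}
```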

Environment:

k8s-ci-robot commented 6 months ago

This issue is currently awaiting triage.

If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

AndiDog commented 1 month ago

Are you sure that you only deleted the Cluster object, and nothing else? It's CAPI that deletes all subresources, in the correct order. For example, you must not delete the AWSCluster object yourself; CAPI will do that for you, only once the infrastructure (incl. security groups) can go away. For GitOps scenarios, there's usually a label or annotation (such as helm.sh/resource-policy: keep) that tells the CD controller to delete only the Cluster and not its children.
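As a concrete illustration of that last point, a Helm-managed AWSCluster could carry that annotation so an uninstall propagates only the Cluster deletion. A minimal sketch (name and region are placeholders):

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSCluster
metadata:
  name: capi-quickstart   # placeholder
  annotations:
    # Helm skips deleting this object on uninstall; CAPI's owner-reference
    # chain then removes it in the correct order during Cluster deletion.
    helm.sh/resource-policy: keep
spec:
  region: us-east-1       # placeholder
```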

HomayoonAlimohammadi commented 1 week ago

Hey! I'm experiencing the same issue. I deployed a workload cluster with AWS as the infra provider, and upon deletion CAPA fails with the same error.

I1107 08:46:25.164845       1 awscluster_controller.go:207] "Reconciling AWSCluster delete" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="default/test-ci-cluster" namespace="default" name="test-ci-cluster" reconcileID="1a2e8b21-b028-4bda-8b16-b44226f47e63" cluster="default/test-ci-cluster"
E1107 08:46:39.801982       1 controller.go:324] "Reconciler error" err=<
    [error deleting security groups: [failed to delete security group "sg-061b6d220573a793a" with name "test-ci-cluster-controlplane": DependencyViolation: resource sg-061b6d220573a793a has a dependent object
        status code: 400, request id: a27d8c3a-7f2c-4a25-b0bd-4ea992818b40, failed to delete security group "sg-080e3eb23d27d136a" with name "test-ci-cluster-lb": DependencyViolation: resource sg-080e3eb23d27d136a has a dependent object
        status code: 400, request id: 733a964d-b0f7-418e-a0d1-14ff9c1957d4, failed to delete security group "sg-0174a690bbf1915be" with name "test-ci-cluster-node": DependencyViolation: resource sg-0174a690bbf1915be has a dependent object
        status code: 400, request id: 6c0c1790-d761-4b67-9091-45f5f3d706f4], error deleting network: failed to delete subnet "subnet-0547029e44dd9cac0": DependencyViolation: The subnet 'subnet-0547029e44dd9cac0' has dependencies and cannot be deleted.
        status code: 400, request id: 73001b9c-37bd-4756-82e3-9abed3f56fc0]
 > controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="default/test-ci-cluster" namespace="default" name="test-ci-cluster" reconcileID="1a2e8b21-b028-4bda-8b16-b44226f47e63"

Used `clusterctl init -i aws -b - -c -` (with custom controllers deployed later).

CAPA image: registry.k8s.io/cluster-api-aws/cluster-api-aws-controller:v2.7.1

Clusterctl version:

clusterctl version: &version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.4", GitCommit:"3cce0d973682f11ab0f0ba1c2522eba66dac2d91", GitTreeState:"clean", BuildDate:"2024-10-08T15:37:26Z", GoVersion:"go1.22.7", Compiler:"gc", Platform:"linux/amd64"}

OS:

Distributor ID: Ubuntu
Description:    Ubuntu 22.04.5 LTS
Release:    22.04
Codename:   jammy

Management cluster k8s (Canonical K8s, snap install k8s --classic --edge):

Client Version: v1.31.2
Kustomize Version: v5.4.2
Server Version: v1.31.2