kubernetes-sigs / cluster-api-provider-aws

Kubernetes Cluster API Provider AWS provides consistent deployment and day 2 operations of "self-managed" and EKS Kubernetes clusters on AWS.
http://cluster-api-aws.sigs.k8s.io/
Apache License 2.0
636 stars 561 forks source link

[2.3.0/1] Defect: controller removes all vpc endpoints whether owned or not on cluster delete #4714

Open philjb opened 8 months ago

philjb commented 8 months ago

/kind bug

What steps did you take and what happened: CAPA controller deletes all VPC Endpoints associated with the cluster managed VPC, when deleting a cluster.

This has not caused an issue for me but I believe it is a defect that will impact some users.

This potential issue introduced in pr https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/4577.

I believe that the delete logic for removing vpc endpoints is too aggressive func (s *Service) deleteVPCEndpoints() error in pkg/cloud/services/network/vpc.go. If CAPA creates a vpc endpoint for its capa-managed s3 bucket, it should delete only that vpc endpoint. I know that capa can't delete its managed vpc until all vpc endpoints attached it are removed but CAPA shouldn't be deleting vpc endpoints that other controllers have created (ACK or custom etc). We use ACK to create vpc endpoints attached to the CAPA managed vpc. CAPA should really be letting the creating controllers delete those resources.

This hasn't caused an issue with cluster teardown that i've seen for us, but CAPA is deleting VPC endpoints owned by other controllers.

https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/bba04c43cd7a25b1384aa22a2ebcb65a8da97c4b/pkg/cloud/services/network/vpc.go#L252-L284

What did you expect to happen: CAPA controller only deletes resources it creates or have been added to CAPA for management (through "adoption" via tags).

Anything else you would like to add:

I think an appropriate solution is for CAPA to tag a VPC endpoint so it knows which are "its" and then use that tag to find those it should delete. This is the pattern for other resources. Then if CAPA controller can't delete the managed VPC, it waits and retries in a backoff until it can.

Environment:

AndiDog commented 8 months ago

/triage accepted