kubernetes / cloud-provider-aws

Cloud provider for AWS
https://cloud-provider-aws.sigs.k8s.io/
Apache License 2.0

Could not delete security groups #566

Closed datosh closed 1 month ago

datosh commented 1 year ago

What happened:

I have deployed the CCM for cloud provider AWS in my cluster. When I create a Service of type LoadBalancer, the resources in AWS (security group & LB) are created correctly. When I delete the Service in K8s, the CCM is unable to delete the security group. See the relevant logs:

I0201 11:21:50.382619       1 aws.go:3170] Comparing sg-02876532fa250bc14 to sg-02876532fa250bc14
W0201 11:21:50.382628       1 aws.go:4627] Allowing ingress was not needed; concurrent change? groupId=sg-0774fbdd7b1f31e00
I0201 11:21:50.382647       1 aws.go:4372] Loadbalancer a6869f7a66c5c46a099a99d010693ca5 (lb-test/whoami) has DNS name a6869f7a66c5c46a099a99d010693ca5-1668642184.eu-central-1.elb.amazonaws.com
I0201 11:21:50.382783       1 event.go:294] "Event occurred" object="lb-test/whoami" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuredLoadBalancer" message="Ensured load balancer"
I0201 11:26:32.211973       1 controller.go:398] Deleting existing load balancer for service lb-test/whoami
I0201 11:26:32.212398       1 event.go:294] "Event occurred" object="lb-test/whoami" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="DeletingLoadBalancer" message="Deleting load balancer"
I0201 11:26:32.869424       1 aws.go:4797] Ignoring DependencyViolation while deleting load-balancer security group (sg-02876532fa250bc14), assuming because LB is in process of deleting
I0201 11:26:32.869514       1 aws.go:4821] Waiting for load-balancer to delete so we can delete security groups: whoami
[ ... 10 minutes of retrying ... ]
I0201 11:36:32.173123       1 aws.go:4797] Ignoring DependencyViolation while deleting load-balancer security group (sg-02876532fa250bc14), assuming because LB is in process of deleting
I0201 11:36:32.173211       1 aws.go:4821] Waiting for load-balancer to delete so we can delete security groups: whoami
I0201 11:36:42.523796       1 aws.go:4797] Ignoring DependencyViolation while deleting load-balancer security group (sg-02876532fa250bc14), assuming because LB is in process of deleting
E0201 11:36:42.523905       1 controller.go:320] error processing service lb-test/whoami (will retry): failed to delete load balancer: timed out deleting ELB: whoami. Could not delete security groups sg-02876532fa250bc14
I0201 11:36:42.524064       1 event.go:294] "Event occurred" object="lb-test/whoami" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to delete load balancer: timed out deleting ELB: whoami. Could not delete security groups sg-02876532fa250bc14"
I0201 11:36:42.547781       1 controller.go:958] Removing finalizer from service lb-test/whoami
I0201 11:36:42.553335       1 controller.go:984] Patching status for service lb-test/whoami
I0201 11:36:42.554001       1 event.go:294] "Event occurred" object="lb-test/whoami" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="DeletedLoadBalancer" message="Deleted load balancer"

In the end, only the load balancer is deleted. The security group remains.

This prevents me from later deleting my VPC via Terraform, since the security group created by CCM is unknown to Terraform and terraform destroy fails.

What you expected to happen:

I expected the security group that the CCM created for the LB to be deleted when the Service of type LoadBalancer was deleted.

How to reproduce it (as minimally and precisely as possible):

Create a Service of type LoadBalancer. Wait for the security group & LB to be created in AWS. Delete the Service.
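
A minimal sequence along these lines should reproduce it (namespace, names and image are just placeholders):

kubectl create namespace lb-test
kubectl -n lb-test create deployment whoami --image=traefik/whoami --port=80
kubectl -n lb-test expose deployment whoami --type=LoadBalancer --port=8080 --target-port=80
# wait until the ELB and its security group exist in AWS, then
kubectl -n lb-test delete service whoami
# the CCM now retries deleting the security group for ~10 minutes and gives up (see logs above)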

Anything else we need to know?:

The CCM is deployed with the following args:

--cloud-provider=aws
--leader-elect=true
--allocate-node-cidrs=false
--configure-cloud-routes=false
-v=2

Environment:

/kind bug

k8s-ci-robot commented 1 year ago

This issue is currently awaiting triage.

If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
mmerkes commented 1 year ago

Does this happen consistently? Did you try manually cleaning up the SG to see why it was failing? I'm curious what is hanging around and causing this issue. Here's a post on troubleshooting the issue.

Depending on what's hanging around, we may or may not want to tackle it here. Perhaps the load balancer is taking longer than expected to terminate (or to report that it has terminated) and you can just delete the SG without any manual action, or there's some unexpected resource hanging around.

datosh commented 1 year ago

> Does this happen consistently?

Yes.

> Did you try manually cleaning up the SG to see why it was failing?

I did. The setup is as follows: let's call the "main" security group of my VPC sg-a. When the ELB is created, a new security group sg-b is created (I suspect by cloud-provider-aws) to manage the allowed port mappings. An additional inbound rule that references sg-b is created in sg-a. This rule prevents the deletion of sg-b when I remove the Service of type LoadBalancer in K8s.

The kubectl delete ... operation hangs until I manually delete the rule in sg-a. With this manual intervention everything works as expected!
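
For anyone hitting the same thing, the manual cleanup boils down to the following AWS CLI calls (sg-a/sg-b are the placeholders from above; the --ip-permissions body has to match the rule that was actually created, in our case an all-protocol rule referencing sg-b):

# find the group(s) whose ingress rules reference the CCM-created group sg-b
aws ec2 describe-security-groups \
  --filters Name=ip-permission.group-id,Values=<sg-b> \
  --query 'SecurityGroups[].[GroupId,GroupName]' --output text

# revoke that rule from sg-a ...
aws ec2 revoke-security-group-ingress \
  --group-id <sg-a> \
  --ip-permissions '[{"IpProtocol":"-1","UserIdGroupPairs":[{"GroupId":"<sg-b>"}]}]'

# ... after which sg-b can be deleted (or the CCM's own retry succeeds)
aws ec2 delete-security-group --group-id <sg-b>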

> Depending on what's hanging around, we may or may not want to tackle it here. Perhaps the load balancer is taking longer than expected to terminate (or to report that it has terminated) and you can just delete the SG without any manual action, or there's some unexpected resource hanging around.

Do you recognize this dependency, and could you give me some further pointers?

Could this be a misconfiguration on my side? See the args I pass in my initial message.

mmerkes commented 1 year ago

I see. Can you share the config you're using to create the service, with any private details redacted? If you can provide YAML that consistently reproduces this (assuming the issue is not apparent from the config itself), it will help figure out what's actually happening.

james-callahan commented 1 year ago

I'm seeing the same issue, except that the load balancer does not get deleted in the end: I'm left paying for an ELB with no Service to match it.

I0315 01:33:30.907304       1 aws.go:4887] Waiting for load-balancer to delete so we can delete security groups: traefik
I0315 01:33:41.090973       1 aws.go:4863] Ignoring DependencyViolation while deleting load-balancer security group (sg-04f7679bdc2937e60), assuming because LB is in process of deleting
E0315 01:33:41.091168       1 controller.go:289] error processing service traefik/traefik (will retry): failed to delete load balancer: timed out deleting ELB: traefik. Could not delete security groups sg-04f7679bdc2937e60
I0315 01:33:41.092665       1 event.go:294] "Event occurred" object="traefik/traefik" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to delete load balancer: timed out deleting ELB: traefik. Could not delete security groups sg-04f7679bdc2937e60"
I0315 01:33:41.115600       1 controller.go:890] Removing finalizer from service traefik/traefik
I0315 01:33:41.131181       1 controller.go:916] Patching status for service traefik/traefik
I0315 01:33:41.132914       1 event.go:294] "Event occurred" object="traefik/traefik" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="DeletedLoadBalancer" message="Deleted load balancer"
katexochen commented 1 year ago

Any updates on this issue? We're still seeing this on a daily basis.

katexochen commented 1 year ago

Sorry @mmerkes, I missed your comment. Here is the config we are using (this is Kubernetes v1.26.3):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "1"
    meta.helm.sh/release-name: constellation-services
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2023-03-30T07:49:59Z"
  generation: 1
  labels:
    app.kubernetes.io/managed-by: Helm
    k8s-app: cloud-controller-manager
  name: cloud-controller-manager
  namespace: kube-system
  resourceVersion: "974"
  uid: 8e90444c-81cf-453e-a4b1-4a94920b61d5
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: cloud-controller-manager
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: cloud-controller-manager
    spec:
      containers:
      - args:
        - --cloud-provider=aws
        - --leader-elect=true
        - --allocate-node-cidrs=false
        - --configure-cloud-routes=false
        - -v=2
        image: registry.k8s.io/provider-aws/cloud-controller-manager:v1.26.1@sha256:2a43d2d5611ba920c49e23127cfd474fb7932fcade1671dddbef757921fcdb40
        imagePullPolicy: IfNotPresent
        name: cloud-controller-manager
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/kubernetes
          name: etckubernetes
          readOnly: true
        - mountPath: /etc/ssl
          name: etcssl
          readOnly: true
        - mountPath: /etc/pki
          name: etcpki
          readOnly: true
      dnsPolicy: ClusterFirst
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: cloud-controller-manager
      serviceAccountName: cloud-controller-manager
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node.cloudprovider.kubernetes.io/uninitialized
        value: "true"
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      - effect: NoSchedule
        key: node-role.kubernetes.io/control-plane
        operator: Exists
      - effect: NoSchedule
        key: node.kubernetes.io/not-ready
      volumes:
      - hostPath:
          path: /etc/kubernetes
          type: ""
        name: etckubernetes
      - hostPath:
          path: /etc/ssl
          type: ""
        name: etcssl
      - hostPath:
          path: /etc/pki
          type: ""
        name: etcpki
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 1
  desiredNumberScheduled: 1
  numberAvailable: 1
  numberMisscheduled: 0
  numberReady: 1
  observedGeneration: 1
  updatedNumberScheduled: 1
mmerkes commented 1 year ago

@kishorj Can you describe the issue we discussed here? Or create a different tracking issue in lieu of this one?

katexochen commented 1 year ago

The LoadBalancer Service we are deploying looks like this:

apiVersion: v1
kind: Service
metadata:
  name: whoami
  namespace: lb-test
spec:
  selector:
    app: whoami
  ports:
    - port: 8080
      targetPort: 80
  type: LoadBalancer

---

apiVersion: apps/v1
kind: Deployment
metadata:
  name: whoami
  namespace: lb-test
  labels:
    app: whoami
spec:
  replicas: 3
  selector:
    matchLabels:
      app: whoami
  template:
    metadata:
      labels:
        app: whoami
    spec:
      containers:
      - name: whoami
        image: traefik/whoami:v1.8.7
        ports:
        - containerPort: 80
        args:
        - "--verbose"

Because the security group is leaked, Terraform can't destroy the infrastructure afterwards and fails with:

Error: deleting EC2 VPC (vpc-02d83a11b3188a569): DependencyViolation: The vpc 'vpc-02d83a11b3188a569' has dependencies and cannot be deleted.
        status code: 400, request id: c597c847-7023-42cf-a1dc-bd919cde0ff0

Let me know if I can provide any further information needed to debug.
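
As a stop-gap before running terraform destroy, the leaked group can be found and removed by hand (assuming the CCM-created ELB security group carries a kubernetes.io/cluster/<cluster-name> tag; <cluster-name> and the group ID are placeholders, the VPC ID is the one from the error above):

# list security groups in the VPC that carry the cluster tag
aws ec2 describe-security-groups \
  --filters Name=vpc-id,Values=vpc-02d83a11b3188a569 Name=tag-key,Values=kubernetes.io/cluster/<cluster-name> \
  --query 'SecurityGroups[].[GroupId,GroupName]' --output text

# delete the leaked group (after revoking any rules that still reference it, see above)
aws ec2 delete-security-group --group-id <leaked-sg-id>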

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

katexochen commented 1 year ago

/remove-lifecycle stale

elchead commented 1 year ago

Adding to this: the security group fails to be deleted because it is associated with another SG (not managed by the CCM). In fact, we can see in Terraform that adding the K8s service adds an association to the externally managed SG:

  # aws_security_group.security_group will be updated in-place
  ~ resource "aws_security_group" "security_group" {
        id                     = "sg-0a73b4c28b733b0b8"
      ~ ingress                = [
          - {
              - cidr_blocks      = []
              - description      = ""
              - from_port        = 0
              - ipv6_cidr_blocks = []
              - prefix_list_ids  = []
              - protocol         = "-1"
              - security_groups  = [
                  - "sg-088cece593c30ebfd",                ]
              - self             = false
              - to_port          = 0
            },
            # (6 unchanged elements hidden)
        ]
        name                   = "constell-a1eaed65"
        tags                   = {
            "constellation-uid" = "a1eaed65"
        }
        # (7 unchanged attributes hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.

So the SG deletion is blocked by this association. If we delete the association while the service is being deleted (e.g. through terraform apply), the resources are cleaned up as expected.

elchead commented 1 year ago

We fixed the issue by using the aws-load-balancer-controller and disabling service management in the AWS cloud controller manager (CCM) via this flag: --controllers=*,-service.

It's very confusing that this functionality is split between these two controllers, and that it's documented nowhere that the CCM's service management no longer seems to be actively maintained (see https://github.com/kubernetes/cloud-provider-aws/issues/2030).
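
For reference, with service management disabled, the CCM args from the original report would look like this (only the --controllers flag is new; the load balancer and its security groups are then handled by the aws-load-balancer-controller instead):

--cloud-provider=aws
--leader-elect=true
--allocate-node-cidrs=false
--configure-cloud-routes=false
--controllers=*,-service
-v=2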

JoelSpeed commented 8 months ago

I've been looking into this and can reliably reproduce the issue. The important part here is that the cluster tag is not added to the instance security group.

Whether we are adding a load balancer or deleting it, we use the function updateInstanceSecurityGroupsForLoadBalancer to update the instance security groups and allow traffic from the load balancer security group into the instance security group. It's the rule added by this function that is not cleaned up on delete.

When we are creating or updating the load balancer, the instances map passed into this function contains a list of all instances in the cluster. When we delete the load balancer, a nil map is passed.

The issue here is how we handle actualGroups. The block that handles this describes all security groups that have an IP permission referencing the load balancer security group. It then assumes that we should only modify those that are tagged with the cluster tag. In fact, on removal, we should be modifying any such group, whether it is tagged or not.

The next phase goes through the instances, and this path does allow untagged groups. So we add rules to untagged groups, because findSecurityGroupForInstance returns an untagged group for each instance. But we cannot remove the rules from the untagged group unless we are passed a list of instances that would let us find it again; and that isn't the pattern here, since this phase compares the list of current instances against the groups that presently exist to work out what it should be adding.
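
A quick way to see this from outside the controller, using the group ID from the logs in the original report (IDs and the tag key are placeholders for your own cluster), is to check which groups still reference the orphaned LB security group and whether they carry the cluster tag:

# which security groups still have an ingress rule referencing the LB group?
aws ec2 describe-security-groups \
  --filters Name=ip-permission.group-id,Values=sg-02876532fa250bc14 \
  --query 'SecurityGroups[].{Id:GroupId,Name:GroupName,Tags:Tags}'

# if the instance security group shows up here without a kubernetes.io/cluster/<name>
# tag, that matches the behaviour described above: the rule was added to the untagged
# group on create, but the tag filter skips that group on delete.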

I think the issue here is that we shouldn't be excluding security groups based on the hasClusterTag function when we find the actual groups. This leads to several issues as far as I can tell:

My proposal for fixing this would be to:

k8s-triage-robot commented 6 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

katexochen commented 6 months ago

/remove-lifecycle rotten

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 1 month ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes/cloud-provider-aws/issues/566#issuecomment-2175481283):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.