Closed: datosh closed this issue 1 month ago.
This issue is currently awaiting triage.
If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Does this happen consistently? Did you try manually cleaning up the SG to see why it was failing? I'm curious what is hanging around, causing this issue. Here's a post on troubleshooting the issue.
Depending on what's hanging around, we may or may not want to tackle it here. Perhaps the load balancer is taking longer than expected to be terminated (or at least to report back that it's terminated) and you can just delete the SG without any manual action, or there's some unexpected resource hanging around.
Does this happen consistently?

Yes.
Did you try manually cleaning up the SG to see why it was failing?

I did. The setup is as follows: let's call the "main" security group of my VPC sg-a. When the ELB is created, a new security group sg-b is created (I suspect by cloud-provider-aws) to manage the allowed port mappings. An additional inbound rule is created in sg-a which references sg-b. This rule prevents the deletion of sg-b when I remove the service of type LoadBalancer in K8s.

The kubectl delete ... operation hangs until I manually delete the rule in sg-a. With this manual intervention everything works as expected!
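For readers hitting the same hang, that manual cleanup can be sketched with the AWS CLI roughly as follows. The group IDs sg-a and sg-b are placeholders from the description above, and the rule's protocol/ports will vary with your setup:

```shell
# Find which groups still hold a rule referencing sg-b
# (the group cloud-provider-aws created for the ELB).
aws ec2 describe-security-groups \
  --filters Name=ip-permission.group-id,Values=sg-b \
  --query 'SecurityGroups[].GroupId'

# Revoke the referencing ingress rule from sg-a; protocol -1 here is
# an assumption, adjust to match the rule you actually see.
aws ec2 revoke-security-group-ingress \
  --group-id sg-a \
  --ip-permissions 'IpProtocol=-1,UserIdGroupPairs=[{GroupId=sg-b}]'

# After that, sg-b can be deleted (or the hanging kubectl delete completes).
aws ec2 delete-security-group --group-id sg-b
```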
Depending on what's hanging around, we may or may not want to tackle it here. Perhaps the load balancer is taking longer than expected to be terminated (or at least to report back that it's terminated) and you can just delete the SG without any manual action, or there's some unexpected resource hanging around.
Do you recognize this dependency and could provide me some further pointers?
Could this be a misconfiguration on my side? See the args I pass in my initial message.
I see. Can you share the config you're using to create the service with any private details changed? If you can provide yaml that consistently reproduces this (assuming the issue is not apparent from the config itself), it will help figure out what's actually happening.
I'm seeing the same issue, except that the load balancer does not get deleted in the end: I'm left paying for an ELB with no Service to match it.
I0315 01:33:30.907304 1 aws.go:4887] Waiting for load-balancer to delete so we can delete security groups: traefik
I0315 01:33:41.090973 1 aws.go:4863] Ignoring DependencyViolation while deleting load-balancer security group (sg-04f7679bdc2937e60), assuming because LB is in process of deleting
E0315 01:33:41.091168 1 controller.go:289] error processing service traefik/traefik (will retry): failed to delete load balancer: timed out deleting ELB: traefik. Could not delete security groups sg-04f7679bdc2937e60
I0315 01:33:41.092665 1 event.go:294] "Event occurred" object="traefik/traefik" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to delete load balancer: timed out deleting ELB: traefik. Could not delete security groups sg-04f7679bdc2937e60"
I0315 01:33:41.115600 1 controller.go:890] Removing finalizer from service traefik/traefik
I0315 01:33:41.131181 1 controller.go:916] Patching status for service traefik/traefik
I0315 01:33:41.132914 1 event.go:294] "Event occurred" object="traefik/traefik" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="DeletedLoadBalancer" message="Deleted load balancer"
Any updates on this issue? We're still seeing this on a daily basis.
Sorry @mmerkes, I missed your comment. Here is the config we are using (this is Kubernetes v1.26.3):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "1"
    meta.helm.sh/release-name: constellation-services
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2023-03-30T07:49:59Z"
  generation: 1
  labels:
    app.kubernetes.io/managed-by: Helm
    k8s-app: cloud-controller-manager
  name: cloud-controller-manager
  namespace: kube-system
  resourceVersion: "974"
  uid: 8e90444c-81cf-453e-a4b1-4a94920b61d5
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: cloud-controller-manager
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: cloud-controller-manager
    spec:
      containers:
      - args:
        - --cloud-provider=aws
        - --leader-elect=true
        - --allocate-node-cidrs=false
        - --configure-cloud-routes=false
        - -v=2
        image: registry.k8s.io/provider-aws/cloud-controller-manager:v1.26.1@sha256:2a43d2d5611ba920c49e23127cfd474fb7932fcade1671dddbef757921fcdb40
        imagePullPolicy: IfNotPresent
        name: cloud-controller-manager
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/kubernetes
          name: etckubernetes
          readOnly: true
        - mountPath: /etc/ssl
          name: etcssl
          readOnly: true
        - mountPath: /etc/pki
          name: etcpki
          readOnly: true
      dnsPolicy: ClusterFirst
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: cloud-controller-manager
      serviceAccountName: cloud-controller-manager
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node.cloudprovider.kubernetes.io/uninitialized
        value: "true"
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      - effect: NoSchedule
        key: node-role.kubernetes.io/control-plane
        operator: Exists
      - effect: NoSchedule
        key: node.kubernetes.io/not-ready
      volumes:
      - hostPath:
          path: /etc/kubernetes
          type: ""
        name: etckubernetes
      - hostPath:
          path: /etc/ssl
          type: ""
        name: etcssl
      - hostPath:
          path: /etc/pki
          type: ""
        name: etcpki
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 1
  desiredNumberScheduled: 1
  numberAvailable: 1
  numberMisscheduled: 0
  numberReady: 1
  observedGeneration: 1
  updatedNumberScheduled: 1
@kishorj Can you describe the issue we discussed here? Or create a different tracking issue in lieu of this one?
The loadbalancer service we are deploying looks like this:
apiVersion: v1
kind: Service
metadata:
  name: whoami
  namespace: lb-test
spec:
  selector:
    app: whoami
  ports:
  - port: 8080
    targetPort: 80
  type: LoadBalancer
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: whoami
  namespace: lb-test
  labels:
    app: whoami
spec:
  replicas: 3
  selector:
    matchLabels:
      app: whoami
  template:
    metadata:
      labels:
        app: whoami
    spec:
      containers:
      - name: whoami
        image: traefik/whoami:v1.8.7
        ports:
        - containerPort: 80
        args:
        - "--verbose"
As the security group leaks, Terraform can't destroy the infrastructure afterwards, and fails with
Error: deleting EC2 VPC (vpc-02d83a11b3188a569): DependencyViolation: The vpc 'vpc-02d83a11b3188a569' has dependencies and cannot be deleted.
status code: 400, request id: c597c847-7023-42cf-a1dc-bd919cde0ff0
Let me know if I can provide any further information needed to debug.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
Adding to this: the security group fails to be deleted because it is associated with another SG (not managed by the CCM). In fact, we can see in Terraform that adding the K8s service adds an association to the externally managed SG:
# aws_security_group.security_group will be updated in-place
~ resource "aws_security_group" "security_group" {
        id      = "sg-0a73b4c28b733b0b8"
      ~ ingress = [
          - {
              - cidr_blocks      = []
              - description      = ""
              - from_port        = 0
              - ipv6_cidr_blocks = []
              - prefix_list_ids  = []
              - protocol         = "-1"
              - security_groups  = [
                  - "sg-088cece593c30ebfd",
                ]
              - self             = false
              - to_port          = 0
            },
            # (6 unchanged elements hidden)
        ]
        name    = "constell-a1eaed65"
        tags    = {
            "constellation-uid" = "a1eaed65"
        }
        # (7 unchanged attributes hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.
So the SG deletion is blocked due to this association. If we delete the association during the deletion of the service (e.g. through terraform apply), resources are cleaned up as expected.
We fixed the issue by using the aws-load-balancer-controller and disabling service management in the AWS cloud controller manager (CCM) through this flag: --controllers=*,-service.

It's very confusing that the functionality is split between these two controllers, and that it's nowhere documented that the CCM's service management no longer seems to be actively maintained (see https://github.com/kubernetes/cloud-provider-aws/issues/2030).
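For reference, a sketch of what that workaround looks like in the CCM container args (based on the DaemonSet manifest earlier in this thread; the aws-load-balancer-controller itself is installed separately, e.g. via its Helm chart):

```yaml
# CCM args with the in-tree service controller disabled (sketch):
- args:
  - --cloud-provider=aws
  - --leader-elect=true
  - --allocate-node-cidrs=false
  - --configure-cloud-routes=false
  - --controllers=*,-service  # keep all controllers except "service"
  - -v=2
```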
I've been looking into this issue and can reliably reproduce it. The important part here is the cluster tag not being added to the instance security group.
Whether we are adding a load balancer or deleting it, we use the function updateInstanceSecurityGroupsForLoadBalancer to update the instance security groups and allow traffic from the load balancer security group to flow into the instance security group. It is the rule added by this function that is not cleaned up on delete.
When we are creating or updating the load balancer, the instances map passed into this function contains a list of all instances in the cluster. When we delete the load balancer, a nil map is passed.
The issue here is how we handle actualGroups. The block that handles this describes all security groups that have an IP permission referencing the load balancer security group. It then assumes that we should only modify those that are tagged with the cluster tag. In fact, on removal only, we should be modifying any such group, whether it is tagged or not.

The next phase goes through the instances, and this path allows untagged groups. So we add rules to untagged groups, because findSecurityGroupForInstance returns an untagged group for the instance. But we cannot remove the rules from the untagged group unless we are passed a list of instances that would allow us to find them, and that's not the desired pattern here, since the function compares the list of current instances with the list of groups that currently exist to work out what it should be adding.
I think the issue here is that we shouldn't be excluding security groups based on the hasClusterTag function when we find the actual groups. This leads to several issues as far as I can tell.
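To make the filtering concrete, here is a minimal, self-contained Go sketch of the behavior described above. This is not the actual cloud-provider-aws code: the securityGroup type and the groupsToClean helper are hypothetical stand-ins for the actualGroups handling.

```go
// Simplified simulation of the cluster-tag filtering discussed above.
// All names and fields here are illustrative, not the real controller code.
package main

import "fmt"

type securityGroup struct {
	id             string
	clusterTagged  bool // has the kubernetes.io/cluster/<name> tag
	referencesLBSG bool // has an ingress rule referencing the LB's SG
}

// groupsToClean lists the groups whose rules reference the load balancer
// security group and would need their rules revoked. With
// requireClusterTag=true it mimics the current behavior, which skips
// untagged groups; with false it cleans every referencing group.
func groupsToClean(groups []securityGroup, requireClusterTag bool) []string {
	var out []string
	for _, g := range groups {
		if !g.referencesLBSG {
			continue
		}
		if requireClusterTag && !g.clusterTagged {
			continue // untagged group skipped: its rule leaks, SG delete hangs
		}
		out = append(out, g.id)
	}
	return out
}

func main() {
	groups := []securityGroup{
		{id: "sg-a", clusterTagged: false, referencesLBSG: true}, // user-managed VPC group
		{id: "sg-c", clusterTagged: true, referencesLBSG: true},  // cluster-managed group
	}
	// Current delete path: only the tagged group is cleaned.
	fmt.Println(groupsToClean(groups, true))
	// Cleaning every referencing group would unblock the SG deletion.
	fmt.Println(groupsToClean(groups, false))
}
```

On the delete path, dropping the tag requirement corresponds to the proposal below of operating on every referencing group, managed or not.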
My proposal for fixing this would be to:
- Add an isDeleting bool parameter to the function
- When isDeleting is false and the code determines that a securityGroupID should be removed but is unmanaged, it does not operate on it
- When isDeleting is true, remove all securityGroupIDs, whether managed or unmanaged

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
What happened:
I have deployed the CCM in my cluster for cloud provider AWS. When creating a Service of type LoadBalancer, the resources in AWS (security group & LB) are created correctly. When I delete the Service of type LoadBalancer in K8s, the CCM is unable to delete the security group. See relevant logs:
In the end, only the load balancer is deleted. The security group remains.
This prevents me from later deleting my VPC via Terraform, since the security group created by the CCM is unknown to Terraform, and terraform destroy fails.

What you expected to happen:
I expected the security group that was created by the CCM for the LB to be deleted when the Service of type LoadBalancer was deleted.
How to reproduce it (as minimally and precisely as possible):
Create a Service of type LoadBalancer. Wait for the security group & LB to be created in AWS. Delete the Service of type LoadBalancer.
Anything else we need to know?:
CCM is deployed with the following args
Environment:
- Kubernetes version (use kubectl version): v1.25.6
- Kernel (e.g. uname -a):

/kind bug