kubernetes-sigs / aws-load-balancer-controller

A Kubernetes controller for Elastic Load Balancers
https://kubernetes-sigs.github.io/aws-load-balancer-controller/
Apache License 2.0

Security Group unexpectedly revoked by Controller #3637

Open cjhawkins opened 2 months ago

cjhawkins commented 2 months ago

Describe the bug

The aws-load-balancer-controller revoked the inbound rule that allows the Shared Backend SecurityGroup for LoadBalancer into the security group used by the EKS nodes, causing an outage. After a few minutes, the controller added the rule back.

Steps to reproduce

I have been unable to reproduce this.

Expected outcome

Security groups are not modified in a way that makes pods unreachable.

Environment

Additional Context: I recently had an issue where the load balancer controller revoked the Shared Backend SecurityGroup for LoadBalancer (sg-0fe###) as an Inbound rule from the security group used by nodes in our EKS cluster (sg-085###). This caused requests to the cluster to return a 503 error.

This is the CloudTrail log for the breaking security group change:

{
    "eventVersion": "1.09",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "#####################:######",
        "arn": "arn:aws:sts::############:assumed-role/irsa-production-alb-load-balancer-controller/######",
        "accountId": "############",
        "accessKeyId": "#####################",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "#####################",
                "arn": "arn:aws:iam::############:role/irsa-production-alb-load-balancer-controller",
                "accountId": "############",
                "userName": "irsa-production-alb-load-balancer-controller"
            },
            "webIdFederationData": {
                "federatedProvider": "arn:aws:iam::############:oidc-provider/oidc.eks.ca-central-1.amazonaws.com/id/#####################",
                "attributes": {}
            },
            "attributes": {
                "creationDate": "2024-03-24T14:32:16Z",
                "mfaAuthenticated": "false"
            }
        }
    },
    "eventTime": "2024-03-24T14:39:04Z",
    "eventSource": "ec2.amazonaws.com",
    "eventName": "RevokeSecurityGroupIngress",
    "awsRegion": "ca-central-1",
    "sourceIPAddress": "###.###.###.###",
    "userAgent": "elbv2.k8s.aws/v2.4.1 aws-sdk-go/1.42.27 (go1.17.8; linux; amd64)",
    "requestParameters": {
        "groupId": "sg-085###",
        "ipPermissions": {
            "items": [
                {
                    "ipProtocol": "tcp",
                    "fromPort": 80,
                    "toPort": 80,
                    "groups": {
                        "items": [
                            {
                                "userId": "############",
                                "groupId": "sg-0fe###",
                                "description": "elbv2.k8s.aws/targetGroupBinding=shared"
                            }
                        ]
                    },
                    "ipRanges": {},
                    "ipv6Ranges": {},
                    "prefixListIds": {}
                }
            ]
        }
    },
    "responseElements": {
        "requestId": "c0db6584-48ed-415b-9762-3feccf1789fa",
        "_return": true,
        "revokedSecurityGroupRuleSet": {
            "items": [
                {
                    "groupId": "sg-085###",
                    "securityGroupRuleId": "sgr-0f1bc2e98ba0f9786",
                    "description": "elbv2.k8s.aws/targetGroupBinding=shared",
                    "isEgress": false,
                    "ipProtocol": "tcp",
                    "fromPort": 80,
                    "toPort": 80,
                    "referencedGroupId": "sg-0fe###"
                }
            ]
        }
    },
    "requestID": "c0db6584-48ed-415b-9762-3feccf1789fa",
    "eventID": "ac2601f1-5ed9-42fa-ba40-99d92d7e53bf",
    "readOnly": false,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "############",
    "eventCategory": "Management",
    "tlsDetails": {
        "tlsVersion": "TLSv1.3",
        "cipherSuite": "TLS_AES_128_GCM_SHA256",
        "clientProvidedHostHeader": "ec2.ca-central-1.amazonaws.com"
    }
}
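For anyone comparing several of these events, the following is a minimal sketch (not part of the controller or of any AWS tooling) that reduces a CloudTrail security group event to one line per rule. It reads only fields that appear in the event above. The function name `summarize_sg_change` and the trimmed `EVENT_JSON` string are made up for illustration.

```python
import json

# Trimmed copy of the RevokeSecurityGroupIngress event above; only the
# fields this sketch reads are kept. IDs are redacted as in the issue.
EVENT_JSON = """
{
  "eventTime": "2024-03-24T14:39:04Z",
  "eventName": "RevokeSecurityGroupIngress",
  "requestParameters": {
    "groupId": "sg-085###",
    "ipPermissions": {
      "items": [
        {
          "ipProtocol": "tcp",
          "fromPort": 80,
          "toPort": 80,
          "groups": {
            "items": [
              {"groupId": "sg-0fe###",
               "description": "elbv2.k8s.aws/targetGroupBinding=shared"}
            ]
          }
        }
      ]
    }
  }
}
"""


def summarize_sg_change(event):
    """Return one line per (permission, source security group) pair."""
    params = event["requestParameters"]
    lines = []
    for perm in params["ipPermissions"]["items"]:
        # "groups" may be absent or empty for CIDR-based rules.
        for group in perm.get("groups", {}).get("items", []):
            lines.append(
                f'{event["eventTime"]} {event["eventName"]}: '
                f'{group["groupId"]} -> {params["groupId"]} '
                f'{perm["ipProtocol"]}/{perm["fromPort"]}-{perm["toPort"]} '
                f'({group.get("description", "")})'
            )
    return lines


summary = summarize_sg_change(json.loads(EVENT_JSON))
for line in summary:
    print(line)
```

Run against the event above, this shows that the revoked rule was the tcp/80 ingress from sg-0fe### into sg-085###, tagged with the `elbv2.k8s.aws/targetGroupBinding=shared` description.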

A few minutes later, the ingress controller added the Shared Backend SecurityGroup for LoadBalancer back as an inbound rule and the cluster started serving requests again.

This is the CloudTrail log for the security group change that fixed the issue:

{
    "eventVersion": "1.09",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "#####################:######",
        "arn": "arn:aws:sts::############:assumed-role/irsa-production-alb-load-balancer-controller/######",
        "accountId": "############",
        "accessKeyId": "#####################",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "#####################",
                "arn": "arn:aws:iam::############:role/irsa-production-alb-load-balancer-controller",
                "accountId": "############",
                "userName": "irsa-production-alb-load-balancer-controller"
            },
            "webIdFederationData": {
                "federatedProvider": "arn:aws:iam::############:oidc-provider/oidc.eks.ca-central-1.amazonaws.com/id/#####################",
                "attributes": {}
            },
            "attributes": {
                "creationDate": "2024-03-24T14:32:16Z",
                "mfaAuthenticated": "false"
            }
        }
    },
    "eventTime": "2024-03-24T14:47:32Z",
    "eventSource": "ec2.amazonaws.com",
    "eventName": "AuthorizeSecurityGroupIngress",
    "awsRegion": "ca-central-1",
    "sourceIPAddress": "###.###.###.###",
    "userAgent": "elbv2.k8s.aws/v2.4.1 aws-sdk-go/1.42.27 (go1.17.8; linux; amd64)",
    "requestParameters": {
        "groupId": "sg-085###",
        "ipPermissions": {
            "items": [
                {
                    "ipProtocol": "tcp",
                    "fromPort": 80,
                    "toPort": 80,
                    "groups": {
                        "items": [
                            {
                                "groupId": "sg-0fe###",
                                "description": "elbv2.k8s.aws/targetGroupBinding=shared"
                            }
                        ]
                    },
                    "ipRanges": {},
                    "ipv6Ranges": {},
                    "prefixListIds": {}
                }
            ]
        }
    },
    "responseElements": {
        "requestId": "ac85fba4-09b3-4a76-95be-21de5879907a",
        "_return": true,
        "securityGroupRuleSet": {
            "items": [
                {
                    "groupOwnerId": "############",
                    "groupId": "sg-085###",
                    "securityGroupRuleId": "sgr-066###",
                    "description": "elbv2.k8s.aws/targetGroupBinding=shared",
                    "isEgress": false,
                    "ipProtocol": "tcp",
                    "fromPort": 80,
                    "toPort": 80,
                    "referencedGroupInfo": {
                        "userId": "############",
                        "groupId": "sg-0fe###"
                    }
                }
            ]
        }
    },
    "requestID": "ac85fba4-09b3-4a76-95be-21de5879907a",
    "eventID": "9de31dfa-9982-4af8-a5cb-0fc3eef65dee",
    "readOnly": false,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "############",
    "eventCategory": "Management",
    "tlsDetails": {
        "tlsVersion": "TLSv1.3",
        "cipherSuite": "TLS_AES_128_GCM_SHA256",
        "clientProvidedHostHeader": "ec2.ca-central-1.amazonaws.com"
    }
}
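To put a number on "a few minutes", here is a small sketch that computes how long the rule was missing from the `eventTime` values of the two CloudTrail events above. The helper names are hypothetical. It assumes both timestamps are in the UTC `...Z` format shown in the logs.

```python
from datetime import datetime

# eventTime values copied from the two CloudTrail events above.
REVOKE_TIME = "2024-03-24T14:39:04Z"      # RevokeSecurityGroupIngress
AUTHORIZE_TIME = "2024-03-24T14:47:32Z"   # AuthorizeSecurityGroupIngress


def parse_cloudtrail_time(value):
    """Parse a CloudTrail eventTime string such as 2024-03-24T14:39:04Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def rule_gap_seconds(revoked_at, authorized_at):
    """Seconds between the revoke event and the later authorize event."""
    delta = parse_cloudtrail_time(authorized_at) - parse_cloudtrail_time(revoked_at)
    return int(delta.total_seconds())


gap = rule_gap_seconds(REVOKE_TIME, AUTHORIZE_TIME)
print(f"rule missing for {gap // 60} min {gap % 60} s")  # 8 min 28 s
```

According to these two events, the inbound rule was absent for 508 seconds, which lines up with the window in which requests returned 503.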

This probably doesn't add much, but these are the logs from around the time the security group changes were made:

[Image: Screenshot 2024-04-05 at 11 18 42 AM]

Not sure if it's relevant, but the EKS cluster has 5 different ALBs in front of it, each with its own domain, certificate, and OIDC configuration.

Any insights into what caused this change, and how to make sure it doesn't happen again? Thank you.

M00nF1sh commented 2 months ago

@cjhawkins Are the logs of the controller pod still around? The controller pod logs should show why it decided to remove the security group rule from the worker nodes.

Would you mind cutting a ticket to EKS support with the controller logs?