kubernetes-sigs / aws-load-balancer-controller

A Kubernetes controller for Elastic Load Balancers
https://kubernetes-sigs.github.io/aws-load-balancer-controller/
Apache License 2.0

Incompatibility with non-EKS clusters. #3708

Open is-it-ayush opened 4 months ago

is-it-ayush commented 4 months ago

Describe the bug I was trying to expose my services to the internet on a self-managed kubeadm cluster running on AWS EC2 in a secure way. The only way to expose cluster services to the outside internet is to provision an ALB or NLB from AWS. However, I noticed two problems in my experiments while trying to provision load balancers with aws-load-balancer-controller.

  1. ALB provisioning fails entirely on non-EKS clusters when a TLS spec is present in the Ingress spec. The issue here is that the certificate is issued by Let's Encrypt & managed by cert-manager, while aws-load-balancer-controller expects the certificate to be present in AWS ACM.
  2. NLB TargetGroup provisioning fails on non-EKS clusters because the nodes do not contain a providerID, and therefore aws-load-balancer-controller cannot add the instances in public subnets (which it auto-discovers via tags) to the TargetGroup that the NLB forwards traffic to.
    • The NLB gets created but suffers from the TLS problem described in 1.)

Steps to reproduce

  1. Set up a non-EKS cluster through any k8s distro on AWS EC2. I used kubeadm.
  2. Install aws-load-balancer-controller by following the install instructions in the docs!
  3. For NLB: Try to expose any service through aws-load-balancer-controller such that it would provision an NLB in the cloud (see the example manifests after this list).
    • The NLB gets created but fails to work when the service is annotated with TLS annotations from aws-load-balancer-controller.
    • TargetGroup is empty.
  4. For ALB: Try to expose an Ingress that routes a path to a service. Ensure the service has loadBalancerClass set to anything other than service.k8s.aws/nlb, such as loadBalancerClass: "none", to prevent aws-load-balancer-controller from provisioning an NLB instead of an ALB.
    • The ALB isn't created in the cloud when a TLS spec is present in the Ingress spec.
    • TargetGroup is empty.
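
For reference, manifests along these lines reproduce both cases. This is a minimal sketch: the names, hostname and ACM ARN are placeholders, not my exact setup.

    # Step 3 (NLB): Service annotated with TLS annotations from aws-load-balancer-controller
    apiVersion: v1
    kind: Service
    metadata:
      name: nginx-svc
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
        # placeholder ARN; this is where the ACM requirement bites on non-EKS clusters
        service.beta.kubernetes.io/aws-load-balancer-ssl-cert: arn:aws:acm:<region>:<account>:certificate/<id>
        service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "443"
    spec:
      type: LoadBalancer
      loadBalancerClass: service.k8s.aws/nlb
      selector:
        app: nginx
      ports:
        - port: 443
          targetPort: 80
    ---
    # Step 4 (ALB): Ingress with a TLS spec; its backing Service (app-svc) uses loadBalancerClass: "none"
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: app-ingress
      annotations:
        alb.ingress.kubernetes.io/scheme: internet-facing
    spec:
      ingressClassName: alb
      tls:
        - hosts:
            - example.com
          secretName: example-com-tls   # issued by Let's Encrypt via cert-manager
      rules:
        - host: example.com
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: app-svc
                    port:
                      number: 443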

Expected outcome

Without the above, securing internet-exposed services running on AWS is impossible on non-EKS clusters. This prevents the use of ALB & NLB in production and makes non-EKS clusters impractical to automate within AWS.

Environment

Additional Context:

  1. I've been through cert-manager/issues/333 and it seems like the issue was resolved for certificates issued by AWS ACM, aka private certificates, with aws-privateca-issuer. My use case is different: I'm seeking a way to upload the certificates issued by Let's Encrypt to AWS ACM and then tag my Ingress/Service resource with the certificate ARN returned from the ImportCertificate API call.
  2. I've been through aws-load-balancer-controller/issues/3178, and aws-load-balancer-controller attempts to auto-discover the certificate on ACM based on the hostname in the TLS spec. If 1. is resolved, this should be resolved too.
  3. I've been through aws-controllers-k8s/community/issues/482 and it seems the discussion about uploading certificates issued by other CAs was left unfinished.
oliviassss commented 4 months ago

@is-it-ayush, thanks for the details.

  1. As a security requirement, the controller requires the cert to be managed by ACM, so I don't think there's a way to bypass this for now.
  2. We currently rely on the existence of ProviderID on the node to resolve endpoints etc., as here. For your environment, what does the node spec look like? Can you help us understand the use case? I'm not sure whether we should relax this restriction on non-EKS clusters.
is-it-ayush commented 4 months ago

Thank you for your response @oliviassss!

As a security requirement, the controller requires the cert to be managed by ACM, so I don't think there's a way to bypass this for now.

To meet that security requirement I had a look at the ImportCertificate call. It seems like importing a certificate into ACM only requires the certificate body and its private key (plus an optional certificate chain).

Both of these are present in the Kubernetes cluster as secrets and are managed by cert-manager. I can write a controller to call the ImportCertificate endpoint & upload the certificate and its private key to ACM's inventory when they're ready. This way aws-load-balancer-controller would be able to auto-detect the certificate based on the hostname for the ALB. For the NLB, my controller would grab the ARN returned after the upload and annotate the Service so that the NLB could find that certificate too. I've never written a controller before, but I would surely like to know what you think of this solution?
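
For reference, the manual equivalent of what such a controller would automate looks roughly like this. An untested sketch: the secret name is a placeholder, and the chain handling may need adjusting since cert-manager concatenates the chain into tls.crt while ACM wants the leaf and chain separately.

    # extract the cert-manager managed TLS secret (hypothetical secret name)
    kubectl get secret example-com-tls -o jsonpath='{.data.tls\.crt}' | base64 -d > tls.crt
    kubectl get secret example-com-tls -o jsonpath='{.data.tls\.key}' | base64 -d > tls.key

    # import into ACM; ImportCertificate returns the certificate ARN
    ARN=$(aws acm import-certificate \
      --certificate fileb://tls.crt \
      --private-key fileb://tls.key \
      --query CertificateArn --output text)

    # for the NLB case, annotate the Service with the returned ARN
    kubectl annotate service nginx-svc \
      "service.beta.kubernetes.io/aws-load-balancer-ssl-cert=$ARN" --overwrite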

We currently rely on the existence of ProviderID on the node to resolve endpoints etc., as here. For your environment, what does the node spec look like? Can you help us understand the use case? I'm not sure whether we should relax this restriction on non-EKS clusters.

This restriction makes it entirely impossible to automate AWS & Kubernetes running on non-EKS clusters on EC2. To give a bit more context on the previous sentence: I've tried other ways to expose my cluster services to the public internet "without relying on ELB". They do not work due to how the VPC is designed. That leaves me with only one solution, i.e. provision an ELB to expose the service from my non-EKS cluster to the internet. In order to make those services safer, I need TLS, i.e. certificates, to work on the ELB (ALB/NLB). This is a pain point because aws-load-balancer-controller relies on ACM-managed certificates and on the providerID being set on nodes.

For now my use case is to learn Kubernetes on the cloud and I'm just experimenting with everything. I plan on using this setup in production very soon if it works well. Here's the information you requested:

oliviassss commented 4 months ago

@is-it-ayush, thanks for the details. For the ACM one, I need to talk with our security engineer; we need security approval for such a feature, but would you help by opening a new issue as a feature request for this? I've reopened this issue, and will discuss internally to see if we can relax the restriction for non-EKS nodes. Thanks.

M00nF1sh commented 4 months ago

@is-it-ayush Adding to what @oliviassss said, it's actually expected to have "providerID" set on non-EKS nodes as well, e.g. if you create clusters using kops, then providerID will be set up correctly. I'm not sure how you used kubeadm to set up the cluster; the providerID is set up in the following ways:

  1. If your nodeName matches the "private-dns" name of the instance in AWS, then the providerID will be automatically detected from the nodeName and set up correctly.
  2. If your nodeName doesn't match the "private-dns" name of the instance in AWS, you need to provide --provider-id in your kubelet configuration.

Otherwise, there isn't a way to determine the instance ID for your node.
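
For kubeadm, passing the flag could look roughly like this in the join configuration (an untested sketch; the zone and instance ID are placeholders for each node's values, discovery fields omitted):

    apiVersion: kubeadm.k8s.io/v1beta3
    kind: JoinConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        # format: aws:///<availability-zone>/<instance-id>
        provider-id: "aws:///ap-south-1a/i-0123456789abcdef0"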

is-it-ayush commented 4 months ago

For the ACM one, I need to talk with our security engineer; we need security approval for such a feature, but would you help by opening a new issue as a feature request for this?

Sure @oliviassss! Here's the new issue isolating the certificate problem as a feature request.

I've reopened this https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/3487#issue-1997547229, and will discuss internally to see if we can relax the restriction for non-EKS nodes. Thanks.

Thank you!

is-it-ayush commented 4 months ago
  1. If your nodeName matches the "private-dns" name of the instance in AWS, then the providerID will be automatically detected from the nodeName and set up correctly.
  2. If your nodeName doesn't match the "private-dns" name of the instance in AWS, you need to provide --provider-id in your kubelet configuration.

Thanks @M00nF1sh! Yep, my node names do not really match the private-dns name. I also didn't set the --provider-id flag in the kubelet configuration. I'll retry this on a new cluster.
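
In case it helps anyone else, the value for --provider-id can be derived on each node from instance metadata, roughly like this (untested sketch, IMDSv2):

    TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
      -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
    AZ=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
      http://169.254.169.254/latest/meta-data/placement/availability-zone)
    INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
      http://169.254.169.254/latest/meta-data/instance-id)
    echo "--provider-id=aws:///$AZ/$INSTANCE_ID"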

is-it-ayush commented 4 months ago

Hi @M00nF1sh! I retried the providerID problem on a fresh cluster where all nodes had the private-dns name as their hostname. It seems like kubeadm doesn't really add providerID on its own, and the TargetGroups are still empty. Here are the important log outputs:

  1. kubectl -n kube-system logs pod/aws-load-balancer-controller-5875bb459b-5qxxl: The last log line is important & you can see the node name matches the private-dns of the instance.
    {"level":"info","ts":"2024-05-23T06:47:55Z","msg":"version","GitVersion":"v2.8.0","GitCommit":"6afa4042433bd7b92b7ceb7807e99b51c0c3af23","BuildDate":"2024-05-17T20:09:33+0000"}
    {"level":"info","ts":"2024-05-23T06:48:04Z","logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
    {"level":"info","ts":"2024-05-23T06:48:04Z","logger":"setup","msg":"adding health check for controller"}
    {"level":"info","ts":"2024-05-23T06:48:04Z","logger":"setup","msg":"adding readiness check for webhook"}
    {"level":"info","ts":"2024-05-23T06:48:04Z","logger":"controller-runtime.webhook","msg":"Registering webhook","path":"/mutate-v1-pod"}
    {"level":"info","ts":"2024-05-23T06:48:04Z","logger":"controller-runtime.webhook","msg":"Registering webhook","path":"/mutate-v1-service"}
    {"level":"info","ts":"2024-05-23T06:48:04Z","logger":"controller-runtime.webhook","msg":"Registering webhook","path":"/validate-elbv2-k8s-aws-v1beta1-ingressclassparams"}
    {"level":"info","ts":"2024-05-23T06:48:04Z","logger":"controller-runtime.webhook","msg":"Registering webhook","path":"/mutate-elbv2-k8s-aws-v1beta1-targetgroupbinding"}
    {"level":"info","ts":"2024-05-23T06:48:04Z","logger":"controller-runtime.webhook","msg":"Registering webhook","path":"/validate-elbv2-k8s-aws-v1beta1-targetgroupbinding"}
    {"level":"info","ts":"2024-05-23T06:48:04Z","logger":"controller-runtime.webhook","msg":"Registering webhook","path":"/validate-networking-v1-ingress"}
    {"level":"info","ts":"2024-05-23T06:48:04Z","logger":"setup","msg":"starting podInfo repo"}
    {"level":"info","ts":"2024-05-23T06:48:06Z","logger":"controller-runtime.webhook.webhooks","msg":"Starting webhook server"}
    {"level":"info","ts":"2024-05-23T06:48:06Z","msg":"Starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
    {"level":"info","ts":"2024-05-23T06:48:06Z","msg":"Starting server","kind":"health probe","addr":"[::]:61779"}
    {"level":"info","ts":"2024-05-23T06:48:06Z","logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
    {"level":"info","ts":"2024-05-23T06:48:06Z","logger":"controller-runtime.webhook","msg":"Serving webhook server","host":"","port":9443}
    {"level":"info","ts":"2024-05-23T06:48:06Z","logger":"controller-runtime.certwatcher","msg":"Starting certificate watcher"}
    I0523 06:48:07.037232       1 leaderelection.go:248] attempting to acquire leader lease kube-system/aws-load-balancer-controller-leader...
    I0523 06:48:07.070965       1 leaderelection.go:258] successfully acquired lease kube-system/aws-load-balancer-controller-leader
    {"level":"info","ts":"2024-05-23T06:48:07Z","msg":"Starting EventSource","controller":"ingress","source":"channel source: 0xc000554320"}
    {"level":"info","ts":"2024-05-23T06:48:07Z","msg":"Starting EventSource","controller":"ingress","source":"channel source: 0xc000554370"}
    {"level":"info","ts":"2024-05-23T06:48:07Z","msg":"Starting EventSource","controller":"ingress","source":"kind source: *v1.Ingress"}
    {"level":"info","ts":"2024-05-23T06:48:07Z","msg":"Starting EventSource","controller":"ingress","source":"kind source: *v1.Service"}
    {"level":"info","ts":"2024-05-23T06:48:07Z","msg":"Starting EventSource","controller":"ingress","source":"channel source: 0xc0005543c0"}
    {"level":"info","ts":"2024-05-23T06:48:07Z","msg":"Starting EventSource","controller":"ingress","source":"channel source: 0xc000554410"}
    {"level":"info","ts":"2024-05-23T06:48:07Z","msg":"Starting EventSource","controller":"ingress","source":"kind source: *v1beta1.IngressClassParams"}
    {"level":"info","ts":"2024-05-23T06:48:07Z","msg":"Starting EventSource","controller":"ingress","source":"kind source: *v1.IngressClass"}
    {"level":"info","ts":"2024-05-23T06:48:07Z","msg":"Starting Controller","controller":"ingress"}
    {"level":"info","ts":"2024-05-23T06:48:07Z","msg":"Starting EventSource","controller":"targetGroupBinding","controllerGroup":"elbv2.k8s.aws","controllerKind":"TargetGroupBinding","source":"kind source: *v1beta1.TargetGroupBinding"}
    {"level":"info","ts":"2024-05-23T06:48:07Z","msg":"Starting EventSource","controller":"targetGroupBinding","controllerGroup":"elbv2.k8s.aws","controllerKind":"TargetGroupBinding","source":"kind source: *v1.Service"}
    {"level":"info","ts":"2024-05-23T06:48:07Z","msg":"Starting EventSource","controller":"targetGroupBinding","controllerGroup":"elbv2.k8s.aws","controllerKind":"TargetGroupBinding","source":"kind source: *v1.Endpoints"}
    {"level":"info","ts":"2024-05-23T06:48:07Z","msg":"Starting EventSource","controller":"targetGroupBinding","controllerGroup":"elbv2.k8s.aws","controllerKind":"TargetGroupBinding","source":"kind source: *v1.Node"}
    {"level":"info","ts":"2024-05-23T06:48:07Z","msg":"Starting Controller","controller":"targetGroupBinding","controllerGroup":"elbv2.k8s.aws","controllerKind":"TargetGroupBinding"}
    {"level":"info","ts":"2024-05-23T06:48:07Z","msg":"Starting EventSource","controller":"service","source":"kind source: *v1.Service"}
    {"level":"info","ts":"2024-05-23T06:48:07Z","msg":"Starting Controller","controller":"service"}
    {"level":"info","ts":"2024-05-23T06:48:07Z","msg":"Starting workers","controller":"ingress","worker count":3}
    {"level":"info","ts":"2024-05-23T06:48:07Z","msg":"Starting workers","controller":"service","worker count":3}
    {"level":"info","ts":"2024-05-23T06:48:07Z","msg":"Starting workers","controller":"targetGroupBinding","controllerGroup":"elbv2.k8s.aws","controllerKind":"TargetGroupBinding","worker count":3}
    {"level":"info","ts":"2024-05-23T06:56:59Z","msg":"service already has loadBalancerClass, skipping","service":"nginx-svc","loadBalancerClass":"service.k8s.aws/nlb"}
    {"level":"info","ts":"2024-05-23T06:57:06Z","logger":"backend-sg-provider","msg":"created SecurityGroup","name":"k8s-traffic-kubernetes-94abcb2d27","id":"sg-0ec573b50bad084de"}
    {"level":"info","ts":"2024-05-23T06:57:06Z","logger":"controllers.service","msg":"successfully built model","model":"{\"id\":\"default/nginx-svc\",\"resources\":{\"AWS::EC2::SecurityGroup\":{\"ManagedLBSecurityGroup\":{\"spec\":{\"groupName\":\"k8s-default-nginxsvc-9601482c4d\",\"description\":\"[k8s] Managed SecurityGroup for LoadBalancer\",\"ingress\":[{\"ipProtocol\":\"tcp\",\"fromPort\":443,\"toPort\":443,\"ipRanges\":[{\"cidrIP\":\"0.0.0.0/0\"}]}]}}},\"AWS::ElasticLoadBalancingV2::Listener\":{\"443\":{\"spec\":{\"loadBalancerARN\":{\"$ref\":\"#/resources/AWS::ElasticLoadBalancingV2::LoadBalancer/LoadBalancer/status/loadBalancerARN\"},\"port\":443,\"protocol\":\"TCP\",\"defaultActions\":[{\"type\":\"forward\",\"forwardConfig\":{\"targetGroups\":[{\"targetGroupARN\":{\"$ref\":\"#/resources/AWS::ElasticLoadBalancingV2::TargetGroup/default/nginx-svc:443/status/targetGroupARN\"}}]}}]}}},\"AWS::ElasticLoadBalancingV2::LoadBalancer\":{\"LoadBalancer\":{\"spec\":{\"name\":\"nginx-svc-balancer\",\"type\":\"network\",\"scheme\":\"internet-facing\",\"ipAddressType\":\"ipv4\",\"subnetMapping\":[{\"subnetID\":\"subnet-008a79b5595c87150\"},{\"subnetID\":\"subnet-01e410654881e02d5\"},{\"subnetID\":\"subnet-08eef5412ecf3994d\"}],\"securityGroups\":[{\"$ref\":\"#/resources/AWS::EC2::SecurityGroup/ManagedLBSecurityGroup/status/groupID\"},\"sg-0ec573b50bad084de\"]}}},\"AWS::ElasticLoadBalancingV2::TargetGroup\":{\"default/nginx-svc:443\":{\"spec\":{\"name\":\"k8s-default-nginxsvc-3b278deb2a\",\"targetType\":\"instance\",\"port\":32646,\"protocol\":\"TCP\",\"ipAddressType\":\"ipv4\",\"healthCheckConfig\":{\"port\":\"traffic-port\",\"protocol\":\"TCP\",\"intervalSeconds\":10,\"timeoutSeconds\":10,\"healthyThresholdCount\":3,\"unhealthyThresholdCount\":3},\"targetGroupAttributes\":[{\"key\":\"proxy_protocol_v2.enabled\",\"value\":\"false\"}]}}},\"K8S::ElasticLoadBalancingV2::TargetGroupBinding\":{\"default/nginx-svc:443\":{\"spec\":{\"template\":{\"metadata\":{\"name\":\"k8s-default-nginxsvc-3b278deb2a\",\"namespace\":\"default\",\"creationTimestamp\":null},\"spec\":{\"targetGroupARN\":{\"$ref\":\"#/resources/AWS::ElasticLoadBalancingV2::TargetGroup/default/nginx-svc:443/status/targetGroupARN\"},\"targetType\":\"instance\",\"serviceRef\":{\"name\":\"nginx-svc\",\"port\":443},\"networking\":{\"ingress\":[{\"from\":[{\"securityGroup\":{\"groupID\":\"sg-0ec573b50bad084de\"}}],\"ports\":[{\"protocol\":\"TCP\",\"port\":32646}]}]},\"ipAddressType\":\"ipv4\",\"vpcID\":\"vpc-055f044dfeb185021\"}}}}}}}"}
    {"level":"info","ts":"2024-05-23T06:57:07Z","logger":"controllers.service","msg":"creating securityGroup","resourceID":"ManagedLBSecurityGroup"}
    {"level":"info","ts":"2024-05-23T06:57:07Z","logger":"controllers.service","msg":"created securityGroup","resourceID":"ManagedLBSecurityGroup","securityGroupID":"sg-035464079c9ab32d7"}
    {"level":"info","ts":"2024-05-23T06:57:07Z","msg":"authorizing securityGroup ingress","securityGroupID":"sg-035464079c9ab32d7","permission":[{"FromPort":443,"IpProtocol":"tcp","IpRanges":[{"CidrIp":"0.0.0.0/0","Description":""}],"Ipv6Ranges":null,"PrefixListIds":null,"ToPort":443,"UserIdGroupPairs":null}]}
    {"level":"info","ts":"2024-05-23T06:57:08Z","msg":"authorized securityGroup ingress","securityGroupID":"sg-035464079c9ab32d7"}
    {"level":"info","ts":"2024-05-23T06:57:08Z","logger":"controllers.service","msg":"creating targetGroup","stackID":"default/nginx-svc","resourceID":"default/nginx-svc:443"}
    {"level":"info","ts":"2024-05-23T06:57:08Z","logger":"controllers.service","msg":"created targetGroup","stackID":"default/nginx-svc","resourceID":"default/nginx-svc:443","arn":"arn:aws:elasticloadbalancing:ap-south-1:211125623171:targetgroup/k8s-default-nginxsvc-3b278deb2a/fee2c00c889d33a2"}
    {"level":"info","ts":"2024-05-23T06:57:08Z","logger":"controllers.service","msg":"creating loadBalancer","stackID":"default/nginx-svc","resourceID":"LoadBalancer"}
    {"level":"info","ts":"2024-05-23T06:57:08Z","logger":"controllers.service","msg":"created loadBalancer","stackID":"default/nginx-svc","resourceID":"LoadBalancer","arn":"arn:aws:elasticloadbalancing:ap-south-1:211125623171:loadbalancer/net/nginx-svc-balancer/150a966b036553c5"}
    {"level":"info","ts":"2024-05-23T06:57:09Z","logger":"controllers.service","msg":"creating listener","stackID":"default/nginx-svc","resourceID":"443"}
    {"level":"info","ts":"2024-05-23T06:57:09Z","logger":"controllers.service","msg":"created listener","stackID":"default/nginx-svc","resourceID":"443","arn":"arn:aws:elasticloadbalancing:ap-south-1:211125623171:listener/net/nginx-svc-balancer/150a966b036553c5/6a8c23515615520e"}
    {"level":"info","ts":"2024-05-23T06:57:09Z","logger":"controllers.service","msg":"creating targetGroupBinding","stackID":"default/nginx-svc","resourceID":"default/nginx-svc:443"}
    {"level":"info","ts":"2024-05-23T06:57:15Z","logger":"controllers.service","msg":"created targetGroupBinding","stackID":"default/nginx-svc","resourceID":"default/nginx-svc:443","targetGroupBinding":{"namespace":"default","name":"k8s-default-nginxsvc-3b278deb2a"}}
    {"level":"info","ts":"2024-05-23T06:57:15Z","logger":"controllers.service","msg":"successfully deployed model","service":{"namespace":"default","name":"nginx-svc"}}
    {"level":"error","ts":"2024-05-23T06:57:15Z","msg":"Reconciler error","controller":"targetGroupBinding","controllerGroup":"elbv2.k8s.aws","controllerKind":"TargetGroupBinding","TargetGroupBinding":{"name":"k8s-default-nginxsvc-3b278deb2a","namespace":"default"},"namespace":"default","name":"k8s-default-nginxsvc-3b278deb2a","reconcileID":"427f6644-5a81-40f5-ada2-d869f3420217","error":"providerID is not specified for node: ip-10-0-0-159.ap-south-1.compute.internal"}
  2. kubectl get nodes -o json | jq '.items[].spec': The providerID should be present here, but I believe kubeadm won't set it up by itself, so this might be the cause (see the patch sketch after this list). I think aws-load-balancer-controller should take vanilla k8s distros, i.e. kubeadm, into account.
    {
      "podCIDR": "69.96.1.0/24",
      "podCIDRs": [
        "69.96.1.0/24"
      ]
    }
    {
      "podCIDR": "69.96.0.0/24",
      "podCIDRs": [
        "69.96.0.0/24"
      ],
      "taints": [
        {
          "effect": "NoSchedule",
          "key": "node-role.kubernetes.io/control-plane"
        }
      ]
    }
  3. kubectl get nodes -o wide: The NAME is exactly the same as hostname on the nodes.
    NAME                                         STATUS   ROLES           AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                         KERNEL-VERSION         CONTAINER-RUNTIME
    ip-10-0-0-159.ap-south-1.compute.internal    Ready    worker          44m   v1.30.1   10.0.0.159    <none>        Debian GNU/Linux 12 (bookworm)   6.1.0-21-cloud-amd64   containerd://1.6.31
    ip-10-0-96-147.ap-south-1.compute.internal   Ready    control-plane   63m   v1.30.1   10.0.96.147   <none>        Debian GNU/Linux 12 (bookworm)   6.1.0-21-cloud-amd64   containerd://1.6.31
  4. kubeadm init --config ./cluster-config.yaml: The configuration I used to initialise the cluster.
    apiVersion: kubeadm.k8s.io/v1beta3
    kind: ClusterConfiguration
    kubernetesVersion: v1.30.1
    clusterName: "kubernetes"
    networking:
      podSubnet: "69.96.0.0/16"
      serviceSubnet: "69.97.0.0/16"
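
As a stopgap, setting the field by hand on each node might work while it is still empty (untested; the zone and instance ID are placeholders, and spec.providerID cannot be changed once set):

    kubectl patch node ip-10-0-0-159.ap-south-1.compute.internal \
      -p '{"spec":{"providerID":"aws:///ap-south-1a/i-0123456789abcdef0"}}'
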
is-it-ayush commented 4 months ago

Hey @oliviassss! I apologise for the mention, but are there any updates on this? A project of mine is currently blocked because I'm unable to integrate AWS with a self-managed k8s cluster set up by kubeadm, partly because of this issue. : )

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 4 days ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten