kubernetes / ingress-nginx

Ingress NGINX Controller for Kubernetes
https://kubernetes.github.io/ingress-nginx/
Apache License 2.0

ingress-nginx-controller 1.5.1 is not registering new nodes on ELB #9367

Closed johnjeffers closed 1 year ago

johnjeffers commented 1 year ago

What happened:

After upgrading to ingress-nginx 1.5.1, EC2 instance nodes are no longer registered with the AWS load balancer when using a Classic ELB. As a result, the load balancer can end up with no registered instances, and ingresses using the ingress controller return 503 Service Unavailable: Back-end server is at capacity.

If you manually update seemingly any value in the ingress-nginx-controller Service, this triggers the ELB to update, but if you don't, it will never reconcile the node changes.
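A minimal sketch of that kind of dummy update, assuming the chart's default Service name and namespace; the annotation key is purely illustrative:

```bash
# Hypothetical example: touching the Service (here with a throwaway annotation)
# nudges the cloud controller into reconciling the ELB's registered instances.
kubectl -n ingress-nginx annotate service ingress-nginx-controller \
  last-forced-resync="$(date +%s)" --overwrite
```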

What you expected to happen:

Instances are registered to the load balancer as necessary and ingresses work normally.

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):

NGINX Ingress controller
  Release:       v1.5.1
  Build:         d003aae913cc25f375deb74f898c7f3c65c06f05
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: nginx/1.21.6

Kubernetes version (use kubectl version):

1.24

Environment:

How was the ingress-nginx-controller installed:

Customized helm chart

How to reproduce this issue:

This problem is specific to AWS load balancers and won't be reproducible in minikube/kind.

Anything else we need to know:

k8s-ci-robot commented 1 year ago

@johnjeffers: This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
johnjeffers commented 1 year ago

It's possible this may be an EKS/Kubernetes issue since it seems to be more related to the LoadBalancer service type than anything to do with the actual ingress controller. I've opened a support case and will hopefully report back soon with more info.

mtkk-hara commented 1 year ago

Even with ingress-nginx-controller 1.4.0, I face the same error.

I also tried downgrading to the following versions, but the result did not change:

- ingress-nginx-controller 1.3.0
- ingress-nginx-controller 1.2.0

Kubernetes version (use kubectl version):

1.24

Environment:

johnjeffers commented 1 year ago

This appears to be a regression in Kubernetes 1.24. When externalTrafficPolicy is set to Local, new nodes will not be automatically registered with the ELB. It doesn't matter what version of the ingress controller you're using, because it's a problem with the in-tree LoadBalancer service type. If you change externalTrafficPolicy to Cluster, it works as expected.

It's worth noting that the nlb-with-tls-termination manifest sets externalTrafficPolicy: Local.

I suspect this is only a problem with Classic ELBs, but I haven't tested other load balancer types.
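A minimal sketch of that change via the Helm chart values (the value path `controller.service.externalTrafficPolicy` is assumed from the chart's conventions; a direct `kubectl patch` of the Service works as well):

```yaml
# values.yaml sketch -- switching the Service from Local to Cluster restores
# node registration on the ELB, at the cost of losing the original client
# source IP seen by nginx.
controller:
  service:
    externalTrafficPolicy: Cluster
```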

strongjz commented 1 year ago

Do you know what version of the cloud controller EKS 1.24 deploys?

AWS docs do recommend NLB:

We recommend that you use version 2.4.4 or later of the AWS Load Balancer Controller instead of the AWS cloud provider load balancer controller. The AWS Load Balancer Controller creates AWS Network Load Balancers, but doesn't create AWS Classic Load Balancers.

https://docs.aws.amazon.com/eks/latest/userguide/network-load-balancing.html
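For context, the in-tree provider chooses the ELB flavour from a Service annotation; a hedged sketch (only the annotation matters here, the rest is illustrative):

```yaml
# Sketch: with the legacy in-tree AWS cloud provider, this annotation asks for
# an NLB instead of the default Classic ELB when the Service is created.
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
spec:
  type: LoadBalancer
```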

I'll investigate this further as well.

/assign @strongjz

johnjeffers commented 1 year ago

@strongjz thank you!

I would like to convert to an NLB but unfortunately I ran into a different issue with that. All of my ingresses have the nginx.ingress.kubernetes.io/force-ssl-redirect: "true" annotation set, and while this works with an AWS classic ELB, when I switch to an NLB it causes a redirect loop. So it seems my only zero-downtime option is to set up another instance of the ingress controller, using a different IngressClass, and migrate all my apps over.
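For illustration, the kind of Ingress in question (names are made up); the loop presumably happens because, once the NLB terminates TLS, the controller only ever sees plain HTTP and keeps redirecting:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-app          # illustrative name
  annotations:
    # Works behind a Classic ELB; behind an NLB that terminates TLS the
    # redirect to https comes back as plain HTTP again, so it loops.
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
spec:
  ingressClassName: nginx
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-app
                port:
                  number: 80
```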

Do you know what version of the cloud controller EKS 1.24 deploys?

I don't, sorry. I still have an AWS support ticket open, so I will ask.

DekusDenial commented 1 year ago

See https://github.com/kubernetes/ingress-nginx/issues/9041#issuecomment-1252255678

johnjeffers commented 1 year ago

@strongjz AWS support got back to me and said:

the version of cloud controller being used in EKSv1.24 is cloud-controller-manager v1.24.2.

DekusDenial commented 1 year ago

@johnjeffers did they also say when the fix for the in-tree cloud controller is going to roll out? It's presumably a US holiday now and I don't think they are gonna have it fixed until next year, hopefully before February. The EKS platform release cadence is usually about every month. If they don't get it in by January, I'm seriously gonna consider moving to the AWS Load Balancer Controller for reconciliation.

johnjeffers commented 1 year ago

@johnjeffers did they also say when the fix for the in-tree cloud controller is going to roll out?

No, you can never pin them down on something like that. I know they're aware that this is a big problem, but no idea how fast they will move on this.

DekusDenial commented 1 year ago

@johnjeffers did they also say when the fix for the in-tree cloud controller is going to roll out?

No, you can never pin them down on something like that. I know they're aware that this is a big problem, but no idea how fast they will move on this.

Talked to the TAM there at AWS.

TLDR; They are gonna work with the EKS service team about prioritizing this before EOW. My understanding is that it’s holiday, people are out, don’t expect it to land any time soon. lol

MonkzCode commented 1 year ago

@johnjeffers did they also say when the fix for the in-tree cloud controller is going to roll out?

No, you can never pin them down on something like that. I know they're aware that this is a big problem, but no idea how fast they will move on this.

Talked to the TAM there at AWS.

TLDR; They are gonna work with the EKS service team about prioritizing this before EOW. My understanding is that it’s holiday, people are out, don’t expect it to land any time soon. lol

Guys, do you have any updates on this?

MonkzCode commented 1 year ago

@DekusDenial sorry for the trouble, but do you have any news?

DekusDenial commented 1 year ago

@DekusDenial sorry for the trouble, but do you have any news?

The TAM is gonna give me an update from the internal team sometime this week.

sorind-broadsign commented 1 year ago

Hello guys, I'm running into the same problem with our setup. Is there any update on this? Thank you!

DekusDenial commented 1 year ago

I gave up and am gonna switch to the ALB controller as the out-of-tree controller manager.

joshuaganger commented 1 year ago

This does appear to affect NLB as well. I'm running into this issue on EKS v1.24.7 with ingress-nginx 1.5.1 deploying an NLB. Initial deployment works, but when externalTrafficPolicy is set to Local any time a node is replaced it never gets registered in the target group. Setting externalTrafficPolicy to Cluster causes the nodes to be correctly registered, but the nginx.ingress.kubernetes.io/whitelist-source-range annotation no longer works as expected.

MonkzCode commented 1 year ago

This does appear to affect NLB as well. I'm running into this issue on EKS v1.24.7 with ingress-nginx 1.5.1 deploying an NLB. Initial deployment works, but when externalTrafficPolicy is set to Local any time a node is replaced it never gets registered in the target group. Setting externalTrafficPolicy to Cluster causes the nodes to be correctly registered, but the nginx.ingress.kubernetes.io/whitelist-source-range annotation no longer works as expected.

Maybe you can use loadBalancerSourceRanges?
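A sketch of that suggestion (CIDRs are illustrative); note it is set on the controller's Service spec, so it restricts the whole load balancer rather than a single Ingress:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
spec:
  type: LoadBalancer
  externalTrafficPolicy: Cluster
  # Only these CIDRs may reach the load balancer at all -- a cluster-wide
  # allow-list, unlike the per-Ingress whitelist-source-range annotation.
  loadBalancerSourceRanges:
    - 203.0.113.0/24
    - 10.0.0.0/8
```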

joshuaganger commented 1 year ago

Maybe you can use loadBalancerSourceRanges?

Correct me if I'm wrong, but this would be applied on the ingress-nginx Service? That would be a cluster-wide allow-list, but I need this functionality configured per Ingress.

Also, one important update: I decided to remove ingress-nginx and implement the Traefik ingress controller. Traefik shows the same behavior. If I set externalTrafficPolicy: Local, nodes no longer register in the target group after being replaced. This leads me to believe it's not actually an ingress-nginx issue. I've found scattered discussion about this on various forums, email lists, etc., but haven't come across an explanation or resolution.

DekusDenial commented 1 year ago

See #9041 (comment)

I gave up and am gonna switch to the ALB controller as the out-of-tree controller manager.

Long story short, the AWS out-of-tree cloud controller manager introduced a change which caused the misbehavior you all see reported here. The fix was merged and tagged for v1.26, which corresponds to k8s 1.26, and I don't think they backport the fix to versions < 1.26. EKS 1.26 won't be released for at least another quarter or so, which leaves us all hanging, looking for a workaround or alternative. My workaround is to install the AWS Load Balancer Controller and use it only as the external cloud controller manager for the ingress-nginx load balancer/ASG/target groups. But doing so will not preserve the existing LB, so a DNS change is expected as a result. Of course, you can switch to the AWS ALB controller entirely for everything ingress-related and for exposing services to ELBs.

chenshap commented 1 year ago

Has anyone found a workaround (not moving to the AWS LB controller)? This is a huge change...

DekusDenial commented 1 year ago

Has anyone found a workaround?

  1. Don’t upgrade to the latest eks platform version on 1.23
  2. Don’t upgrade beyond 1.23 and before 1.26
  3. Wait for 1.26
  4. Write a piece of controller that watches for new node in cluster and add/refresh it in all target groups associated with the nlb of your interest. This is kinda mimicking the legacy cloud controller code.
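A very rough sketch of option 4 as a one-shot script rather than a real controller; the target group ARN is a placeholder you'd have to look up, and `register-targets` is harmless for instances that are already registered:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Placeholder: the target group(s) that the NLB for the ingress-nginx Service
# forwards to.
TARGET_GROUP_ARNS=("arn:aws:elasticloadbalancing:us-east-1:111111111111:targetgroup/example/abc123")

# EC2 instance IDs of every node currently in the cluster, taken from each
# node's providerID (aws:///<az>/<instance-id>).
instance_ids=$(kubectl get nodes \
  -o jsonpath='{range .items[*]}{.spec.providerID}{"\n"}{end}' | awk -F/ '{print $NF}')

for tg in "${TARGET_GROUP_ARNS[@]}"; do
  for id in $instance_ids; do
    aws elbv2 register-targets --target-group-arn "$tg" --targets "Id=$id"
  done
done
```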

(not moving to the AWS LB controller)? This is a huge change...

Not really a huge change: you just install it via Helm/manifest and change a couple of annotations on the nginx controller Service (a sketch follows below). But you do need to update any DNS reference to the old NLB. Downtime can be avoided, but pay attention to existing resources like security groups, which could become stale.
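A hedged sketch of what that annotation change looks like (annotation names as documented for the AWS Load Balancer Controller; double-check them against the controller version you install):

```yaml
# Sketch: these annotations ask the AWS Load Balancer Controller (rather than
# the legacy cloud provider) to reconcile this Service as an NLB. A new NLB is
# created, hence the DNS change mentioned above.
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: instance
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
spec:
  type: LoadBalancer
```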

dictcp commented 1 year ago

We have recently submitted an AWS support ticket on this issue as well. Their support offered switching to an NLB (via the ALB ingress controller) as a workaround. They mentioned that the EKS team is aware of it, but there is no ETA on the bugfix (and it is already Feb, 2+ months from the original discussion).

--

and per https://docs.aws.amazon.com/eks/latest/userguide/network-load-balancing.html

When you create a Kubernetes Service of type LoadBalancer, the AWS cloud provider load balancer controller creates AWS Classic Load Balancers by default, but can also create AWS Network Load Balancers. This controller is only receiving critical bug fixes in the future.

It seems like the EKS team does not consider it is a critical bug 🤔

DekusDenial commented 1 year ago

It seems like the EKS team does not consider it is a critical bug 🤔

That’s the legacy cloud controller.

They mentioned the EKS are aware of it, but no ETA on the bugfix

I know both the TAM and the manager behind the team, but I still have no first-hand info on why the fix won't come for < 1.26. To me it's just a cherry-pick and cutting a new release, since EKS auto-rolls the control plane platform version in most cases.

joshuaganger commented 1 year ago

but still have no first hand info on why the fix won’t come for <1.26

Was the fix for this released in 1.25 after all? I haven't done extensive testing, but after upgrading my cluster to 1.25 I'm now seeing nodes correctly registered in Target Groups on the NLB. I understand the underlying bug was in the Cloud Controller Manager and not ingress-nginx itself. I've tested the behavior with both ingress-nginx and Traefik.

DekusDenial commented 1 year ago

Was the fix for this released in 1.25 after all?

Certainly looks like the case, after a whole quarter has gone by. Can you check the cloud controller manager version on your 1.25 cluster? I wonder if the EKS team actually forked off the upstream repo and applied the fix there. You can clearly see the fix included in the 1.26.0 release but not before that. I don't see any new platform version for 1.24 though, so I guess the quick fix for people is to upgrade to 1.25.

joshuaganger commented 1 year ago

Can you check the cloud controller manager version on your 1.25 cluster?

I don't think that info is readily available. I see another comment earlier that mentions needing to get the answer from AWS support. I'll see if I can get that info and report back.

DekusDenial commented 1 year ago

I don't think that info is readily available.

It only shows up in the very first log line of the cloud controller manager log stream, if you have that control plane log type enabled for the cluster. You only see it once unless its pods on the master nodes get recreated somehow.
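A sketch of pulling that first log line with the AWS CLI; the log group name follows the standard /aws/eks/&lt;cluster&gt;/cluster pattern, and the stream prefix is an assumption, so list the streams first if it doesn't match:

```bash
# Grep the start-up line that prints the cloud controller manager version
# (cluster name and stream prefix are illustrative).
aws logs filter-log-events \
  --log-group-name "/aws/eks/my-cluster/cluster" \
  --log-stream-name-prefix "cloud-controller-manager" \
  --filter-pattern "version" \
  --max-items 20
```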

We can close this issue as people have already moved on. @johnjeffers

romapi commented 1 year ago

Seems like this problem has been fixed somewhere else. We tried v1.24 at the end of 2022 and it did not register any nodes in the target group. Today it worked with v1.24. At first I thought maybe AWS had applied fixes in add-ons (e.g. aws-vpc-cni), but I could not find any evidence of that in the changelogs.

However, comparing Kubernetes Cloud Provider AWS' releases 1.24.0 and latest 1.24.4 (https://github.com/kubernetes/cloud-provider-aws/compare/v1.24.0...v1.24.4) there is one commit regarding this issue: https://github.com/kubernetes/cloud-provider-aws/commit/16d121e0d7da1358a773b8c0c3de7a596a7343b0. So you are right @joshuaganger

I would recommend that all those who have postponed the v1.24 upgrade because of this issue and are waiting for an update or status change here (like me) go and try again.
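One quick way to verify after retrying (the ARN is a placeholder; look it up from the NLB/ELB that the ingress-nginx Service created):

```bash
# Confirm that worker nodes are actually registered and healthy again.
aws elbv2 describe-target-health \
  --target-group-arn "arn:aws:elasticloadbalancing:us-east-1:111111111111:targetgroup/example/abc123" \
  --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State]' \
  --output table
```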

ctradar commented 1 year ago

I am also experiencing the same issue on EKS 1.22 with an NLB.

Nginx Ingress Controller version v1.4.0

@johnjeffers you mentioned a workaround (below). I tried updating the k8s Service and the Deployment with a dummy update, but that did not trigger node registration with the NLB's target group. Can you please specify what you did to trigger the ELB update?

If you manually update seemingly any value in the ingress-nginx-controller Service, this triggers the ELB to update, but if you don't, it will never reconcile the node changes.

DekusDenial commented 1 year ago

@ctradar upgrade your EKS to 1.24 or later, because AWS only backported the solution as far back as 1.24. EKS 1.24+ runs an out-of-tree AWS cloud provider that already has the fix. That's your only no-brainer way, plus 1.22 is being deprecated in June. Read the comments above; it's been resolved. @strongjz do you mind closing this issue?
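For reference, a sketch of that upgrade path with the AWS CLI (cluster and node group names are placeholders; EKS only allows one minor version hop at a time):

```bash
# Upgrade the control plane one minor version, then the managed node group(s).
aws eks update-cluster-version --name my-cluster --kubernetes-version 1.23
aws eks update-nodegroup-version --cluster-name my-cluster --nodegroup-name my-nodes
# ...then repeat for 1.24 once the 1.23 upgrade has finished.
```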

ctradar commented 1 year ago

@DekusDenial I get that but need a quick fix, if available, for the short term until we are able to plan an upgrade to EKS 1.24 without any impact to our services.

DekusDenial commented 1 year ago

@ctradar there's no quick fix for this unless you are willing to add all your nodes back to the target groups every time those ingress-nginx controller pods change. Another, less desirable fix is to move to the AWS ALB controller, but you will lose the original LB associated with the Service and need to switch your DNS to the new LB it creates.

ctradar commented 1 year ago

@DekusDenial I see! So this occurs when the nginx controller pods are re-created, i.e. when they move to a new node, etc.

We run spot instances, so these are ephemeral and the nginx controller pods are eventually re-created when they move to a new node. Gotcha. Thanks!

DekusDenial commented 1 year ago

@ctradar correct. In your case, you could tap into EC2 lifecycle events and do your own hack there, assuming you have no massive eviction of pods or termination of spot instances. Upgrading to 1.23 and 1.24 is really simple; you just need to prepare for API deprecations and, mainly, containerd.

johnjeffers commented 1 year ago

I agree that upgrading is the right call here, but I wouldn’t say it’s simple. The switch to containerd will probably break things, and it definitely makes troubleshooting locally on the nodes more difficult if you’re used to Docker commands. (I’m aware of nerdctl but afaik that’s not included by default on EKS AMIs.)

rikatz commented 1 year ago

Can we close this issue? I don't think it is related to ingress-nginx and more on how AWS is doing things, right? I will be closing it, feel free to reopen or ping me on Slack if this is not the case, and thanks for all the investigation here :)

/close

k8s-ci-robot commented 1 year ago

@rikatz: Closing this issue.

In response to [this](https://github.com/kubernetes/ingress-nginx/issues/9367#issuecomment-1612917511):

> Can we close this issue? I don't think it is related to ingress-nginx and more on how AWS is doing things, right?
> I will be closing it, feel free to reopen or ping me on Slack if this is not the case, and thanks for all the investigation here :)
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.