kubernetes / ingress-nginx

Ingress NGINX Controller for Kubernetes
https://kubernetes.github.io/ingress-nginx/
Apache License 2.0

AWS - Randomly unhealthy nodes in target groups #9990

Closed bjtox closed 1 month ago

bjtox commented 1 year ago

Hi, I'm trying to set up the ingress-nginx controller on my EKS installation, as part of moving to a fresh installation on AWS. I'm able to provision the NLB and the target group, but not all nodes pass the health check; they seem to fail randomly. Currently only 2 of the 5 available nodes in my cluster are healthy.

The issue is the same as this one: #8312

We moved our application from Kubernetes 1.22 to 1.26. We use chart version 4.6.1, and we expected all nodes to become healthy.

The node ports on the nodes seem to be unavailable for some reason I can't understand.

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):

NGINX Ingress controller Release: v1.7.1 Build: f48b03be54031491e78472bcf3aa026a81e1ffd3 Repository: https://github.com/kubernetes/ingress-nginx nginx version: nginx/1.21.6

Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"26+", GitVersion:"v1.26.4-eks-0a21954", GitCommit:"4a3479673cb6d9b63f1c69a67b57de30a4d9b781", GitTreeState:"clean", BuildDate:"2023-04-15T00:33:09Z", GoVersion:"go1.19.8", Compiler:"gc", Platform:"linux/amd64"}

Environment: QA

How to reproduce this issue:

Anything else we need to know: no other information is available

Thanks in advance, best regards.

k8s-ci-robot commented 1 year ago

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
longwuyuan commented 1 year ago

/remove-kind bug

Is this related? https://github.com/kubernetes/ingress-nginx/issues/9367

bjtox commented 1 year ago

Thanks for the reply @longwuyuan. The linked issue is different from mine: in my case the EC2 instances are registered in the target group but are unhealthy. I checked whether it was a network issue, but nodes in the same subnet had two different statuses (healthy and unhealthy).

longwuyuan commented 1 year ago

Please show the output of `kubectl -n ingress-nginx get svc -o yaml | grep -i aws`

bjtox commented 1 year ago

Here is the content:

      service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
      service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "20"
      service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: traffic-port
      service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: TCP
      service.beta.kubernetes.io/aws-load-balancer-healthcheck-timeout: "5"
      service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "3"
      service.beta.kubernetes.io/aws-load-balancer-type: nlb
      - hostname: a8e842bcf9d14473ea8460a067058c46-f7c4d42e3047f41b.elb.eu-south-1.amazonaws.com
bjtox commented 1 year ago

Is it possible to set externalTrafficPolicy to Local?

Just to add context, the problem is the same as the one reported in this post: https://stackoverflow.com/questions/61183167/kubernetes-issue-with-nodeport-connectivity
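For reference, externalTrafficPolicy is an ordinary field on the controller Service, so it can be set directly. A minimal sketch of the relevant spec, with metadata and selector names assumed from a default ingress-nginx chart install:

```yaml
# Minimal sketch of the controller Service; names are assumptions based on a
# default chart install. "Local" preserves client source IPs and routes only
# to nodes that actually run a controller pod.
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # default is Cluster
  selector:
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/component: controller
  ports:
    - name: http
      port: 80
      targetPort: http
    - name: https
      port: 443
      targetPort: https
```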

longwuyuan commented 1 year ago

I think there is a healthz-path-related annotation required. Can you check the docs?

bjtox commented 1 year ago

But a TCP health check doesn't have a path, am I wrong?

longwuyuan commented 1 year ago

I am not sure. I think I have seen some comment about a path. I am checking.

longwuyuan commented 1 year ago

Sorry, it was about AKS and not EKS

longwuyuan commented 1 year ago

If you can edit your issue description and improve it, maybe more useful data will be available for debugging.

bjtox commented 1 year ago

I'm not able to provide any more info. Something in Kubernetes seems to go down, so that the ports become unavailable on the host.

bjtox commented 1 year ago

@longwuyuan the issue is the same as the one reported here: https://github.com/kubernetes/ingress-nginx/issues/8312

longwuyuan commented 1 year ago

I am wondering if this is related: https://github.com/kubernetes/ingress-nginx/issues/9367

sebastienrospars commented 1 year ago

Hi, I have the same problem: sometimes I have 0 healthy nodes in the target group, and a few minutes later one or two nodes are up. Have you found a solution, or do you still have this problem @bjtox? Thanks

minhhieu76qng commented 1 year ago

@sebastienrospars Yeah, I faced the same problem. I had installed ingress-nginx with the Helm chart. I then tried installing from the install.yaml manifest in the documentation instead, and that worked. So I compared the Helm chart values against the manifest and found that the chart does not configure externalTrafficPolicy, so it gets the default value (Cluster), while the manifest sets it to Local. I added controller.service.externalTrafficPolicy: Local to the chart's values.yaml (see the sketch below), and the problem is fixed now.

I have no idea about the difference; is it a mistake? @longwuyuan
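A minimal values.yaml sketch of that fix, assuming the official ingress-nginx Helm chart (controller.service.externalTrafficPolicy is the chart key that maps onto the Service's spec.externalTrafficPolicy):

```yaml
# values.yaml — minimal sketch; assumes the official ingress-nginx Helm chart
controller:
  service:
    externalTrafficPolicy: Local   # chart default is Cluster
```

Applied with something like `helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx -n ingress-nginx -f values.yaml`.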

tudor-pop-mimedia commented 2 months ago

This didn't fix my problem. However, I think the solution can be found here. The explanation seems to hold for both Cluster and Local: if I don't have an ingress-nginx pod running on a node, then that node is out of service in the NLB.
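For background on that behavior, a sketch of the mechanism (field values illustrative): with externalTrafficPolicy: Local, Kubernetes allocates a healthCheckNodePort on the Service, and kube-proxy reports healthy on it only on nodes that run a local controller pod, so every other node is marked unhealthy by design.

```yaml
# Illustrative excerpt of a Service with externalTrafficPolicy: Local.
# healthCheckNodePort is allocated automatically; the value below is made up.
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  healthCheckNodePort: 32456   # only nodes with a local pod answer as healthy
```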

dmitry-medvedev1 commented 1 month ago

Hello everyone. I faced the same situation: only one node in the target group (behind the AWS load balancer) is healthy, and all the others are unhealthy. I used this yaml to be able to deploy the AWS load balancer.

My question is: why is ingress-nginx-controller defined as a Deployment by default, rather than a DaemonSet? Isn't that a single point of failure when there is more than one node in the cluster?
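If the goal is one controller pod per node so that every node passes the NLB health check, the chart can also run the controller as a DaemonSet. A minimal values sketch, assuming the official ingress-nginx chart:

```yaml
# values.yaml sketch — schedule one controller pod on every node
controller:
  kind: DaemonSet   # chart default is Deployment
```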

longwuyuan commented 1 month ago

Read the AWS Load Balancer Controller docs and try this:

The ingress-nginx-controller helm chart is a generic install out of the box. The default set of helm values is not configured for installation on any particular infra provider. The annotations applicable to the cloud provider must be customized by the users.
See [AWS LB Controller](https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.2/guide/service/annotations/).
Examples of some annotations needed for a Service of --type LoadBalancer on AWS are below:

  annotations:
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: tcp
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    service.beta.kubernetes.io/aws-load-balancer-manage-backend-security-group-rules: "true"
    service.beta.kubernetes.io/aws-load-balancer-access-log-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-security-groups: "sg-something1 sg-something2"
    service.beta.kubernetes.io/aws-load-balancer-access-log-s3-bucket-name: "somebucket"
    service.beta.kubernetes.io/aws-load-balancer-access-log-s3-bucket-prefix: "ingress-nginx"
    service.beta.kubernetes.io/aws-load-balancer-access-log-emit-interval: "5"
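Note the service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip" annotation above: it requires the AWS Load Balancer Controller and makes the NLB register pod IPs directly, so per-node NodePort health checks no longer apply. A sketch of supplying such annotations through the chart (controller.service.annotations is the standard chart key; the subset of annotations shown is illustrative):

```yaml
# values.yaml sketch — annotations pass through to the controller Service
controller:
  service:
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-type: nlb
      service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
      service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
```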

/close

k8s-ci-robot commented 1 month ago

@longwuyuan: Closing this issue.

In response to [this](https://github.com/kubernetes/ingress-nginx/issues/9990#issuecomment-2345842536) (the comment quoted above).

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
longwuyuan commented 1 month ago

https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.2/guide/service/annotations/