kubernetes / ingress-nginx

Ingress NGINX Controller for Kubernetes
https://kubernetes.github.io/ingress-nginx/
Apache License 2.0

AWS - Randomly unhealthy nodes in target groups #9990

Closed bjtox closed 1 month ago

bjtox commented 1 year ago

Hi, I'm trying to set up the ingress-nginx controller on my EKS installation, as part of moving to a fresh installation on AWS. I'm able to provision the NLB and the target group, but not all nodes pass the health check; they seem to fail randomly. Currently only 2 of the 5 available nodes in my cluster are healthy.

The issue is the same as this one: #8312

We moved our application from Kubernetes 1.22 to 1.26. We use chart version 4.6.1, and we expected all nodes to become healthy.

The node ports on the nodes seem to be unavailable for some reason I can't understand.

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):

NGINX Ingress controller Release: v1.7.1 Build: f48b03be54031491e78472bcf3aa026a81e1ffd3 Repository: https://github.com/kubernetes/ingress-nginx nginx version: nginx/1.21.6

Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"26+", GitVersion:"v1.26.4-eks-0a21954", GitCommit:"4a3479673cb6d9b63f1c69a67b57de30a4d9b781", GitTreeState:"clean", BuildDate:"2023-04-15T00:33:09Z", GoVersion:"go1.19.8", Compiler:"gc", Platform:"linux/amd64"}

Environment: QA

How to reproduce this issue:

Anything else we need to know: no other information is available

Thanks in advance, best regards.

k8s-ci-robot commented 1 year ago

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
longwuyuan commented 1 year ago

/remove-kind bug

Is this related? https://github.com/kubernetes/ingress-nginx/issues/9367

bjtox commented 1 year ago

Thanks for the reply @longwuyuan. The linked issue is different from mine: in my case the EC2 instances are registered in the target group but are unhealthy. I checked whether it was a network issue, but nodes in the same subnet had two different statuses (healthy and unhealthy).

longwuyuan commented 1 year ago

Please show the output of `kubectl -n ingress-nginx get svc -o yaml | grep -i aws`

bjtox commented 1 year ago

Here is the content:

      service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
      service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "20"
      service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: traffic-port
      service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: TCP
      service.beta.kubernetes.io/aws-load-balancer-healthcheck-timeout: "5"
      service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "3"
      service.beta.kubernetes.io/aws-load-balancer-type: nlb
      - hostname: a8e842bcf9d14473ea8460a067058c46-f7c4d42e3047f41b.elb.eu-south-1.amazonaws.com
bjtox commented 1 year ago

Is it possible to set externalTrafficPolicy to Local?

Just to add context, the problem is the same as the one reported in this post: https://stackoverflow.com/questions/61183167/kubernetes-issue-with-nodeport-connectivity
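For reference, externalTrafficPolicy is an ordinary field on the controller Service, so it can be set directly. A minimal sketch of the relevant spec, with metadata and selector names assumed from a default ingress-nginx chart install:

```yaml
# Minimal sketch of the controller Service; names are assumptions based on a
# default chart install. "Local" preserves client source IPs and routes only
# to nodes that actually run a controller pod.
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # default is Cluster
  selector:
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/component: controller
  ports:
    - name: http
      port: 80
      targetPort: http
    - name: https
      port: 443
      targetPort: https
```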

longwuyuan commented 1 year ago

I think there is a healthz-path-related annotation required. Can you check the docs?

bjtox commented 1 year ago

But a TCP health check doesn't have a path, am I wrong?

longwuyuan commented 1 year ago

I am not sure. I think I have seen some comment about a path. I am checking.

longwuyuan commented 1 year ago

Sorry, it was about AKS and not EKS

longwuyuan commented 1 year ago

If you can edit your issue description and improve it, maybe more useful data will be available for debugging.

bjtox commented 1 year ago

I'm not able to provide any more info. Something in Kubernetes seems to go down, so that the ports become unavailable on the host.

bjtox commented 1 year ago

@longwuyuan the issue is the same as the one reported here: https://github.com/kubernetes/ingress-nginx/issues/8312

longwuyuan commented 1 year ago

I am wondering if this is related: https://github.com/kubernetes/ingress-nginx/issues/9367

sebastienrospars commented 1 year ago

Hi, I have the same problem: sometimes I have 0 healthy nodes in the target group, and a few minutes later one or two nodes are up. Have you found a solution, or do you still have this problem @bjtox? Thanks

minhhieu76qng commented 1 year ago

@sebastienrospars Yeah, I faced the same problem. I had installed ingress-nginx with the Helm chart. I then tried installing from the install.yaml manifest in the documentation instead, and that worked. So I compared the Helm chart values against the manifest and found that the chart does not configure externalTrafficPolicy, so it gets the default value (Cluster), while the manifest sets it to Local. I added controller.service.externalTrafficPolicy: Local to the chart's values.yaml (see the sketch below), and the problem is fixed now.

I have no idea about the difference; is it a mistake? @longwuyuan
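A minimal values.yaml sketch of that fix, assuming the official ingress-nginx Helm chart (controller.service.externalTrafficPolicy is the chart key that maps onto the Service's spec.externalTrafficPolicy):

```yaml
# values.yaml — minimal sketch; assumes the official ingress-nginx Helm chart
controller:
  service:
    externalTrafficPolicy: Local   # chart default is Cluster
```

Applied with something like `helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx -n ingress-nginx -f values.yaml`.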

tudor-pop-mimedia commented 2 months ago

This didn't fix my problem. However, I think the solution can be found here. The explanation seems to hold for both Cluster and Local: if I don't have an ingress-nginx pod running on a node, then that node is out of service in the NLB.
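For background on that behavior, a sketch of the mechanism (field values illustrative): with externalTrafficPolicy: Local, Kubernetes allocates a healthCheckNodePort on the Service, and kube-proxy reports healthy on it only on nodes that run a local controller pod, so every other node is marked unhealthy by design.

```yaml
# Illustrative excerpt of a Service with externalTrafficPolicy: Local.
# healthCheckNodePort is allocated automatically; the value below is made up.
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  healthCheckNodePort: 32456   # only nodes with a local pod answer as healthy
```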

dmitry-medvedev1 commented 1 month ago

Hello everyone. I faced the same situation: only one node in the target group (behind the AWS load balancer) is healthy, and all the others are unhealthy. I used this yaml to be able to deploy the AWS load balancer.

My question is: why is ingress-nginx-controller defined as a Deployment by default, rather than a DaemonSet? Isn't that a single point of failure when there is more than one node in the cluster?
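If the goal is one controller pod per node so that every node passes the NLB health check, the chart can also run the controller as a DaemonSet. A minimal values sketch, assuming the official ingress-nginx chart:

```yaml
# values.yaml sketch — schedule one controller pod on every node
controller:
  kind: DaemonSet   # chart default is Deployment
```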

longwuyuan commented 1 month ago

Read the AWS Load Balancer Controller docs and try this:

The ingress-nginx-controller helm chart is a generic install out of the box. The default set of helm values is not configured for installation on any particular infra provider. The annotations applicable to the cloud provider must be customized by the users.
See [AWS LB Controller](https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.2/guide/service/annotations/).
Examples of some annotations needed for a Service of --type LoadBalancer on AWS are below:

  annotations:
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: tcp
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    service.beta.kubernetes.io/aws-load-balancer-manage-backend-security-group-rules: "true"
    service.beta.kubernetes.io/aws-load-balancer-access-log-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-security-groups: "sg-something1 sg-something2"
    service.beta.kubernetes.io/aws-load-balancer-access-log-s3-bucket-name: "somebucket"
    service.beta.kubernetes.io/aws-load-balancer-access-log-s3-bucket-prefix: "ingress-nginx"
    service.beta.kubernetes.io/aws-load-balancer-access-log-emit-interval: "5"
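Note the service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip" annotation above: it requires the AWS Load Balancer Controller and makes the NLB register pod IPs directly, so per-node NodePort health checks no longer apply. A sketch of supplying such annotations through the chart (controller.service.annotations is the standard chart key; the subset of annotations shown is illustrative):

```yaml
# values.yaml sketch — annotations pass through to the controller Service
controller:
  service:
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-type: nlb
      service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
      service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
```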

/close

k8s-ci-robot commented 1 month ago

@longwuyuan: Closing this issue.

In response to [this](https://github.com/kubernetes/ingress-nginx/issues/9990#issuecomment-2345842536) (the comment quoted above).

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
longwuyuan commented 1 month ago

https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.2/guide/service/annotations/