kubernetes / ingress-nginx

Ingress NGINX Controller for Kubernetes
https://kubernetes.github.io/ingress-nginx/
Apache License 2.0
17.47k stars · 8.25k forks

upstream timed out (110: Operation timed out) while .. #9443

Closed fred-aa closed 1 month ago

fred-aa commented 1 year ago

Error: `upstream timed out (110: Operation timed out) while connecting to upstream` and `upstream timed out (110: Operation timed out) while reading response header from upstream`
It happens randomly (every 2-3 minutes or so) for all applications at the same time. I could work around the timeouts using annotations.

But why is it inconsistent? Sometimes it takes up to 2-3 minutes to get a response from the servers. Some of my servers are simple hello-world applications running the "nginx" container image.

Environment:

AKS cluster with 1 master and 1 worker node (2 CPU + 4 GB memory).
Ubuntu 18.04.6 LTS
Tested with both:
    - NGINX Controller v1.5.1, Helm chart version 4.4.4
    - NGINX Controller v1.2.1, Helm chart version 4.1.3

There is plenty of memory and CPU available. Increasing the number of application replicas does not solve the issue.
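For context, the timeout-workaround annotations mentioned above would be along these lines (a sketch only; the exact values used in this issue were not posted, and the host/service names are placeholders):

```yaml
# Illustrative only — raises the nginx proxy timeouts for one Ingress.
# Annotation names are the standard ingress-nginx ones; values are examples.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hello-world
  annotations:
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "30"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "120"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "120"
spec:
  ingressClassName: nginx
  rules:
    - host: hello.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: hello-world
                port:
                  number: 80
```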

k8s-ci-robot commented 1 year ago

@fred-aa: This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
naxo8628 commented 1 year ago

+1, we are facing the same problem here: lots of `upstream timed out (110: Operation timed out) while connecting to upstream` and `upstream timed out (110: Operation timed out) while reading response header from upstream` errors, starting about 10 hours ago.

@fred-aa which annotations did you use?

longwuyuan commented 1 year ago

/remove-kind bug

I could not reproduce this, so there must be some load-, traffic-, or config-specific factor causing it. Please provide a step-by-step procedure that lets others reproduce the problem at will on a minikube or kind cluster. Please make it simple enough to copy/paste, covering the entire effort end to end: creating the cluster, creating the workloads, and sending the requests.
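As a reference for the kind of end-to-end procedure being asked for, a minimal skeleton might look like the following (a sketch only: the workload and hostname are placeholders, the kind port-mapping setup needed to reach localhost is omitted, and these steps do not by themselves trigger the timeout):

```shell
# Sketch of an end-to-end reproduce procedure on a kind cluster.
kind create cluster --name ingress-test
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/kind/deploy.yaml
kubectl wait --namespace ingress-nginx \
  --for=condition=ready pod \
  --selector=app.kubernetes.io/component=controller \
  --timeout=120s

# Deploy a trivial backend and expose it through an Ingress.
kubectl create deployment hello --image=nginx
kubectl expose deployment hello --port=80
kubectl create ingress hello --class=nginx --rule="hello.localdev/*=hello:80"

# Generate requests and watch for 504s / upstream timeouts.
for i in $(seq 1 100); do
  curl -s -o /dev/null -w "%{http_code}\n" -H "Host: hello.localdev" http://localhost/
done
kubectl logs -n ingress-nginx -l app.kubernetes.io/component=controller --tail=50
```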

naxo8628 commented 1 year ago

I installed ingress-nginx through Helm, with these values:

helm upgrade --install ingress-nginx ingress-nginx \
  --repo https://kubernetes.github.io/ingress-nginx \
  --set controller.config.disable-access-log="true" \
  --set controller.config.error-log-level="warn" \
  --set controller.admissionWebhooks.enabled="false" \
  --set controller.autoscaling.enabled="true" \
  --namespace ingress-nginx --create-namespace

This gives me 11 replicas of the nginx controller. We had no errors since deploying a month ago, but now every request gets `upstream timed out (110: Operation timed out) while reading response header from upstream` (50890 occurrences so far).

longwuyuan commented 1 year ago

Since the error is a timeout and there are annotations configured for timeout values, try removing the timeout-related annotations and update the status here. Just saying "timeout" is not enough to make progress, so please add the logs of the controller pod, the logs of the application pod, etc., captured at the time of the error. Until there is more data, it is hard to assume there is a problem in the controller.
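Concretely, the kind of state being asked for can be captured with commands along these lines (the ingress-nginx namespace and label selectors are the defaults for a Helm install; `<app-namespace>`, `<app-pod>`, and `<app-service>` are placeholders):

```shell
# Controller-side state at the time of the timeout.
kubectl -n ingress-nginx get pods -o wide
kubectl -n ingress-nginx logs -l app.kubernetes.io/component=controller --tail=200
kubectl -n ingress-nginx describe pod -l app.kubernetes.io/component=controller

# Application-side state: pods, logs, and whether the Service has endpoints.
kubectl -n <app-namespace> get pods -o wide
kubectl -n <app-namespace> logs <app-pod> --tail=200
kubectl -n <app-namespace> get endpoints <app-service>

# Node and pod resource pressure (requires metrics-server).
kubectl top nodes
kubectl top pods -A --sort-by=cpu
```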

fred-aa commented 1 year ago

@longwuyuan Without the timeout annotations I get logs similar to those shown in this issue. I cannot see any issues with communication between the nodes.

longwuyuan commented 1 year ago

The idea was to establish the state at the time of the timeout, and I listed some information that would help. Looking at that other issue and its content has not helped decide on a next course of action.

I was expecting you to post commands and their outputs.

Like I mentioned before, whether your app pod is busy, or the host it runs on is busy, or either is out of resources, cannot be determined just by looking at another issue's content.

The reason for asking is that I am running 3 workloads on minikube, with passthrough and non-passthrough, and stress testing them, and I do not see the timeout error.
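(For comparison, a simple client-side loop like the following can show whether timeouts appear under sustained load; the URL is a placeholder for whatever hostname the Ingress serves:)

```shell
# Placeholder URL; prints the distribution of HTTP status codes
# over 500 sequential requests, e.g. how many 200s vs 504s.
for i in $(seq 1 500); do
  curl -s -o /dev/null -w "%{http_code}\n" https://dev-aws.example.com/
done | sort | uniq -c
```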

amodsachintha commented 1 year ago

I am facing the same issue on AWS EKS (v1.24 and v1.25). I tried increasing the timeouts, to no avail. I thought it was a networking issue (there are no error logs from the vpc-cni plugin, by the way), so I removed aws-vpc-cni and installed Calico, but that did not fix it. I tested again with v1.4.0, v1.5.1, and the latest v1.6.4. The weird thing is that other deployments are working properly, and some recover intermittently. Here are the controller logs with v1.4.0 and Calico. The failing deployments are a simple PHP app and the Kubernetes Dashboard (both worked earlier).

-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       v1.4.0
  Build:         50be2bf95fd1ef480420e2aa1d6c5c7c138c95ea
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: nginx/1.19.10

-------------------------------------------------------------------------------

W0308 16:01:20.042700       7 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0308 16:01:20.042834       7 main.go:209] "Creating API client" host="https://172.20.0.1:443"
I0308 16:01:20.053563       7 main.go:253] "Running in Kubernetes cluster" major="1" minor="25+" git="v1.25.6-eks-48e63af" state="clean" commit="9f22d4ae876173884749c0701f01340879ab3f95" platform="linux/amd64"
I0308 16:01:20.191543       7 main.go:104] "SSL fake certificate created" file="/etc/ingress-controller/ssl/default-fake-certificate.pem"
I0308 16:01:20.209341       7 ssl.go:533] "loading tls certificate" path="/usr/local/certificates/cert" key="/usr/local/certificates/key"
I0308 16:01:20.228686       7 nginx.go:260] "Starting NGINX Ingress controller"
I0308 16:01:20.249621       7 event.go:285] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"ingress-nginx", Name:"ingress-nginx-controller", UID:"a675086a-b232-4ec0-bf44-846230a8276c", APIVersion:"v1", ResourceVersion:"12352", FieldPath:""}): type: 'Normal' reason: 'CREATE' ConfigMap ingress-nginx/ingress-nginx-controller
I0308 16:01:21.332896       7 store.go:430] "Found valid IngressClass" ingress="kubernetes-dashboard/kubernetes-dashboard-ingress" ingressclass="nginx"
I0308 16:01:21.333572       7 store.go:430] "Found valid IngressClass" ingress="pc-dev/example-web-ingress" ingressclass="nginx"
I0308 16:01:21.333966       7 store.go:430] "Found valid IngressClass" ingress="pc-dev/summarizer-app-ingress" ingressclass="nginx"
I0308 16:01:21.334163       7 event.go:285] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"kubernetes-dashboard", Name:"kubernetes-dashboard-ingress", UID:"11d64d69-2dbb-4839-a6ca-6ece0ebea21f", APIVersion:"networking.k8s.io/v1", ResourceVersion:"6445", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
I0308 16:01:21.334182       7 event.go:285] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"pc-dev", Name:"example-web-ingress", UID:"9824d39d-161d-4efd-8e4e-2418cf60dc9a", APIVersion:"networking.k8s.io/v1", ResourceVersion:"326107", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
I0308 16:01:21.334228       7 store.go:430] "Found valid IngressClass" ingress="pc-dev/summarizer-text-extractor-ingress" ingressclass="nginx"
I0308 16:01:21.334442       7 event.go:285] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"pc-dev", Name:"summarizer-app-ingress", UID:"b58b0e53-be25-4691-900b-edc8a57ab967", APIVersion:"networking.k8s.io/v1", ResourceVersion:"6629", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
I0308 16:01:21.334466       7 event.go:285] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"pc-dev", Name:"summarizer-text-extractor-ingress", UID:"2daea083-90de-43e3-8f66-0a5074fd598f", APIVersion:"networking.k8s.io/v1", ResourceVersion:"6637", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
I0308 16:01:21.431012       7 nginx.go:303] "Starting NGINX process"
I0308 16:01:21.431042       7 leaderelection.go:248] attempting to acquire leader lease ingress-nginx/ingress-nginx-leader...
I0308 16:01:21.435027       7 nginx.go:323] "Starting validation webhook" address=":8443" certPath="/usr/local/certificates/cert" keyPath="/usr/local/certificates/key"
I0308 16:01:21.435553       7 controller.go:168] "Configuration changes detected, backend reload required"
I0308 16:01:21.436278       7 status.go:84] "New leader elected" identity="ingress-nginx-controller-6b5fc6b8b4-77zn5"
I0308 16:01:21.507529       7 controller.go:185] "Backend successfully reloaded"
I0308 16:01:21.507806       7 controller.go:196] "Initial sync, sleeping for 1 second"
I0308 16:01:21.508248       7 event.go:285] Event(v1.ObjectReference{Kind:"Pod", Namespace:"ingress-nginx", Name:"ingress-nginx-controller-5d86fb679c-w2dqg", UID:"50d47abf-850e-45c6-901b-38258638ec53", APIVersion:"v1", ResourceVersion:"369098", FieldPath:""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration
I0308 16:02:20.427978       7 leaderelection.go:258] successfully acquired lease ingress-nginx/ingress-nginx-leader
I0308 16:02:20.428079       7 status.go:84] "New leader elected" identity="ingress-nginx-controller-5d86fb679c-w2dqg"
2023/03/08 16:02:35 [error] 31#31: *569 upstream timed out (110: Operation timed out) while connecting to upstream, client: 108.162.227.25, server: dev-aws.example.com, request: "GET / HTTP/1.1", upstream: "http://192.168.215.1:80/", host: "dev-aws.example.com"
2023/03/08 16:02:40 [error] 31#31: *569 upstream timed out (110: Operation timed out) while connecting to upstream, client: 108.162.227.25, server: dev-aws.example.com, request: "GET / HTTP/1.1", upstream: "http://192.168.215.1:80/", host: "dev-aws.example.com"
2023/03/08 16:02:45 [error] 31#31: *569 upstream timed out (110: Operation timed out) while connecting to upstream, client: 108.162.227.25, server: dev-aws.example.com, request: "GET / HTTP/1.1", upstream: "http://192.168.215.1:80/", host: "dev-aws.example.com"
108.162.227.25 - - [08/Mar/2023:16:02:45 +0000] "GET / HTTP/1.1" 504 562 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36" 1100 14.998 [pc-dev-example-web-service-80] [] 192.168.215.1:80, 192.168.215.1:80, 192.168.215.1:80 0, 0, 0 5.000, 5.000, 5.000 504, 504, 504 a7c98d98ac870f64981a997cd2f373a1
2023/03/08 16:04:26 [error] 32#32: *1488 upstream timed out (110: Operation timed out) while connecting to upstream, client: 172.68.34.41, server: dev-aws.example.com, request: "GET / HTTP/1.1", upstream: "http://192.168.215.1:80/", host: "dev-aws.example.com"
2023/03/08 16:04:31 [error] 32#32: *1488 upstream timed out (110: Operation timed out) while connecting to upstream, client: 172.68.34.41, server: dev-aws.example.com, request: "GET / HTTP/1.1", upstream: "http://192.168.215.1:80/", host: "dev-aws.example.com"
2023/03/08 16:04:36 [error] 32#32: *1488 upstream timed out (110: Operation timed out) while connecting to upstream, client: 172.68.34.41, server: dev-aws.example.com, request: "GET / HTTP/1.1", upstream: "http://192.168.215.1:80/", host: "dev-aws.example.com"
172.68.34.41 - - [08/Mar/2023:16:04:36 +0000] "GET / HTTP/1.1" 504 160 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.2 Safari/605.1.15" 742 15.001 [pc-dev-example-web-service-80] [] 192.168.215.1:80, 192.168.215.1:80, 192.168.215.1:80 0, 0, 0 5.000, 5.000, 5.000 504, 504, 504 609b3af0a29debd459f45785048c7c8c
108.162.226.149 - - [08/Mar/2023:16:05:39 +0000] "GET / HTTP/1.1" 200 45 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36" 1086 0.002 [pc-dev-summarizer-app-service-4000] [] 192.168.203.74:5000 45 0.004 200 bcf467f76d1806a985cbfa539e32497d
2023/03/08 16:05:39 [error] 31#31: *2096 upstream timed out (110: Operation timed out) while connecting to upstream, client: 108.162.227.25, server: dev-aws.example.com, request: "GET / HTTP/1.1", upstream: "http://192.168.215.1:80/", host: "dev-aws.example.com"
108.162.226.251 - - [08/Mar/2023:16:05:42 +0000] "GET / HTTP/1.1" 200 91 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36" 1090 0.002 [pc-dev-summarizer-text-extractor-service-4000] [] 192.168.203.75:3000 91 0.004 200 8be160fd79da18035ef74ee2737fca96
2023/03/08 16:05:44 [error] 31#31: *2096 upstream timed out (110: Operation timed out) while connecting to upstream, client: 108.162.227.25, server: dev-aws.example.com, request: "GET / HTTP/1.1", upstream: "http://192.168.215.1:80/", host: "dev-aws.example.com"
2023/03/08 16:05:49 [error] 31#31: *2096 upstream timed out (110: Operation timed out) while connecting to upstream, client: 108.162.227.25, server: dev-aws.example.com, request: "GET / HTTP/1.1", upstream: "http://192.168.215.1:80/", host: "dev-aws.example.com"
108.162.227.25 - - [08/Mar/2023:16:05:49 +0000] "GET / HTTP/1.1" 504 562 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36" 1126 15.008 [pc-dev-example-web-service-80] [] 192.168.215.1:80, 192.168.215.1:80, 192.168.215.1:80 0, 0, 0 5.000, 5.004, 5.004 504, 504, 504 deaefe97e08770d2ff6f7cc05e85967a
2023/03/08 16:05:52 [error] 32#32: *2215 upstream timed out (110: Operation timed out) while connecting to upstream, client: 108.162.226.135, server: np-k8s-dashboard.example.com, request: "GET / HTTP/1.1", upstream: "https://192.168.215.3:8443/", host: "np-k8s-dashboard.example.com"
2023/03/08 16:05:57 [error] 32#32: *2215 upstream timed out (110: Operation timed out) while connecting to upstream, client: 108.162.226.135, server: np-k8s-dashboard.example.com, request: "GET / HTTP/1.1", upstream: "https://192.168.215.3:8443/", host: "np-k8s-dashboard.example.com"
2023/03/08 16:06:02 [error] 32#32: *2215 upstream timed out (110: Operation timed out) while connecting to upstream, client: 108.162.226.135, server: np-k8s-dashboard.example.com, request: "GET / HTTP/1.1", upstream: "https://192.168.215.3:8443/", host: "np-k8s-dashboard.example.com"
108.162.226.135 - admin [08/Mar/2023:16:06:02 +0000] "GET / HTTP/1.1" 504 562 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36" 1159 15.001 [kubernetes-dashboard-kubernetes-dashboard-443] [] 192.168.215.3:8443, 192.168.215.3:8443, 192.168.215.3:8443 0, 0, 0 5.000, 5.000, 5.000 504, 504, 504 6ece79b7fc58d5d20c5b308eaad44f8b
2023/03/08 16:06:09 [error] 31#31: *2359 upstream timed out (110: Operation timed out) while connecting to upstream, client: 108.162.227.23, server: np-k8s-dashboard.example.com, request: "GET /favicon.ico HTTP/1.1", upstream: "https://192.168.215.3:8443/favicon.ico", host: "np-k8s-dashboard.example.com", referrer: "https://np-k8s-dashboard.example.com/"
2023/03/08 16:06:14 [error] 31#31: *2359 upstream timed out (110: Operation timed out) while connecting to upstream, client: 108.162.227.23, server: np-k8s-dashboard.example.com, request: "GET /favicon.ico HTTP/1.1", upstream: "https://192.168.215.3:8443/favicon.ico", host: "np-k8s-dashboard.example.com", referrer: "https://np-k8s-dashboard.example.com/"
172.71.150.63 - - [08/Mar/2023:16:06:18 +0000] "HEAD / HTTP/1.1" 200 0 "-" "Site24x7" 364 0.001 [pc-dev-summarizer-text-extractor-service-4000] [] 192.168.203.75:3000 0 0.000 200 852bf6287ddb9841d8461deeff676fd2
172.71.151.111 - - [08/Mar/2023:16:06:18 +0000] "HEAD / HTTP/1.1" 200 0 "-" "Site24x7" 360 0.001 [pc-dev-summarizer-app-service-4000] [] 192.168.203.74:5000 0 0.000 200 94c8d6f9e2d8843ff90b11de7c1986eb
2023/03/08 16:06:19 [error] 31#31: *2359 upstream timed out (110: Operation timed out) while connecting to upstream, client: 108.162.227.23, server: np-k8s-dashboard.example.com, request: "GET /favicon.ico HTTP/1.1", upstream: "https://192.168.215.3:8443/favicon.ico", host: "np-k8s-dashboard.example.com", referrer: "https://np-k8s-dashboard.example.com/"
108.162.227.23 - admin [08/Mar/2023:16:06:19 +0000] "GET /favicon.ico HTTP/1.1" 504 562 "https://np-k8s-dashboard.example.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36" 1118 14.997 [kubernetes-dashboard-kubernetes-dashboard-443] [] 192.168.215.3:8443, 192.168.215.3:8443, 192.168.215.3:8443 0, 0, 0 5.000, 5.000, 5.000 504, 504, 504 e9edc959a02abc0823d7b6382095afb8

Curl

*   Trying 104.21.20.159:443...
* TCP_NODELAY set
* Connected to dev-aws.example.com (104.21.20.159) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=*.example.com
*  start date: Jan 26 10:03:30 2023 GMT
*  expire date: Apr 26 10:03:29 2023 GMT
*  subjectAltName: host "dev-aws.example.com" matched cert's "*.example.com"
*  issuer: C=US; O=Let's Encrypt; CN=E1
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x5590454692f0)
> GET / HTTP/2
> Host: dev-aws.example.com
> user-agent: curl/7.68.0
> accept: */*
> 
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* Connection state changed (MAX_CONCURRENT_STREAMS == 256)!
< HTTP/2 504 
< date: Wed, 08 Mar 2023 16:25:42 GMT
< content-type: text/html
< cf-cache-status: DYNAMIC
< report-to: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=x4Pi5TFU5x%2FsaUDI0but%2BhguGiYTarmFSwjwMiFQJjfeuLX29z8y2wCZTSxFFW83nhO6SwtV7g6McNqosaOeaqWURUk%2FcgYvVPAPWyQPFOL%2FDPwBUg%2BsKs8%2FWzVRyr5yGbQ%2Bq5Ee5m8KNUI%3D"}],"group":"cf-nel","max_age":604800}
< nel: {"success_fraction":0,"report_to":"cf-nel","max_age":604800}
< server: cloudflare
< cf-ray: 7a4c6f22ffe03fca-SIN
< alt-svc: h3=":443"; ma=86400, h3-29=":443"; ma=86400
< 
<html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx</center>
</body>
</html>
* Connection #0 to host dev-aws.example.com left intact
* Connection #0 to host dev-aws.example.com left intact

This turned out to be an issue with the security group that controlled inter-node communication in the EKS Terraform module.

longwuyuan commented 1 year ago

@fred-aa if you also had the same problem as @amodsachintha and if the issue is resolved now, then kindly close the issue. thanks

frixou89 commented 1 year ago

I'm facing a similar issue with random timeouts. Can you let us know if you have solved it, and what the issue was? Thanks

longwuyuan commented 1 year ago

If anyone looks closely, there is not a single post here that shows even basic information about the live state, such as the outputs of commands.

So the actual timeout events cannot be analyzed using data from the cluster and data about the requests.
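
For anyone landing here later, the kind of live-state output being asked for can be gathered with commands along these lines (the `ingress-nginx` namespace and controller deployment name follow the default Helm install and are assumptions; the bracketed placeholders are yours to fill in):

```shell
# Controller pods and the nodes they are scheduled on
kubectl get pods -n ingress-nginx -o wide

# Controller logs around the time of a timeout
kubectl logs -n ingress-nginx deploy/ingress-nginx-controller --since=10m | grep "timed out"

# Endpoints behind the failing Service, to confirm whether they sit on other nodes
kubectl get endpoints -n <app-namespace> <service-name>
kubectl describe ingress -n <app-namespace> <ingress-name>
```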

crossRT commented 1 year ago

Thank you for the clue, @amodsachintha. It resolved my problem.

@frixou89 in case your problem still exists. kindly refer to https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/docs/network_connectivity.md#security-groups

add this block to your EKS terraform module:

  # Extend node-to-node security group rules
  node_security_group_additional_rules = {
    ingress_self_all = {
      description = "Node to node all ports/protocols"
      protocol    = "-1"
      from_port   = 0
      to_port     = 0
      type        = "ingress"
      self        = true
    }
    egress_all = {
      description      = "Node all egress"
      protocol         = "-1"
      from_port        = 0
      to_port          = 0
      type             = "egress"
      cidr_blocks      = ["0.0.0.0/0"]
      ipv6_cidr_blocks = ["::/0"]
    }
  }

This allows communication between nodes in the cluster. The reason ingress-nginx throws "Operation timed out" is that it internally proxies requests from one node to another.

ethanhinson commented 1 year ago

Thanks @crossRT, that was also my problem. In my case the issue was well hidden, because nginx would time out trying to connect node-to-node and then fall back to the current node to serve the app. I originally found the problem by tailing the logs of the ingress-nginx namespace and seeing the timeout on every request.

tudor-pop-mimedia commented 2 months ago


I used the AWS Terraform EKS module to create the nodes, which has the property node_security_group_enable_recommended_rules that allows node-to-node communication only on ports 1025-65535. The problem is that ingress-nginx, I believe, sends the traffic over port 80 (as configured in my other services), or 8080, or another popular port such as 3000 (Node.js), in which case you need to allow those ports as well.
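
Building on that: if you prefer not to open all ports node-to-node, a narrower rule covering only your service target ports could look like the sketch below. The rule name and the port (80) are examples taken from the comment above, not a recommendation; this assumes the terraform-aws-eks module's `node_security_group_additional_rules` input.

```terraform
  node_security_group_additional_rules = {
    ingress_app_port_80 = {
      description = "Node to node on app target port 80 only"
      protocol    = "tcp"
      from_port   = 80
      to_port     = 80
      type        = "ingress"
      self        = true
    }
    # Add similar blocks for 8080, 3000, or whatever targetPorts
    # your Services actually use.
  }
```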

frixou89 commented 2 months ago


Thanks for your reply! I'm not sure I remember what the issue was, but this is what I currently have in a working setup for node_security_group_additional_rules:

  node_security_group_additional_rules = {
    ingress_self_all = {
      description = "Node to node all ports/protocols"
      protocol    = "-1"
      from_port   = 0
      to_port     = 0
      type        = "ingress"
      self        = true
    }
  }

longwuyuan commented 1 month ago

This issue contains information showing many customized and changed configuration options. A vanilla install of the controller as documented, without customizations, and a vanilla upstream such as an nginx:alpine deployment with one replica, does not reproduce the timeouts.

This means there are environment-specific factors causing the timeouts. It could be as simple as load on the user's cluster network, or transient changes, as one user has autoscaling enabled.

This information does not show any action item for the project. Keeping the issue open just adds to the tally of open issues without an action item. The project needs to focus on securing the controller by default, reducing support and maintenance of non-Ingress-API features, and implementing the Gateway API. So I am closing this issue.

/close

k8s-ci-robot commented 1 month ago

@longwuyuan: Closing this issue.

In response to [this](https://github.com/kubernetes/ingress-nginx/issues/9443#issuecomment-2337824234):

> This issue has information that shows many configuration options customized and changes. A vanilla install of the controller as documented without customizations and a vanilla upstream like nginx:alpine deployment with one replica does not reproduce timeouts.
>
> So this means that there are environment specific factors causing timeouts. It could be as simple as load on the users's cluster network or transient changes as one user has autoscaling enabled.
>
> This information does not show any action item on the project. The issue being open is just adding to the tally of open issues without any action item. The project needs to focus on securing the controller by default, reducing support and maintenance of non Ingress-API features, and also implement the Gateway-API. So am closing this issue.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
Vinaum8 commented 1 month ago


Love you! <3