kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

Sometimes Liveness/Readiness Probes fail because of net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) #89898

Closed yuripastushenko closed 1 year ago

yuripastushenko commented 4 years ago

What happened: In my cluster the readiness probes sometimes fail, but the application works fine.

Apr 06 18:15:14 kubenode** kubelet[34236]: I0406 18:15:14.056915   34236 prober.go:111] Readiness probe for "default-nginx-daemonset-4g6b5_default(a3734646-77fd-11ea-ad94-509a4c9f2810):nginx" failed (failure): Get http://172.18.123.127:80/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

What you expected to happen: Successful Readiness Probe.

How to reproduce it (as minimally and precisely as possible): We have a few clusters with different workloads. We only have this issue in the cluster with a large number of short-lived pods, and not on all nodes. We can't reproduce this error on our other clusters (which have the same configuration but a different workload). How did I find the issue? I deployed a daemonset:

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: default-nginx-daemonset
  namespace: default
  labels:
    k8s-app: default-nginx
spec:
  selector:
    matchLabels:
      name: default-nginx
  template:
    metadata:
      labels:
        name: default-nginx
    spec:
      tolerations:
      - operator: Exists
      containers:
      - name: nginx
        image: nginx:latest
        resources:
          limits:
            cpu: "1"
            memory: "1Gi"
          requests:
            cpu: "1"
            memory: "1Gi"
        readinessProbe:
          httpGet:
            path: /
            port: 80

Then I started listening for events on all pods of this daemonset. After a while I received the following events:

Warning  Unhealthy  110s (x5 over 44m)  kubelet, kubenode20  Readiness probe failed: Get http://172.18.122.143:80/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Warning  Unhealthy  11m (x3 over 32m)  kubelet, kubenode10  Readiness probe failed: Get http://172.18.65.57:80/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
....

Those events appeared on ~50% of the pods of this daemonset.

On the nodes where the pods with failed probes were placed, I collected the kubelet logs. There were errors like:

Apr 06 14:08:35 kubenode20 kubelet[10653]: I0406 14:08:35.464223   10653 prober.go:111] Readiness probe for "default-nginx-daemonset-nkwkf_default(90a3883b-77f3-11ea-ad94-509a4c9f2810):nginx" failed (failure): Get http://172.18.122.143:80/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

I thought that maybe the nginx in the pod really does respond slowly sometimes. To rule out this theory, I created a short script that curls the application in the pod and stores the response time in a file:

while true; do curl http://172.18.122.143:80/ -s -o /dev/null -w  "%{time_starttransfer}\n" >> /tmp/measurment.txt; done;

I ran this script for 30 minutes on the node where the pod is placed and got the following:

$ cat /tmp/measurment.txt | sort -u
0.000
0.001
0.002
0.003
0.004
0.005
0.006
0.007

$ cat /tmp/measurment.txt | wc -l
482670

There were 482,670 measurements and the longest response time was 0.007 seconds.

In the pod's logs there are only messages with response code 200 (from my requests and from the readiness probe requests):

[06/Apr/2020:14:06:30 +0000] "GET / HTTP/1.1" 200 612 "-" "kube-probe/1.12" "-"
......
[06/Apr/2020:14:08:35 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.47.0" "-"
[06/Apr/2020:14:08:35 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.47.0" "-"
[06/Apr/2020:14:08:35 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.47.0" "-"
......
[06/Apr/2020:14:08:41 +0000] "GET / HTTP/1.1" 200 612 "-" "kube-probe/1.12" "-"

This means that some of the probes are successful.

Then I stopped the curl script (because of the large number of logs) and waited until a new failed-probe error appeared in the kubelet logs:

Apr 06 18:15:14 kubenode13 kubelet[34236]: I0406 18:15:14.056915   34236 prober.go:111] Readiness probe for "default-nginx-daemonset-4g6b5_default(a3734646-77fd-11ea-ad94-509a4c9f2810):nginx" failed (failure): Get http://172.18.123.127:80/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

And in the logs of that nginx pod I didn't find the request generated by this probe:

kubectl logs default-nginx-daemonset-4g6b5 | grep "15:15"
[06/Apr/2020:18:15:00 +0000] "GET / HTTP/1.1" 200 612 "-" "kube-probe/1.12" "-"
[06/Apr/2020:18:15:20 +0000] "GET / HTTP/1.1" 200 612 "-" "kube-probe/1.12" "-"
[06/Apr/2020:18:15:30 +0000] "GET / HTTP/1.1" 200 612 "-" "kube-probe/1.12" "-"
[06/Apr/2020:18:15:40 +0000] "GET / HTTP/1.1" 200 612 "-" "kube-probe/1.12" "-"
[06/Apr/2020:18:15:50 +0000] "GET / HTTP/1.1" 200 612 "-" "kube-probe/1.12" "-"

If I restart the kubelet, the error doesn't disappear. Does anyone have any suggestions about this?

Environment:

Seems like the problem appears in many different installations - https://github.com/kubernetes/kubernetes/issues/51096

/sig network

yuripastushenko commented 4 years ago

/sig network

athenabot commented 4 years ago

/triage unresolved

Comment /remove-triage unresolved when the issue is assessed and confirmed.

🤖 I am a bot run by vllry. 👩‍🔬

Nittarab commented 4 years ago

Are you sure the application doesn't hit the resource limits?

In my case, the application started fine, then the container started using more and more resources until it hit the limit. After that the readiness probes failed.

yuripastushenko commented 4 years ago

Yes, I am sure. I collected the metrics: CPU usage on the pod was around 0.001 and RAM around 4 MB. This is a test application (pure nginx image) that did not do anything (no traffic is sent to it).

yuripastushenko commented 4 years ago

/area kubelet

manikanta-kondeti commented 4 years ago

Any update on this? We have been facing similar issues for a few weeks.

thockin commented 4 years ago

This is the first such report I have seen. There's nothing obvious about why this would happen.

It's possible kubelet is too busy and starved for CPU, and the probe happened to be the thing that got thrashed. How many pods are on this machine? How busy is it?

It's possible the node itself is thrashing or something and OOM behavior is weird. Does dmesg show any OOMs?

It's possible some other failure down deep in kubelet is being translated into this? You could try running kubelet at a higher log level to get more details on what is happening.

A lot of bugs have been fixed since 1.12, so we'd need to try to reproduce this and then try again in a more recent version. Is there any way you can help boil down a simpler reproduction?
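
(A minimal sketch of those node-side checks, assuming shell access to the node, a systemd-managed kubelet, and kubectl access from a workstation; names and flags will vary per install.)

# Look for OOM kills or other kernel-level pressure on the node
dmesg -T | grep -iE 'oom|out of memory'

# Rough view of how busy the node is
uptime

# Count the pods scheduled on this node (replace <node-name>)
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name> | wc -l

# Watch kubelet logs around a failing probe; raising kubelet verbosity (e.g. --v=4) adds more prober detail
journalctl -u kubelet -f | grep -i probe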

athenabot commented 4 years ago

@thockin If this issue has been triaged, please comment /remove-triage unresolved.

If you aren't able to handle this issue, consider unassigning yourself and/or adding the help-wanted label.

🤖 I am a bot run by vllry. 👩‍🔬

manikanta-kondeti commented 4 years ago

@thockin
returns-console-558995dd78-b6bjm 0/1 Running 0 23h

For example, in the get pods output above, the readiness probe of one pod fails almost every day. A restart fixes it, but we're unable to find the root cause, and I don't see any abnormal numbers for CPU, memory, or thread count. Doing a describe pod only tells me that readiness has failed. How do we debug such a scenario?

Also, this is happening for this particular deployment only. The other pods are running fine.

"returns-console-558995dd78-pbjf8 1/1 Running 0 23h returns-console-558995dd78-plqdr 1/1 Running 0 23h returns-console-558995dd78-tlb5k 1/1 Running 0 23h returns-console-558995dd78-tr2kd 1/1 Running 0 23h returns-console-558995dd78-vfj2n 1/1 Running 0 23h"

tarunwadhwa13 commented 4 years ago

We too are facing the same issue in our cluster.

Kubernetes version: 1.16.8. Kernel: 4.4.218-1. OS: CentOS 7.

Installed using Kubespray

In our case, the timeouts were not related to the application but to specific nodes in the cluster. In our 13-node cluster, nodes 1 to 4 had some kind of issue wherein the pods running on them had random failures due to timeouts.

We also checked that there weren't any CPU or memory usage spikes.

P.S. We are using NodePort for the production use case. Is it possible that the NodePort setup cannot handle that many socket connections?

thockin commented 4 years ago

I have no idea what might cause spurious probe failures. @derekwaynecarr have you heard any other reports of this?

@tarunwadhwa13 are you saying PROBES failed (always same node) or user traffic failed? If you have any other information about what was going on with those nodes when the failures happen, it would help. Check for OOMs, high CPU usage, conntrack failures?
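
(A hedged sketch of how the conntrack angle could be checked on a suspect node, assuming the conntrack CLI is installed and the usual /proc paths:)

# Current vs. maximum tracked connections
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# Per-CPU conntrack statistics; growing insert_failed or drop counters point at conntrack trouble
conntrack -S

# Kernel messages logged when the table overflows
dmesg -T | grep -i 'table full'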

tarunwadhwa13 commented 4 years ago

@thockin Conntrack shows hardly 2 or 3 errors. Memory consumption is 60-65% per node.

Just found that the timeouts were for almost all requests, not just probes. We recently added Istio to check connection stats and understand whether the behaviour was due to the application, but the findings were weird: Istio itself is now failing its readiness probe quite frequently.


157 failures in ~3 hours

I would like to add that Kubernetes is running in our datacenter, and since the iptables version is 1.4.21, --random-fully is not available. But since all machines have the same configuration, we ruled out this possibility.

casey-robertson commented 4 years ago

I apologize for not having a lot of details to share, but I'd add my 2 cents. We recently upgraded from Istio 1.4.4 to 1.5.4 and started seeing the issues described by the OP: lots of liveness/readiness failures that were not happening before. It SEEMS like adding, say, a 20-second initial delay has helped in most cases. At this point we are still seeing it and are not sure what the root cause is.

running on EKS 1.15 (control plane) / 1.14 managed nodeGroups

athenabot commented 4 years ago

@thockin If this issue has been triaged, please comment /remove-triage unresolved.

If you aren't able to handle this issue, consider unassigning yourself and/or adding the help-wanted label.

🤖 I am a bot run by vllry. 👩‍🔬

thockin commented 4 years ago

I'm afraid the only way to know more is to get something like a tcpdump from both inside and outside the pod, which captures one or more failing requests.

Possible?
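
(One way to take both captures, as a sketch: it assumes shell access to the node, a Docker-based runtime, and reuses a pod IP from this thread as a placeholder.)

# "Outside" the pod: capture on the node, filtered to the pod IP and probe port
tcpdump -i any -nn -w /tmp/node-side.pcap host 172.18.122.143 and port 80

# "Inside" the pod: enter the pod's network namespace via its container PID and capture there
PID=$(docker inspect --format '{{.State.Pid}}' <nginx-container-id>)
nsenter -t "$PID" -n tcpdump -i any -nn -w /tmp/pod-side.pcap port 80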

den-is commented 4 years ago

I'm getting the same issue. A couple of Nginx+PHP pods running on huge instances in parallel with a couple of other small applications. These are staging apps and nodes without constant traffic. I constantly receive notifications that these Nginx+PHP apps have restarted, specifically the Nginx container of these pods. At the same time, other apps running in the same namespace and nodes never restart.

k -n staging get events
...
22m         Warning   Unhealthy                  pod/myapp-staging3-59f75b5d49-5tscw    Liveness probe failed: Get http://100.100.161.197:80/list/en/health/ping: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
23m         Warning   Unhealthy                  pod/myapp-staging4-7589945cc6-2nf4s    Liveness probe failed: Get http://100.100.161.198:80/list/en/health/ping: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
11m         Warning   Unhealthy                  pod/myapp-staging5-84fb4494db-5dvph    Liveness probe failed: Get http://100.104.124.220:80/list/en/health/ping: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
...

Liveness on an Nginx container looks like this:

  liveness:
    initialDelaySeconds: 10
    periodSeconds: 10
    failureThreshold: 3
    httpGet:
      path: /list/en/health/ping
      port: 80

UPD: The strange thing is that completely distinct deployments (staging4, staging5, stagingN as you can see above, more than 10 deployments) fail all at once.

My problem might be the missing timeoutSeconds, which defaults to 1s.

rudolfdobias commented 4 years ago

Having the same problem here.

livenessProbe:
        httpGet:
          path: /status
          port: 80
        initialDelaySeconds: 30
        periodSeconds: 10
        timeoutSeconds: 10
        failureThreshold: 3

The error occurs roughly 3-10 times a day:

Liveness probe failed: Get http://10.244.0.154:8002/status: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

The service operates normally and responds to /status in 5ms, though.

⚠️ Also, and that is a bigger problem, at similar random times some pods refuse to connect to each other.

Running on Azure Kubernetes Service V 1.14.8

arjun921 commented 4 years ago

I'm facing the same issue as well; increasing timeoutSeconds didn't help.

          livenessProbe:
            httpGet:
              path: /ping
              port: http
            failureThreshold: 5
            initialDelaySeconds: 5
            periodSeconds: 20
            timeoutSeconds: 10
          readinessProbe:
            httpGet:
              path: /ping
              port: http
            initialDelaySeconds: 5
            periodSeconds: 20
            timeoutSeconds: 10

Running on Kubernetes v1.16.7 on AWS, deployed via KOPS

thockin commented 4 years ago

I appreciate the extra reports. It sounds like something is really weird. I'll repeat myself from above:

I'm afraid the only way to know more is to get something like a tcpdump from both inside and outside the pod, which captures one or more failing requests.

Is that possible? Without that I am flying very blind. I don't see this experimentally and I'm not flooded with reports of this, so it's going to be difficult to pin down. If you say you see it repeatedly, please try to capture a pcap?

arjun921 commented 4 years ago

@thockin I'll try to get a dump if I'm able to replicate this issue consistently, since it tends to happen randomly. Just to clarify, what exactly did you mean by tcpdump outside the pod? Did you mean tcpdump of the node where the pod resides?

Sorry, I'm relatively new to this :sweat_smile:

athenabot commented 4 years ago

@thockin If this issue has been triaged, please comment /remove-triage unresolved.

If you aren't able to handle this issue, consider unassigning yourself and/or adding the help-wanted label.

🤖 I am a bot run by vllry. 👩‍🔬

fscz commented 4 years ago

/remove-triage unresolved

bigbitbus commented 4 years ago

We are also seeing this issue. We're trying to run a Django+WSGI+nginx server that works on a lower version of K8s; when we try this on Linode's managed Kubernetes service (LKE) we see this:

  Type     Reason     Age    From                                Message
  ----     ------     ----   ----                                -------
  Normal   Scheduled  2m45s  default-scheduler                   Successfully assigned be/bigbitbus-stack-64c7b65b97-c2cth to lke6438-8072-5eecc7cc4d98
  Normal   Pulled     2m45s  kubelet, lke6438-8072-5eecc7cc4d98  Container image "##############/bigbitbus/bigbitbus-$$$$$$-server:###########" already present on machine
  Normal   Created    2m44s  kubelet, lke6438-8072-5eecc7cc4d98  Created container bbbchart
  Normal   Started    2m44s  kubelet, lke6438-8072-5eecc7cc4d98  Started container bbbchart
  Warning  Unhealthy  9s     kubelet, lke6438-8072-5eecc7cc4d98  Liveness probe failed: Get http://10.2.1.32:80/api/jindahoon/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

den-is commented 4 years ago

UPD: So since my last comment, I've updated timeouts to something more realistic. e.g.:

  livenessProbe:
    initialDelaySeconds: 30
    periodSeconds: 10
    timeoutSeconds: 5
    failureThreshold: 3
    successThreshold: 1
    httpGet:
      path: /health/ping
      port: 80

Everything had been quiet since then. But the day before yesterday and yesterday, my prometheus-operator Alertmanager and Slack exploded with alerts. And today, as if by magic, it is silent again; not a single alert.

My app is a typical PHP+Nginx pod. The staging environments are deployed on 3 huge boxes: just a dozen preview deploys, each deployed only once and then opened by devs only once. No load at all. For example, the app had 10 staging deploys (app-staging1..10), and all containers of exactly that one app were failing all at once. I repeat: virtually unrelated deployments spread over 3 nodes. In each pod, only the Nginx container was restarting. In the logs, I was able to catch the actual Nginx answering 499 to the liveness probe. Nothing else. Still investigating.

Meanwhile, ~5 other PHP app deploys on the same staging nodes never restart.

Siddharthk commented 4 years ago

I am also facing the same issue. Running on Kubernetes v1.16.8 on AWS EKS. Any workarounds?

zaabskr commented 4 years ago

I am also facing the same issue with AKS 1.16.9. This is occurring consistently. Is there any way to investigate the issue?

AndriiNeverov commented 4 years ago

@zaabskr

Looks like this error message is a symptom of a more general problem of node-to-pod connectivity issues.

In our case, we confirmed that by using traceroute from the node to the pod's internal IP as suggested in https://discuss.kubernetes.io/t/who-where-actually-work-liveness-probe-in-kubernetes/4771/6. That IP was perfectly routable from the other pods and even from many other nodes, but not from that particular node it ended up on. The root cause turned out to be that the br-int (bridge) interface that connects the pod networks and the NSX overlay logical switch was failing to come up because of missing static routes. Fixing that magically solved the problem.

In other cases, the root cause may be different, but the first step would be to confirm whether the node can actually access that pod, e.g. ssh and traceroute or curl into the readiness probe (curl -v http://172.70.3.3:15020/healthz/ready when Istio sidecar containers fail to get ready), etc.
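
(For reference, a minimal version of that node-side connectivity check; the pod IP, port, and path below are just the examples mentioned above:)

# Run these from the node the pod is scheduled on
ip route get 172.70.3.3                                        # which route/interface the node uses to reach the pod
traceroute -n 172.70.3.3                                       # does the path into the pod network work at all?
curl -v --max-time 2 http://172.70.3.3:15020/healthz/ready     # roughly what the kubelet's httpGet probe does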

chasebolt commented 4 years ago

Experiencing this issue in multiple GKE clusters; it just started about 5 days ago. Right now I'm having to drain the bad node and remove it from the pool. Running 1.16.9-gke.6.

colotiline commented 4 years ago

My issue was in a REST API dependency that was checked by the readinessProbe. Doubling the readiness timeout helped.

23ewrdtf commented 4 years ago

What happened: After updating EKS from v1.14.9-eks-f459c0 to v1.15.11-eks-065dce, some of our ReplicaSets' and DaemonSets' Liveness/Readiness probes started failing with Liveness probe failed: Get http://xxx.xxx.xxx.xxx:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

It seems that random pods are affected. I can successfully curl to them from other pods.

All nodes and pods are fine; no CPU/memory/storage issues.

Two nodes are running 1.15 and the rest 1.14. There doesn't seem to be a correlation between node version and the issue.

Nothing obvious in kubectl cluster-info dump or journalctl -r or kube-proxy.log

Probes are mostly configured like this (truncated):

    livenessProbe:
      failureThreshold: 3
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1

    readinessProbe:
      failureThreshold: 3
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1

I will increase the timeouts but it would be good to solve this. What changed in 1.15?

What you expected to happen: Liveness/Readiness probes not to fail.

How to reproduce it (as minimally and precisely as possible): Upgrade EKS 1.14 to 1.15

Environment:

1.15

$ ami-0c42d7bd0e31ee2fe amazon-eks-node-1.15-v20200814
$ uname -a
Linux ip-xxxxxx.xxxxxx.compute.internal 4.14.186-146.268.amzn2.x86_64 #1 SMP Tue Jul 14 18:16:52 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
casey-robertson commented 4 years ago

@zaabskr

Looks like this error message is a symptom of a more general problem of node-to-pod connectivity issues.

In our case, we confirmed that by using traceroute from the node to the pod's internal IP as suggested in https://discuss.kubernetes.io/t/who-where-actually-work-liveness-probe-in-kubernetes/4771/6. That IP was perfectly routable from the other pods and even from many other nodes, but not from that particular node it ended up on. The root cause turned out to be that the br-int (bridge) interface that connects the pod networks and the NSX overlay logical switch was failing to come up because of missing static routes. Fixing that magically solved the problem.

In other cases, the root cause may be different, but the first step would be to confirm whether the node can actually access that pod, e.g. ssh and traceroute or curl into the readiness probe (curl -v http://172.70.3.3:15020/healthz/ready when Istio sidecar containers fail to get ready), etc.

For those of us running a managed service... this is not something we should ever have to deal with. What version of k8s are you running, and where?

sownak commented 4 years ago

I am using AWS EKS 1.16 and have been getting the same problems recently.

23ewrdtf commented 4 years ago

Update: After two days of checking everything (including a cluster dump), the issue was with the application/pod itself. The application was failing to respond to liveness/readiness probes (in a specific scenario) and was killed. Initially no obvious reason for this was found, and it correlated with the EKS upgrade, so we (wrongly) blamed EKS 1.15.

initialDelaySeconds didn't help, since the pod was fine during startup; timeoutSeconds did help.

Things to test:

CharlyRipp commented 4 years ago

Dealing with the same issue on a GKE cluster, except with a twist. Not to add to the fire, but maybe this helps others who have the same issue.

> kubectl version
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.11-gke.5", GitCommit:"baccd25d44f1a0d06ad3190eb508784efbb990a5", GitTreeState:"clean", BuildDate:"2020-06-25T22:55:26Z", GoVersion:"go1.13.9b4", Compiler:"gc", Platform:"linux/amd64"}

I have a Spring Boot app with Actuator, running Actuator on a separate port:

        name: gateway
        ports:
        - containerPort: 8080
          name: http-server
          protocol: TCP
        - containerPort: 8000
          name: http-actuator
          protocol: TCP
        livenessProbe:
          failureThreshold: 6
          httpGet:
            path: /actuator/health
            port: http-actuator
            scheme: HTTP
          initialDelaySeconds: 15
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        readinessProbe:
          failureThreshold: 6
          httpGet:
            path: /actuator/health
            port: http-actuator
            scheme: HTTP
          initialDelaySeconds: 15
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1

In this setup, using named container ports, it emits these events:

31s         Warning   Unhealthy           pod/gateway-547c5fb7b-rdtkt     Readiness probe failed: Get http://10.52.2.193:8000/actuator/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
31s         Warning   Unhealthy           pod/gateway-547c5fb7b-rdtkt     Liveness probe failed: Get http://10.52.2.193:8000/actuator/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

The following works:

kubectl port-forward gateway-547c5fb7b-rdtkt 8000:8000
curl 127.0.0.1:8000/actuator/health

# AND

kubectl exec -t busybox-b879ddf9c-jv4mw -- curl 10.52.2.193:8000/actuator/health

When I removed the named containerPort references from readiness/liveness and used the numeric port, it passed the checks:

        name: gateway
        ports:
        - containerPort: 8080
          name: http-server
          protocol: TCP
        livenessProbe:
          failureThreshold: 6
          httpGet:
            path: /actuator/health
            port: 8000
            scheme: HTTP
          initialDelaySeconds: 15
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        readinessProbe:
          failureThreshold: 6
          httpGet:
            path: /actuator/health
            port: 8000
            scheme: HTTP
          initialDelaySeconds: 15
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1

I haven't been able to find logs/records of what endpoint it's calling, but I suspect it's still pod IP + port; for some reason it now works?
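
(If it helps, one way to double-check the probe the kubelet has for that pod is to dump it from the live object; the pod/container names are the ones from above, and this is just a sketch:)

# Shows the liveness probe as stored on the pod; a named port here is resolved
# against the container's ports list by the kubelet when it runs the probe
kubectl get pod gateway-547c5fb7b-rdtkt -o jsonpath='{.spec.containers[0].livenessProbe.httpGet}'

# The probe target address is the pod IP, visible via:
kubectl get pod gateway-547c5fb7b-rdtkt -o wide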

nick-cromwell commented 4 years ago

I'm also seeing this 3+ times a day after upgrading my cluster to 1.16.11-gke.5 on GKE a few weeks ago. I ignored it for a while, but the past few days I've been running load tests against my services with k6 and I'm seeing sporadic latency in my cluster. I'm beginning to think it's related.

After a full day spent trying to narrow it down, I can confirm the latency affects any network connection: between any pod and Redis, Cloud SQL, or other pods over HTTP.

I'm still investigating, but I've shut down all external logging (Datadog), and under moderate load the delay shows up every 5-6 seconds, adding 200-1500 ms and affecting any and all pods.

I can't speak for the loads these other systems are under, but my guess is the health check readiness timeout is a symptom of a more pervasive networking issue.

EDIT: I should add, the readiness timeouts occur when run against my Nginx Ingress which was installed via Helm - not against a pod containing a container I built or configured readiness probes for.

sownak commented 3 years ago

Now facing this mainly for the tiller pod (Helm); all other pods are generally working fine. I shut down all the EKS worker nodes and then recreated them, and that seems to have solved the issue. Looks like a Kubernetes 1.16 issue to me.

nick-cromwell commented 3 years ago

I can't share a full tcpdump of our internal services, but I have done tcpdumps and have some more to share about this issue. First, I upgraded my GKE cluster to 1.17.9-gke.1503 and upgraded my nginx ingress via Helm (ingress-nginx/ingress-nginx) to chart version ingress-nginx-2.15.0. I'm still seeing occasional timeouts, but I still don't believe this is isolated to the ingress.

When I have the system under moderate load, I can see that my service throughput bottoms out occasionally, and it affects every connection in the cluster. When throughput is consistent, my service response time averages ~50 ms; when throughput bottoms out, it jumps to 700+ ms.

This shows throughput from Service-A to Service-B: [screenshot: ApiGateway_Projects]

And this shows throughput from Service-B to its highly available Redis store: [screenshot: Projects_Redis]

At the time of the capture everything had been cached in Redis, but the same issues are seen when querying the managed database. The throughput drops also line up with TCP DUP ACKs to our external logging platform.

I'm trying to wade through tcpdumps of kube-dns and kube-proxy but they're a lot to look at.

This is our staging cluster, a 4-node n1-standard-1 cluster running entirely in us-central1-c, where I have limited the number of service instances to 1 pod per service. However, the problem is consistent on the beefier production cluster, where all pods' throughput drops at fairly consistent intervals.

scolbeck commented 3 years ago

We had this same problem, and it ended up being due to our pod consuming more CPU than allowed by the pod's CPU resource limit. This seemed to block networking to the pod in general during the high-CPU periods.
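
(A quick way to check that theory on a suspect pod is to look at the CPU throttling counters for its container cgroup. A sketch assuming cgroup v1 and shell access to the node; the exact path depends on the runtime and cgroup driver, and kubectl top needs metrics-server.)

# nr_throttled / throttled_time growing over time means the container keeps hitting its CPU quota
cat /sys/fs/cgroup/cpu/kubepods/burstable/pod<pod-uid>/<container-id>/cpu.stat

# Compare live usage against the configured limit
kubectl top pod <pod-name>
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].resources.limits.cpu}'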

lazystone commented 3 years ago

@scolbeck we faced that issue, but in our case (k8s 1.19.2, Ubuntu 18.04) we disabled CPU limits completely (removed them from the deployment descriptors, and kubelet is running with --cpu-cfs-quota=false) - that didn't help.

lazystone commented 3 years ago

@CharlyRipp do I understand correctly that for you the issue was resolved when you started using the actual port numbers instead of named ports in the probes?

winds6206 commented 3 years ago

Hi! I am facing the same issue; it has happened to me on GKE 1.14 and 1.15. My cluster has two deployments with an nginx container, and both deployments have liveness/readiness probes set. In only one deployment (six replicas), all pods sometimes become unhealthy at the same time.

Liveness probe failed: Get http://10.28.6.19:80/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Readiness probe failed: Get http://10.28.6.19:80/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Is it possible the issue only happens on nginx containers with liveness/readiness probes?

jordips commented 3 years ago

Hi!

Same issue here:

I'm getting "Liveness probe failed: Get XXX net/http: request canceled (Client.Timeout exceeded while awaiting headers)" randomly every 3h o 4h for some of our k8s applications (not readiness problems). In our case, we check that:

Is not disruptive as you can see, but we have random alerts in out monitoring systems. So it would be nice to find the reason. If I find something more, will report.

EDIT: Modified timeoutSeconds on Liveness but still getting errors. Keep looking...

JamesDowning commented 3 years ago

Same problem here:

Readiness probe failed: Get http://xxxxx:80/healthcheck/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Running the same request via a pod in the same namespace (curl http://xxxxx:80/healthcheck/) returns a 200 response.

EDIT: Even with longer timeouts, the behaviour is extremely inconsistent (ranging from no readiness probe failures to several).

initialDelaySeconds: 6
timeoutSeconds: 10

brianpham commented 3 years ago

Ran into a similar issue this week after upgrading EKS 1.17 to 1.18, where my Grafana cluster would only come partially up. Two of the three pods would report a liveness or readiness probe failure with connection refused. I tried many different solutions with no luck.

Eventually, the only thing that seemed to fix the issue was performing a rolling restart of the worker nodes, as suggested by @sownak. Now all the pods come up successfully, even after I manually terminate them.

jicki commented 3 years ago

We too are facing the same issue in our cluster.

wyyl1 commented 3 years ago

Please make sure all your worker nodes can access the Internet. I also encountered a similar problem.

$ kubectl get node

NAME         STATUS   ROLES    AGE   VERSION
k8s-master   Ready    master   16h   v1.18.5
k8s-work1    Ready    <none>   16h   v1.18.5
k8s-work2    Ready    <none>   15h   v1.18.5

The reason: k8s-master and k8s-work2 could reach the Internet, but k8s-work1 had no Internet access.

jicki commented 3 years ago

@yuripastushenko Hello, is the problem solved now?

wufengzlw commented 3 years ago

Same issue, sometimes I get this K8S event:

Liveness probe failed: Get http://192.168.87.238:8100/status: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

ghost commented 3 years ago

We're also seeing the same issue: random failures of the liveness and readiness probes (which we are very, very sure are not rooted in the application) with net/http: request canceled (Client.Timeout exceeded while awaiting headers), using Kubernetes 1.15.

Managed EKS at AWS: Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.11-eks-065dce", GitCommit:"065dcecfcd2a91bd68a17ee0b5e895088430bd05", GitTreeState:"clean", BuildDate:"2020-07-16T01:44:47Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}

sebastian-schuth commented 3 years ago

Adding to this (@cseifert-betterdoc and I both work on the same issue within our K8s / EKS cluster, so the version etc. posted above apply here):

I think we have proof that there is a networking issue between the kubelet and the container.

Our Problem

We have had problems with our cluster, as it restarted containers seemingly at random, giving failing health checks as the reason.

Of course, we started to dig into why our container fails (as we all know, "SQL is not broken"). After a lot of deep and thorough digging into metrics and logs, we came to the conclusion that our container (running a Rails-based web app) is just fine.

What we did

We concluded that our network setup is somehow flawed (which is strange, given that we use EKS for a reason…), so we switched the health check to a simple exec command to see if this mitigates the problem.

We left the readiness endpoint alone, so this is still using HTTP.
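
(For anyone wanting to try the same workaround, a hedged sketch of an exec-based liveness probe next to an HTTP readiness probe; the path, port, and timings are placeholders rather than the poster's actual values, and the exec form assumes curl exists inside the container.)

        livenessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - curl -fsS --max-time 2 http://127.0.0.1:80/health || exit 1
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /health
            port: 80
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3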

What we see now

Now, we see readiness probe failures, but no health check failures any more.

In addition, we can see that our app is happily serving HTTP requests, while there are times when the readiness endpoint is not being called.

[screenshot: Requests]

Our conclusion (so far)

Our view on this:

  1. we see that the container is reachable and responsive;
  2. at the same time, the readiness check endpoint is NOT receiving any requests from the kubelet.

This means that the kubelet, for some reason, is not able to send readiness probe HTTP requests to the container for some seconds, and then starts to get them through later.

At the same time, requests coming in from an external load balancer are still served.

As far as we understand, the kubelet process runs on the node, so this looks like the networking for "node-local" requests is unreliable.

This looks like a serious issue to us, as the health checks killed our app without good reason, repeatedly.


I understand that this problem looks a lot like people using K8S for unstable apps and blaming the platform instead of doing their homework, but as far as I can tell, what we see proves that this is not the case for us and that there is a real problem.

It would be great to get some feedback on this issue -- let me know if there is a misguided thought. This is complex stuff, after all. :)