Closed yuripastushenko closed 1 year ago
/sig network
/triage unresolved
Comment /remove-triage unresolved when the issue is assessed and confirmed.
🤖 I am a bot run by vllry. 👩‍🔬
Are you sure the application doesn't hit its resource limits?
In my case, the application starts fine, then the container uses more and more resources until it hits the limit. After that, the readiness probes fail.
Yes, I am sure. I collected the metrics: CPU usage on the pod was around 0.001 cores, RAM around 4 MB. This is a test application (a plain nginx image) that does not do anything (no traffic is sent to it).
/area kubelet
Any update on this? We have been facing similar issues for a few weeks.
This is the first such report I have seen. There's nothing obvious about why this would happen.
It's possible kubelet is too busy and starved for CPU, and the probe happened to be the thing that got thrashed. How many pods are on this machine? How busy is it?
It's possible the node itself is thrashing or something and OOM behavior is weird. Does dmesg show any OOMs?
It's possible some other failure deep down in kubelet is being translated into this. You could try running kubelet at a higher log level to get more details on what is happening.
A lot of bugs have been fixed since 1.12, so we'd need to try to reproduce this and then try again in a more recent version. Is there any way you can help boil down a simpler reproduction?
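For anyone hitting this, the checks suggested above can be run roughly like this on the affected node. This is a sketch: `<node-name>` is a placeholder, `kubectl top` assumes metrics-server is installed, and the kubelet restart assumes a systemd-managed kubelet.

```shell
# How many pods are scheduled on this node, and how busy is it?
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name> | wc -l
kubectl top node <node-name>

# Did the kernel OOM-kill anything?
dmesg -T | grep -i -e 'out of memory' -e 'oom'

# Raise kubelet verbosity (add --v=4 to its flags first; unit name may vary by distro)
sudo systemctl restart kubelet && journalctl -u kubelet -f
```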
@thockin
If this issue has been triaged, please comment /remove-triage unresolved
.
If you aren't able to handle this issue, consider unassigning yourself and/or adding the help-wanted
label.
🤖 I am a bot run by vllry. 👩‍🔬
@thockin
returns-console-558995dd78-b6bjm   0/1   Running   0   23h
For example, in the get pods output above, the readiness probe of one pod is failing almost every day. A restart fixes it, but we're unable to find the root cause, and I don't see any abnormal numbers for CPU, memory, or thread count. A describe pod only tells me that the readiness probe has failed. How do we debug such a scenario?
Also, this is happening for this particular deployment only; the other pods are running fine:
returns-console-558995dd78-pbjf8   1/1   Running   0   23h
returns-console-558995dd78-plqdr   1/1   Running   0   23h
returns-console-558995dd78-tlb5k   1/1   Running   0   23h
returns-console-558995dd78-tr2kd   1/1   Running   0   23h
returns-console-558995dd78-vfj2n   1/1   Running   0   23h
We too are facing the same issue in our cluster.
Kubernetes version: 1.16.8, kernel: 4.4.218-1, OS: CentOS 7
Installed using Kubespray
In our case, the timeouts were not related to the application but to specific nodes in the cluster. In our 13-node cluster, nodes 1 to 4 had some kind of issue wherein the pods running on them had random failures due to timeouts.
We also checked that there weren't any CPU or memory usage spikes.
P.S. We are using NodePort for the production use case. Is it possible that the NodePort setup cannot handle too many socket connections?
I have no idea what might cause spurious probe failures. @derekwaynecarr have you heard any other reports of this?
@tarunwadhwa13 are you saying PROBES failed (always same node) or user traffic failed? If you have any other information about what was going on with those nodes when the failures happen, it would help. Check for OOMs, high CPU usage, conntrack failures?
@thockin Conntrack shows hardly 2 or 3 errors. Memory consumption is 60-65% per node.
Just found that the timeouts were for almost all requests, not just probes. We recently added Istio to check connection stats and understand whether the behaviour was due to the application, but the findings were weird: Istio itself is now failing its readiness probe quite frequently.
157 failures in ~3 hours
I'd like to add that Kubernetes is running in our datacenter, and since the iptables version is 1.4.21, --random-fully is not implemented. But since all machines have the same configuration, we ruled out this possibility.
I apologize for not having a lot of details to share, but I'd add my 2 cents. We recently upgraded from Istio 1.4.4 to 1.5.4 and started seeing the issues described by OP: lots of liveness/readiness failures that were not happening before. It SEEMS like adding, say, a 20s initial delay has helped in most cases. At this point we are still seeing it and not sure what the root cause is.
running on EKS 1.15 (control plane) / 1.14 managed nodeGroups
I'm afraid the only way to know more is to get something like a tcpdump from both inside and outside the pod, which captures one or more failing requests.
Possible?
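For anyone unsure how to capture both sides, a rough sketch, assuming shell access to the node and a Docker runtime; the pod IP, port, and container name are placeholders:

```shell
# Node side: kubelet's probe requests originate on the node itself
sudo tcpdump -i any -nn -w /tmp/node-side.pcap host <pod-ip> and tcp port 80

# Pod side: if the image has no tcpdump, enter the pod's network namespace from the node
PID=$(docker inspect -f '{{.State.Pid}}' <container-name>)
sudo nsenter -t "$PID" -n tcpdump -i eth0 -nn -w /tmp/pod-side.pcap tcp port 80
```

Comparing the two captures around a failed probe should show whether the request ever left the node and whether it ever reached the pod.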
I'm getting the same issue: a couple of Nginx+PHP pods running on huge instances, in parallel with a couple of other small applications. These are staging apps/nodes without constant traffic. I constantly receive notifications that the Nginx+PHP app has restarted, specifically the Nginx container of these pods. At the same time, other apps running in the same namespace and nodes never restart.
k -n staging get events
...
22m Warning Unhealthy pod/myapp-staging3-59f75b5d49-5tscw Liveness probe failed: Get http://100.100.161.197:80/list/en/health/ping: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
23m Warning Unhealthy pod/myapp-staging4-7589945cc6-2nf4s Liveness probe failed: Get http://100.100.161.198:80/list/en/health/ping: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
11m Warning Unhealthy pod/myapp-staging5-84fb4494db-5dvph Liveness probe failed: Get http://100.104.124.220:80/list/en/health/ping: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
...
Liveness on an Nginx container looks like this:
liveness:
initialDelaySeconds: 10
periodSeconds: 10
failureThreshold: 3
httpGet:
path: /list/en/health/ping
port: 80
UPD: The strange thing is that completely distinct deployments (staging4, staging5, stagingN above; more than 10 deployments) fail at once.
My problem might be the missing timeoutSeconds, which defaults to 1s.
Having the same problem here.
livenessProbe:
httpGet:
path: /status
port: 80
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 10
failureThreshold: 3
The error occurs roughly 3-10 times a day:
Liveness probe failed: Get http://10.244.0.154:8002/status: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
The service operates normally and responds to /status in 5ms, though.
⚠️ Also, and that is a bigger problem: at similar random times some pods refuse to connect to each other.
Running on Azure Kubernetes Service V 1.14.8
I'm facing the same issue as well, increasing timeoutSeconds didn't help.
livenessProbe:
httpGet:
path: /ping
port: http
failureThreshold: 5
initialDelaySeconds: 5
periodSeconds: 20
timeoutSeconds: 10
readinessProbe:
httpGet:
path: /ping
port: http
initialDelaySeconds: 5
periodSeconds: 20
timeoutSeconds: 10
Running on Kubernetes v1.16.7 on AWS, deployed via kops
I appreciate the extra reports. It sounds like something is really weird. I'll repeat myself from above:
I'm afraid the only way to know more is to get something like a tcpdump from both inside and outside the pod, which captures one or more failing requests.
Is that possible? Without that I am flying very blind. I don't see this experimentally and I'm not flooded with reports of this, so it's going to be difficult to pin down. If you say you see it repeatedly, please try to capture a pcap?
@thockin I'll try to get a dump if I'm able to replicate this issue consistently, since it tends to happen randomly. Just to clarify, what exactly did you mean by tcpdump outside the pod? Did you mean tcpdump of the node where the pod resides?
Sorry, I'm relatively new to this :sweat_smile:
/remove-triage unresolved
We are also seeing this issue: a Django+WSGI+nginx server that works on a lower version of K8s shows this when we run it on Linode's managed Kubernetes service (LKE):
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m45s default-scheduler Successfully assigned be/bigbitbus-stack-64c7b65b97-c2cth to lke6438-8072-5eecc7cc4d98
Normal Pulled 2m45s kubelet, lke6438-8072-5eecc7cc4d98 Container image "##############/bigbitbus/bigbitbus-$$$$$$-server:###########" already present on machine
Normal Created 2m44s kubelet, lke6438-8072-5eecc7cc4d98 Created container bbbchart
Normal Started 2m44s kubelet, lke6438-8072-5eecc7cc4d98 Started container bbbchart
Warning Unhealthy 9s kubelet, lke6438-8072-5eecc7cc4d98 Liveness probe failed: Get http://10.2.1.32:80/api/jindahoon/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
UPD: So since my last comment, I've updated timeouts to something more realistic. e.g.:
livenessProbe:
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
successThreshold: 1
httpGet:
path: /health/ping
port: 80
Everything was quiet since then. But the day before yesterday and yesterday, my prometheus-operator Alertmanager and Slack exploded. And today, by magic, it is silent again; not a single alert.
My app is a usual PHP+Nginx pod. Staging environments are deployed on 3 huge boxes: just a dozen preview deploys, each deployed once and then opened by devs only once. No load at all. For example, the app had 10 staging deploys (app-staging1..10), and all containers of exactly that one app were failing all at once. I repeat: virtually unrelated deployments spread across 3 nodes. In the pods, only the Nginx containers were restarting. In the logs, I was able to catch that Nginx was actually answering 499 to the liveness probe. Nothing else. Still investigating.
Meanwhile, ~5 other PHP app deploys on the same staging nodes never restart.
I am also facing the same issue. Running on Kubernetes v1.16.8 on AWS EKS. Any workarounds?
I am also facing the same issue with AKS 1.16.9. This is occurring consistently. Is there any way to investigate the issue?
@zaabskr
Looks like this error message is a symptom of a more general problem of node-to-pod connectivity issues.
In our case, we confirmed that by using traceroute
from the node to the pod's internal IP as suggested in https://discuss.kubernetes.io/t/who-where-actually-work-liveness-probe-in-kubernetes/4771/6. That IP was perfectly routable from the other pods and even from many other nodes, but not from that particular node it ended up on. The root cause turned out to be that the br-int (bridge) interface that connects the pod networks and the NSX overlay logical switch was failing to come up because of missing static routes. Fixing that magically solved the problem.
In other cases, the root cause may be different, but the first step would be to confirm whether the node can actually access that pod, e.g. ssh and traceroute or curl into the readiness probe (curl -v http://172.70.3.3:15020/healthz/ready
when Istio sidecar containers fail to get ready), etc.
Experiencing this issue in multiple GKE clusters. It just started about 5 days ago. Right now I'm having to drain the bad node and remove it from the pool. Running 1.16.9-gke.6.
My issue was a REST API dependency checked by the readinessProbe. Doubling the readiness timeout helped.
What happened: After updating EKS from v1.14.9-eks-f459c0 to v1.15.11-eks-065dce, some of our ReplicaSets' and DaemonSets' liveness/readiness probes started failing with Liveness probe failed: Get http://xxx.xxx.xxx.xxx:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
It seems that random pods are affected. I can successfully curl to them from other pods.
All nodes and pods are fine, not CPU/Memory/Storage issues.
Two nodes are running 1.15 and the rest 1.14. There doesn't seem to be a correlation between node version and the issue.
Nothing obvious in kubectl cluster-info dump, journalctl -r, or kube-proxy.log.
Probes are mostly configured like this (truncated):
livenessProbe:
failureThreshold: 3
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
readinessProbe:
failureThreshold: 3
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
I will increase the timeouts but it would be good to solve this. What changed in 1.15?
What you expected to happen: Liveness/Readiness probes not to fail.
How to reproduce it (as minimally and precisely as possible): Upgrade EKS 1.14 to 1.15
Environment:
Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-16T00:04:31Z", GoVersion:"go1.14.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.11-eks-065dce", GitCommit:"065dcecfcd2a91bd68a17ee0b5e895088430bd05", GitTreeState:"clean", BuildDate:"2020-07-16T01:44:47Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}
Cloud provider or hardware configuration: EKS
OS (e.g. cat /etc/os-release):
1.14:
ami-048d37e92ce89022e amazon-eks-node-1.14-v20200507
$ uname -a
Linux ip-xxxxxxx.xxxxxx.compute.internal 4.14.177-139.253.amzn2.x86_64 #1 SMP Wed Apr 29 09:56:20 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
1.15
ami-0c42d7bd0e31ee2fe amazon-eks-node-1.15-v20200814
$ uname -a
Linux ip-xxxxxx.xxxxxx.compute.internal 4.14.186-146.268.amzn2.x86_64 #1 SMP Tue Jul 14 18:16:52 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
> @zaabskr Looks like this error message is a symptom of a more general problem of node-to-pod connectivity issues. [...]
For those of us running a managed service, this is not something we should ever have to deal with. What version and where are you running k8s?
I am using AWS EKS 1.16 and have been getting the same problems recently.
Update: After two days of checking everything (including a cluster dump), the issue was with the application/pod itself. The application was failing to respond to liveness/readiness probes (in a specific scenario) and was killed. Initially no obvious reason was found, and it correlated with the EKS upgrade, so we (wrongly) blamed EKS 1.15.
initialDelaySeconds didn't help, as the pod was fine starting up; timeoutSeconds did help.
Things to test:
kubectl exec -it <working_pod_with_curl> -- sh -c 'for run in $(seq 1 10); do sleep 1; echo "run $run: $(curl -s -o /dev/null -w "%{http_code} %{time_total}s" http://<whatever probes are testing>)"; done'
Dealing with the same issue on a GKE cluster, except with a twist. Not to add to the fire, but maybe this helps others having the same issue.
> kubectl version
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.11-gke.5", GitCommit:"baccd25d44f1a0d06ad3190eb508784efbb990a5", GitTreeState:"clean", BuildDate:"2020-06-25T22:55:26Z", GoVersion:"go1.13.9b4", Compiler:"gc", Platform:"linux/amd64"}
I have a Spring Boot app with Actuator, running Actuator on a separate port:
name: gateway
ports:
- containerPort: 8080
name: http-server
protocol: TCP
- containerPort: 8000
name: http-actuator
protocol: TCP
livenessProbe:
failureThreshold: 6
httpGet:
path: /actuator/health
port: http-actuator
scheme: HTTP
initialDelaySeconds: 15
periodSeconds: 5
successThreshold: 1
timeoutSeconds: 1
readinessProbe:
failureThreshold: 6
httpGet:
path: /actuator/health
port: http-actuator
scheme: HTTP
initialDelaySeconds: 15
periodSeconds: 5
successThreshold: 1
timeoutSeconds: 1
In this setup, using named container ports, it emits these events:
31s Warning Unhealthy pod/gateway-547c5fb7b-rdtkt Readiness probe failed: Get http://10.52.2.193:8000/actuator/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
31s Warning Unhealthy pod/gateway-547c5fb7b-rdtkt Liveness probe failed: Get http://10.52.2.193:8000/actuator/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
The following works:
kubectl port-forward gateway-547c5fb7b-rdtkt 8000:8000
curl 127.0.0.1:8000/actuator/health
# AND
kubectl exec -t busybox-b879ddf9c-jv4mw -- curl 10.52.2.193:8000/actuator/health
When I removed the named containerPorts from the readiness/liveness probes, it passed the checks:
name: gateway
ports:
- containerPort: 8080
name: http-server
protocol: TCP
livenessProbe:
failureThreshold: 6
httpGet:
path: /actuator/health
port: 8000
scheme: HTTP
initialDelaySeconds: 15
periodSeconds: 5
successThreshold: 1
timeoutSeconds: 1
readinessProbe:
failureThreshold: 6
httpGet:
path: /actuator/health
port: 8000
scheme: HTTP
initialDelaySeconds: 15
periodSeconds: 5
successThreshold: 1
timeoutSeconds: 1
I haven't been able to find logs/records of what endpoint it's calling, but I suspect it's still pod IP + port, but for some reason it works?
I'm also seeing this 3+ times a day after upgrading my cluster to 1.16.11-gke.5 on GKE a few weeks ago. I ignored it for a while, but over the past few days I've been running load tests against my services with k6 and I'm seeing sporadic latency in my cluster. I'm beginning to think it's related.
After a full day spent trying to narrow it down, I can confirm the latency affects any network connection: between any pod and Redis, Cloud SQL, or other pods over HTTP.
I'm still investigating, but I've shut down all external logging (Datadog) and when under moderate load it shows up every 5-6 seconds with a 200-1500ms delay affecting any and all pods.
I can't speak for the loads these other systems are under, but my guess is the health check readiness timeout is a symptom of a more pervasive networking issue.
EDIT: I should add, the readiness timeouts occur when run against my Nginx Ingress which was installed via Helm - not against a pod containing a container I built or configured readiness probes for.
Now facing this mainly for the tiller pod (Helm); all other pods are generally working fine. I shut down all the EKS worker nodes and recreated them, and that seems to have solved the issue. Looks like a Kubernetes 1.16 issue to me.
I can't share a full tcpdump of our internal services, but I have done tcpdumps and have some more to share about this issue. First, I upgraded my GKE cluster to 1.17.9-gke.1503 and upgraded my nginx ingress via Helm (ingress-nginx/ingress-nginx) to chart version ingress-nginx-2.15.0. I'm still seeing occasional timeouts, but I still don't believe this is isolated to the ingress.
When I have the system under moderate load, I can see my service throughput bottoms out occasionally, and it affects every connection in the cluster. When throughput is consistent, my service response time averages ~50ms; when it bottoms out, it jumps to 700+ms.
[Chart: throughput from Service-A to Service-B]
[Chart: throughput from Service-B to its highly available Redis store]
At the time of the capture everything had been cached in Redis, but the same issues are seen when querying the managed database. The throughput drops also line up with TCP DUP ACK's to our external logging platform.
I'm trying to wade through tcpdumps of kube-dns and kube-proxy but they're a lot to look at.
This is our staging cluster, a 4-node n1-standard-1 cluster running in us-central1-c, where I have limited the number of service instances to 1 pod per service. However, the problem is consistent on the beefier production cluster, where all pods' throughput drops at fairly consistent intervals.
We had this same problem; it ended up being due to our pod consuming more CPU than allowed by the pod's resources.limits.cpu. It seemed this caused networking to the pod to be blocked entirely during the high-CPU periods.
@scolbeck we faced that issue, but in our case (k8s 1.19.2, Ubuntu 18.04) we disabled CPU limits completely (removed them from the deployment descriptors, and kubelet is running with --cpu-cfs-quota=false), and that didn't help.
@CharlyRipp do I understand correctly that for you the issue was resolved once you used actual port numbers in the probes instead of named ports?
Hi! I am facing the same issue; I have hit the problem on GKE 1.14 and 1.15. My cluster has two deployments with nginx containers, and both have liveness/readiness probes set. Sometimes all pods of just one of the deployments (six replicas) become unhealthy at the same time.
Liveness probe failed: Get http://10.28.6.19:80/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Readiness probe failed: Get http://10.28.6.19:80/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Is it possible the issue is specific to nginx containers with liveness/readiness probes?
Hi!
Same issue here:
I'm getting "Liveness probe failed: Get XXX net/http: request canceled (Client.Timeout exceeded while awaiting headers)" randomly every 3 or 4 hours for some of our k8s applications (no readiness problems). In our case, we checked that:
It's not disruptive, as you can see, but we get random alerts in our monitoring systems, so it would be nice to find the reason. If I find something more, I will report back.
EDIT: Modified timeoutSeconds on the liveness probe but still getting errors. Still looking...
Same problem here:
Readiness probe failed: Get http://xxxxx:80/healthcheck/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Running the same request via a pod in the same namespace (curl http://xxxxx:80/healthcheck/) returns a 200 response.
EDIT: Even with longer timeouts, behaviour is extremely inconsistent (from no readiness probe failures to several):
initialDelaySeconds: 6
timeoutSeconds: 10
Ran into a similar issue this week after upgrading EKS 1.17 to 1.18, where my Grafana cluster would come only partially up: 2 of the 3 pods would fail liveness or readiness probes with connection refused. I tried many different solutions with no luck.
Eventually, the only thing that seemed to fix the issue was performing a rolling restart of the worker nodes, as suggested by @sownak. Now all the pods come up successfully, even after I manually terminate them.
We too are facing the same issue in our cluster.
Please make sure all your worker nodes can access the internet. I encountered a similar problem.
$ kubectl get node
NAME STATUS ROLES AGE VERSION
k8s-master Ready master 16h v1.18.5
k8s-work1 Ready <none> 16h v1.18.5
k8s-work2 Ready <none> 15h v1.18.5
The reason: k8s-master and k8s-work2 could reach the internet; k8s-work1 had no internet access.
@yuripastushenko Hello, is the problem solved now?
Same issue, sometimes I get this K8S event:
Liveness probe failed: Get http://192.168.87.238:8100/status: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
We're also seeing the same issue: Random failures of the Liveness and Readiness probes (that we are very very sure are not rooted in the application) with net/http: request canceled (Client.Timeout exceeded while awaiting headers)
using Kubernetes 1.15
Managed EKS at AWS: Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.11-eks-065dce", GitCommit:"065dcecfcd2a91bd68a17ee0b5e895088430bd05", GitTreeState:"clean", BuildDate:"2020-07-16T01:44:47Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}
Adding to this: (@cseifert-betterdoc and I both work on the same issue within our K8s/EKS cluster, so the version etc. posted above apply here.)
I think we have a proof that there is a networking issue between the Kubelet and the container.
We have had problems with our cluster: it restarted containers seemingly at random, giving failing health checks as the reason.
Of course, we started to dig into why our container fails (as we all know, "SQL is not broken"). After a lot of deep and thorough digging into metrics and logs, we came to the conclusion that our container (running a Rails-based web app) is just fine.
We concluded that our network setup is somehow flawed (which is strange, given that we use EKS for a reason…), so we switched the health check to a simple exec command to see if this mitigates the problem.
We left the readiness endpoint alone, so this is still using HTTP.
Now, we see readiness probe failures, but no health check failures any more.
In addition, we can see that our app is happily serving HTTP requests, while there are stretches where the readiness endpoint is not called at all.
Our view on this:
This means that the kubelet, for some reason, is not able to send readiness probe HTTP requests to the container for some seconds, and then starts to get them through later.
At the same time, request coming in from an external load balancer are still served.
As far as we understand, the kubelet process runs on the node, so this looks like the networking for node-local requests is unreliable.
This looks like a serious issue to us, as the health checks killed our app without good reason, repeatedly.
I understand that this problem looks a lot like people using K8S for unstable apps and blaming the platform instead of doing their homework; but, as far as I can tell, what we see proves that this is not the case for us and that there is a real problem.
It would be great to get some feedback on this issue -- let me know if there is a misguided thought. This is complex stuff, after all. :)
What happened: In my cluster, the readiness probes sometimes fail, but the application works fine.
What you expected to happen: Successful Readiness Probe.
How to reproduce it (as minimally and precisely as possible): We have a few clusters with different workloads. Only in the cluster with a big number of short-lived pods do we have this issue, and not on all nodes. We can't reproduce this error on our other clusters (which have the same configuration but a different workload). How I found the issue: I deployed a daemonset:
Then I started listening for events on all pods of this daemonset. After some time I received the following events:
Those events were on ~50% of the pods of this daemonset.
On the nodes where the pods with failed probes were placed, I collected the kubelet logs, and there were errors like:
I was thinking that sometimes the nginx in the pod really does respond slowly. To exclude this theory, I created a short script that curls the application in the pod and stores the response time in a file:
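The script itself didn't survive in this report; a minimal sketch of such a loop might look like this (the pod IP and probe path are hypothetical):

```shell
#!/bin/sh
# Hit the pod's probe endpoint in a tight loop and log each total response time
URL="http://10.233.64.10/healthz"   # hypothetical pod IP and probe path
while true; do
  curl -s -o /dev/null --max-time 5 -w '%{time_total}\n' "$URL" >> /tmp/probe-times.log
done
```

Sorting /tmp/probe-times.log afterwards shows the worst-case response time from the node's point of view.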
I ran this script for 30 minutes on the node where the pod was placed, and got the following:
There were 482670 measurements, and the longest response time was 0.007s.
In the pod's logs there are only messages with response code 200 (from my requests and from the readiness probe requests), which means that at least part of the probes are successful.
Then I stopped the curl script (because of the large volume of logs) and waited until a new failed-probe error appeared in the kubelet logs.
In the logs of that nginx pod, I did not find the request that should have been generated by this probe.
Restarting the kubelet does not make the error disappear. Does anyone have any suggestions about this?
Environment:
OS (cat /etc/os-release): Ubuntu 16.04
Kernel (uname -a): 4.15.0-66-generic
Seems like the problem appears in many different installations: https://github.com/kubernetes/kubernetes/issues/51096
/sig network