c0c0n3 opened this issue 5 years ago
I think it happens when QL becomes unresponsive, and so it's killed by k8s:
```
Warning  Unhealthy  54m (x1061 over 4d21h)    kubelet, ip-172-20-60-68.eu-central-1.compute.internal  Liveness probe failed: Get http://172.20.44.1:8668/v2/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Normal   Pulling    54m (x101 over 4d21h)     kubelet, ip-172-20-60-68.eu-central-1.compute.internal  pulling image "smartsdk/quantumleap:rc"
Normal   Killing    54m (x100 over 4d21h)     kubelet, ip-172-20-60-68.eu-central-1.compute.internal  Killing container with id docker://quantumleap: Container failed liveness probe. Container will be killed and recreated.
Normal   Pulled     53m (x101 over 4d21h)     kubelet, ip-172-20-60-68.eu-central-1.compute.internal  Successfully pulled image "smartsdk/quantumleap:rc"
Normal   Created    53m (x101 over 4d21h)     kubelet, ip-172-20-60-68.eu-central-1.compute.internal  Created container
Normal   Started    53m (x101 over 4d21h)     kubelet, ip-172-20-60-68.eu-central-1.compute.internal  Started container
Warning  Unhealthy  53m (x3 over 4d21h)       kubelet, ip-172-20-60-68.eu-central-1.compute.internal  Liveness probe failed: Get http://172.20.44.1:8668/v2/health: dial tcp 172.20.44.1:8668: connect: connection refused
Warning  Unhealthy  8m50s (x1127 over 4d21h)  kubelet, ip-172-20-60-68.eu-central-1.compute.internal  Readiness probe failed: Get http://172.20.44.1:8668/v2/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
```
I believe this was solved by allowing the Crate cluster to be in a yellow state.
@c0c0n3 We are facing this issue in our Kubernetes deployment of QuantumLeap with the WQ configuration. We have used the liveness probe settings below in both deployment files, quantumleap and quantumleap-wq:
```yaml
livenessProbe:
  failureThreshold: 3
  initialDelaySeconds: 180
  periodSeconds: 60
  successThreshold: 1
  httpGet:
    path: /health
    port: 8668
    scheme: HTTP
  timeoutSeconds: 60
```
Please find our observations below: the health endpoint itself reports `status: pass`, yet the liveness probe fails for quantumleap-wq. We have checked Crate health in our environment, and it is GREEN.
If we remove the livenessProbe from the quantumleap-wq deployment file, then the pod does not restart. Please confirm our understanding: a livenessProbe is not required in the quantumleap-wq deployment file.
@c0c0n3 We have the following observation on why the livenessProbe does not work with the quantumleap-wq deployment file:
Two deployments of QuantumLeap are running in our environment: one for the master and one for the worker.
In the master deployment file we can have a livenessProbe that calls QuantumLeap's health API and restarts the pod if any error occurs.
The worker, however, can only handle the notify API, as mentioned in https://github.com/orchestracities/ngsi-timeseries-api/blob/master/docs/manuals/admin/wq.md.
As per our understanding, the health API cannot be executed on the worker, so the probe gets a connection error and the pod restarts. We can remove the livenessProbe from the worker because the health check is already handled by the master, which checks Crate status.
Please correct our understanding if there is anything we are missing.
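If you still want a liveness signal on the worker pod, one standard Kubernetes technique (not something this thread prescribes) is to replace the HTTP probe with an exec probe that checks the worker process is alive. A minimal sketch, assuming the worker runs as an `rq worker` process inside the container:

```yaml
# Hypothetical sketch: an exec-based liveness probe for the worker pod.
# The /health HTTP endpoint does not exist on workers, so instead we
# check that an RQ worker process is running inside the container.
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - pgrep -f "rq worker" > /dev/null
  initialDelaySeconds: 60
  periodSeconds: 60
  failureThreshold: 3
```

The process name passed to `pgrep` is an assumption; adjust it to match how the worker is actually launched in your image.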
hi @pooja1pathak :-)
> health API of quantumleap cannot be executed on worker quantumleap and it returned connection error and restarts.

Correct. Each Worker process is a standalone RQ instance; there's no QL Web API there:
> We can remove livenessProbe from worker quantumleap
Yes. I suggest you start Workers using Supervisor with our config:
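As a rough illustration of the Supervisor approach, a fragment along these lines could work; the program name, command, and process count are assumptions, not the project's actual config:

```ini
; Hypothetical supervisord fragment for QL queue workers (names, queue,
; and Redis endpoint are assumptions; adapt to your image).
; Supervisor respawns any worker process that dies, which covers the
; role a Kubernetes livenessProbe would otherwise play.
[program:ql-worker]
command=rq worker --url redis://redis:6379 default
numprocs=2
process_name=%(program_name)s_%(process_num)02d
autostart=true
autorestart=true
stopsignal=TERM
```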
That should give you the reliability you're after, I guess.
More about it here:
Hope this helps!
We've been experiencing an unusually high number of restarts in our K8s cluster. For example, in the last 3 days K8s restarted QL 103 and 99 times in the two pods, respectively.