c0c0n3 opened this issue 5 years ago
I think it happens when QL becomes unresponsive, and so it's killed by k8s:
```
Warning  Unhealthy  54m (x1061 over 4d21h)    kubelet, ip-172-20-60-68.eu-central-1.compute.internal  Liveness probe failed: Get http://172.20.44.1:8668/v2/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Normal   Pulling    54m (x101 over 4d21h)     kubelet, ip-172-20-60-68.eu-central-1.compute.internal  pulling image "smartsdk/quantumleap:rc"
Normal   Killing    54m (x100 over 4d21h)     kubelet, ip-172-20-60-68.eu-central-1.compute.internal  Killing container with id docker://quantumleap: Container failed liveness probe. Container will be killed and recreated.
Normal   Pulled     53m (x101 over 4d21h)     kubelet, ip-172-20-60-68.eu-central-1.compute.internal  Successfully pulled image "smartsdk/quantumleap:rc"
Normal   Created    53m (x101 over 4d21h)     kubelet, ip-172-20-60-68.eu-central-1.compute.internal  Created container
Normal   Started    53m (x101 over 4d21h)     kubelet, ip-172-20-60-68.eu-central-1.compute.internal  Started container
Warning  Unhealthy  53m (x3 over 4d21h)       kubelet, ip-172-20-60-68.eu-central-1.compute.internal  Liveness probe failed: Get http://172.20.44.1:8668/v2/health: dial tcp 172.20.44.1:8668: connect: connection refused
Warning  Unhealthy  8m50s (x1127 over 4d21h)  kubelet, ip-172-20-60-68.eu-central-1.compute.internal  Readiness probe failed: Get http://172.20.44.1:8668/v2/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
```
I believe this was solved by allowing the Crate cluster to be in a yellow state.
@c0c0n3 We are facing this issue in our Kubernetes deployment of QuantumLeap with the WQ configuration. We have used the liveness probe settings below in both deployment files, quantumleap and quantumleap-wq:
```yaml
livenessProbe:
  failureThreshold: 3
  initialDelaySeconds: 180
  periodSeconds: 60
  successThreshold: 1
  httpGet:
    path: /health
    port: 8668
    scheme: HTTP
  timeoutSeconds: 60
```
Please find our observations below: the health endpoint itself reports `status: pass`, yet the liveness probe fails for quantumleap-wq. We have checked Crate health in our environment, and it is GREEN.
If we remove the livenessProbe from the quantumleap-wq deployment file, then the pod does not restart. Please confirm our understanding: a livenessProbe is not required in the quantumleap-wq deployment file.
@c0c0n3 We have the following observation on why the livenessProbe does not work with the quantumleap-wq deployment file:
Two deployments of QuantumLeap are running in our environment: one for the master and one for the worker.
In the master deployment file we can have a livenessProbe that calls QuantumLeap's health API and restarts the pod if any error occurs.
The worker, however, can only handle the notify API, as mentioned in https://github.com/orchestracities/ngsi-timeseries-api/blob/master/docs/manuals/admin/wq.md.
As per our understanding, the health API cannot be executed on the worker, so the probe gets a connection error and the pod restarts. We can remove the livenessProbe from the worker because the health check is already handled by the master, which checks Crate status.
Please correct our understanding if there is anything we are missing.
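If you still want a liveness signal on the worker pod, one standard Kubernetes technique (not something this thread prescribes) is to replace the HTTP probe with an exec probe that checks the worker process is alive. A minimal sketch, assuming the worker runs as an `rq worker` process inside the container:

```yaml
# Hypothetical sketch: an exec-based liveness probe for the worker pod.
# The /health HTTP endpoint does not exist on workers, so instead we
# check that an RQ worker process is running inside the container.
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - pgrep -f "rq worker" > /dev/null
  initialDelaySeconds: 60
  periodSeconds: 60
  failureThreshold: 3
```

The process name passed to `pgrep` is an assumption; adjust it to match how the worker is actually launched in your image.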
hi @pooja1pathak :-)
> health API of quantumleap cannot be executed on worker quantumleap and it returned connection error and restarts.

Correct. Each Worker process is a standalone RQ instance; there's no QL Web API there:
> We can remove livenessProbe from worker quantumleap
Yes. I suggest you start Workers using Supervisor with our config:
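As a rough illustration of the Supervisor approach, a fragment along these lines could work; the program name, command, and process count are assumptions, not the project's actual config:

```ini
; Hypothetical supervisord fragment for QL queue workers (names, queue,
; and Redis endpoint are assumptions; adapt to your image).
; Supervisor respawns any worker process that dies, which covers the
; role a Kubernetes livenessProbe would otherwise play.
[program:ql-worker]
command=rq worker --url redis://redis:6379 default
numprocs=2
process_name=%(program_name)s_%(process_num)02d
autostart=true
autorestart=true
stopsignal=TERM
```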
That should give you the reliability you're after, I guess.
More about it here:
Hope this helps!
We've been experiencing an unusually high number of restarts in our K8s cluster. For example, in the last 3 days K8s restarted QL 103 and 99 times in the two pods, respectively.