Closed chibenwa closed 2 days ago
FYI k8s considers a pod unhealthy when it receives a response code >= 400.
James healthcheck response code:
200: All checks have answered with a Healthy or Degraded status. James services can still be used.
503: At least one check has answered with an Unhealthy status.
So Degraded, with its 200 code, won't trigger a k8s pod restart.
Should we then return Unhealthy instead? Not sure, it might be a bit harsh. Anyway, Docker and k8s liveness checks allow a number of failures (failureThreshold
defaults to 3) before restarting, so a bit more resilience under actual high IMAP load may be acceptable.
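To illustrate the failure tolerance mentioned above, here is a sketch of a liveness probe pointing at the James healthcheck endpoint. The path and port are assumptions taken from the examples in this thread, not a verified deployment manifest.

```yaml
# Hypothetical liveness probe sketch; path/port are assumptions.
livenessProbe:
  httpGet:
    path: /healthcheck
    port: 8000
  periodSeconds: 10
  failureThreshold: 3  # the pod restarts only after 3 consecutive failed probes
```

With failureThreshold: 3 and periodSeconds: 10, a transient Unhealthy answer under high IMAP load has roughly 30 seconds to recover before k8s restarts the pod.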
We could add a flag as a query parameter to consider Degraded as failed. E.g.
GET 127.0.0.1:8000/healthcheck/checks/ImapCheck?strict
would return a 503 response code for both Unhealthy and Degraded,
while
GET 127.0.0.1:8000/healthcheck/checks/ImapCheck
would return 503 when Unhealthy and 200 when Degraded.
We would need to implement GET 127.0.0.1:8000/healthcheck?strict
too.
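The status-to-code mapping proposed above could be sketched as follows. This is a minimal illustration, not James' actual code: the enum and class names here are assumptions made for the example.

```java
// Hypothetical sketch of mapping an aggregated healthcheck status to an
// HTTP response code, honoring a ?strict query parameter.
// ResultStatus and HealthCheckCodeMapper are illustrative names, not James' API.
enum ResultStatus { HEALTHY, DEGRADED, UNHEALTHY }

final class HealthCheckCodeMapper {
    static int toHttpCode(ResultStatus status, boolean strict) {
        switch (status) {
            case UNHEALTHY:
                return 503;
            case DEGRADED:
                // In strict mode, Degraded is treated like Unhealthy.
                return strict ? 503 : 200;
            default:
                return 200;
        }
    }
}
```

With this shape, a readiness probe could call the strict variant (take Degraded pods out of rotation) while the liveness probe stays non-strict (avoid restarting a pod that is merely degraded).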
Would this solve your concern @quantranhong1999? We would get the best of both worlds...
Maybe this shall be a separate issue? Do you want to open it @quantranhong1999 ?
Why?
One pod was stuck like this... So traffic was partially degraded (1 pod failing). It does not happen often at all (once in about 3 months).
While we should of course seriously investigate the root cause (https://github.com/linagora/james-project/issues/5246!), the topic is complex and I would like to have an operational alternative to counter this...
The idea: have a healthcheck that would be triggered and that we could aggregate into the liveness probe (cf. https://github.com/linagora/james-project/issues/5244) until we actually fix the issue!
What
Add a healthcheck that, for all IMAP servers, ensures the reactive throttlers are not full.
Not full -> OK
One full -> Degraded
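The proposed check could look something like the sketch below. All names are assumptions for illustration; the real implementation would inspect the actual reactive throttler state of each IMAP server rather than a list of booleans.

```java
import java.util.List;

// Hypothetical sketch of the proposed ImapCheck: report Degraded as soon
// as any IMAP server's reactive throttler is full, Healthy otherwise.
enum HealthStatus { HEALTHY, DEGRADED }

final class ImapThrottlerCheck {
    // Each flag stands in for "is this server's throttler full?".
    static HealthStatus check(List<Boolean> throttlerFullFlags) {
        // One full throttler -> Degraded; none full -> Healthy.
        return throttlerFullFlags.stream().anyMatch(full -> full)
            ? HealthStatus.DEGRADED
            : HealthStatus.HEALTHY;
    }
}
```

Combined with the ?strict flag discussed above, this Degraded result could then either be tolerated (200) or treated as failed (503) depending on the probe calling it.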