Open PhantomPhreak opened 1 year ago
As far as I remember, @iskhakov had a strong opinion about why /health (https://github.com/grafana/oncall/blob/dev/engine/engine/views.py#L16) shouldn't check connectivity with other services. I don't remember the details, but the implementation we had before did a connectivity check with RabbitMQ. To me, it now sounds like the earlier behavior was more correct.
Based on the comment, I guess it was made for the liveness/readiness probe, to decouple the oncall engine from the services it depends on and to avoid oncall's pod being restarted during the initialization step.
Oncall is part of our alert delivery pipeline; if it's dead, we're blind. To avoid this, we have a cross-check between Oncall and our monitoring engine (CheckMK), so that we can send an alert if Oncall is dead.
Recently we had a situation where Oncall's underlying services failed (see the issue description or https://github.com/grafana/oncall/issues/800), but our monitoring was silent because the /health endpoint was returning OK.
It would be nice to have an oncall healthcheck where OK actually means "everything is working fine, Oncall is ready to receive and process alerts and issue notifications", in other words, it's fully functional.
Grafana has 2 different healthcheck endpoints, /health and /healthz; the latter covers the case described in https://github.com/grafana/oncall/blob/dev/engine/engine/views.py#L16
Maybe it would be reasonable to apply similar logic to Oncall as well.
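To make the idea concrete, here is a rough sketch of how the two kinds of check could be separated. It is framework-agnostic Python, not Oncall's actual code: the function names `liveness` and `readiness`, the check names, and the response shape are all assumptions made for illustration.

```python
from typing import Callable, Dict, Tuple

# Hypothetical dependency check: returns None on success, raises on failure.
Check = Callable[[], None]


def liveness() -> Tuple[int, dict]:
    """Report that the process is up; deliberately ignores dependencies,
    so a probe using it never restarts the pod because a backend is down."""
    return 200, {"status": "ok"}


def readiness(checks: Dict[str, Check]) -> Tuple[int, dict]:
    """Run every dependency check and return 503 if any of them fails."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:  # any failure marks the dependency as down
            results[name] = f"failed: {exc}"
    healthy = all(value == "ok" for value in results.values())
    status = 200 if healthy else 503
    return status, {"status": "ok" if healthy else "unavailable", "checks": results}


def example_failing_check() -> None:
    """Stand-in for a real database ping (assumption, for demonstration only)."""
    raise ConnectionError("MySQL unreachable")


if __name__ == "__main__":
    print(liveness())
    print(readiness({"database": example_failing_check, "broker": lambda: None}))
```

With something like this, the liveness result could keep serving the pod probes, while external monitoring (CheckMK in our case) would poll the readiness result and alert on anything other than 200.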
Thanks!
Recently we had a problem with MySQL availability and got the following errors in the oncall-engine log:
The Oncall plugin in the Grafana WebUI reported the following error:
As shown above, the /health/ endpoint responded with HTTP/200 while the integration healthcheck was responding with HTTP/500 and the WebUI didn't work.
The /health/ endpoint is very useful for checking backend availability, but currently it doesn't show a problem when one of the components is unavailable.