grafana / oncall

Developer-friendly incident response with brilliant Slack integration
GNU Affero General Public License v3.0

/health endpoint doesn't show actual backend health #946

Open PhantomPhreak opened 1 year ago

PhantomPhreak commented 1 year ago

Recently we had a problem with MySQL availability and got the following errors in the oncall-engine log:

2022-12-05 05:13:58 source=engine:app google_trace_id=none logger=root inbound latency=0.00108 status=200 method=GET path=/health/ content-length=0 slow=0 
2022-12-05 05:13:58 source=engine:uwsgi status=200 method=GET path=/health/ latency=0.002326 google_trace_id=- protocol=HTTP/1.1 resp_size=223 req_body_size=0

2022-12-05 05:14:04 source=engine:app google_trace_id=none logger=root Start calculating latency for /integrations/v1/formatted_webhook/w13JXclvKXk1YYxdAe02ercEG/heartbeat/                               
2022-12-05 05:14:04 source=engine:app google_trace_id=none logger=engine.middlewares Cannot connect to database, assuming the request is not banned by default.
2022-12-05 05:14:04 source=engine:app google_trace_id=none logger=apps.integrations.mixins.alert_channel_defining_mixin AlertChannelDefiningMixin started
2022-12-05 05:14:04 source=engine:app google_trace_id=none logger=apps.integrations.mixins.alert_channel_defining_mixin Cannot connect to database, using cache to consume alerts!
2022-12-05 05:14:04 source=engine:app google_trace_id=none logger=django.request Internal Server Error: /integrations/v1/formatted_webhook/w13JXclvKXk1YYxdAe02ercEG/heartbeat/
Traceback (most recent call last):                 
  File "/usr/local/lib/python3.9/site-packages/django/db/models/fields/related_descriptors.py", line 173, in __get__                                                                                       
    rel_obj = self.field.get_cached_value(instance)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/fields/mixins.py", line 15, in get_cached_value                                                                                            
    return instance._state.fields_cache[cache_name]                                                                                                                                                        
KeyError: 'organization'                                                                   

During handling of the above exception, another exception occurred:                            

Traceback (most recent call last):                                                                                                                                                                         
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/base/base.py", line 219, in ensure_connection                                                                                            
    self.connect()                                                                                                                                                                                         
  File "/usr/local/lib/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner      
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/base/base.py", line 200, in connect
    self.connection = self.get_new_connection(conn_params)
  File "/usr/local/lib/python3.9/site-packages/django/utils/asyncio.py", line 33, in inner
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/mysql/base.py", line 234, in get_new_connection
    connection = Database.connect(**conn_params)
  File "/usr/local/lib/python3.9/site-packages/pymysql/connections.py", line 353, in __init__
    self.connect()
  File "/usr/local/lib/python3.9/site-packages/pymysql/connections.py", line 633, in connect
    self._request_authentication()
  File "/usr/local/lib/python3.9/site-packages/pymysql/connections.py", line 907, in _request_authentication
    auth_packet = self._read_packet()
  File "/usr/local/lib/python3.9/site-packages/pymysql/connections.py", line 725, in _read_packet
    packet.raise_for_error()
  File "/usr/local/lib/python3.9/site-packages/pymysql/protocol.py", line 221, in raise_for_error
    err.raise_mysql_exception(self._data)
...
django.db.utils.OperationalError: (1045, "ProxySQL Error: Access denied for user 'oncall'@'<host>' (using password: YES)")
2022-12-05 05:14:04 source=engine:app google_trace_id=none logger=root inbound latency=0.107416 status=500 method=GET path=/integrations/v1/formatted_webhook/w13JXclvKXk1YYxdAe02ercEG/heartbeat/ content-length=0 slow=0 integration_type=formatted_webhook integration_token=<>
2022-12-05 05:14:04 source=engine:uwsgi status=500 method=GET path=/integrations/v1/formatted_webhook/w13JXclvKXk1YYxdAe02ercEG/heartbeat/ latency=0.108391 google_trace_id=- protocol=HTTP/1.1 resp_size=380 req_body_size=0

The OnCall plugin in the Grafana web UI reported the following error:

An unknown error occured when trying to install the plugin. Are you sure that your OnCall API URL, https://<hostname>, is correct? Refresh your page and try again, or try removing your plugin configuration and reconfiguring.

As shown above, the /health/ endpoint responded with HTTP 200 while the integration health check was responding with HTTP 500 and the web UI didn't work.

The /health/ endpoint is very useful for checking backend availability, but currently it doesn't report a problem when one of the components is unavailable.

Matvey-Kuk commented 1 year ago

As far as I remember, @iskhakov had a strong opinion on why /health (https://github.com/grafana/oncall/blob/dev/engine/engine/views.py#L16) shouldn't check connectivity with other services. I don't remember the details; the implementation we had before did a connectivity check with RabbitMQ. To me, the earlier behavior now sounds more correct.

PhantomPhreak commented 1 year ago

Based on the comment, I guess it was made for the liveness/readiness probe, to decouple the OnCall engine from the services it depends on and to avoid OnCall's pod being restarted during initialization.

OnCall is part of our alert delivery pipeline; if it's dead, we're blind. To avoid this, we have a cross-check between OnCall and our monitoring engine (CheckMK), so we can send an alert if OnCall is dead. Recently we had a situation where OnCall's underlying services failed (see the issue description or https://github.com/grafana/oncall/issues/800), but our monitoring was silent because the /health endpoint was returning OK.

It would be nice to have an OnCall health check where OK actually means "everything is working fine, OnCall is ready to receive and process alerts and issue notifications"; in other words, it's fully functional.
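To illustrate the idea, here is a minimal, framework-agnostic sketch of such a "deep" check. It is not OnCall's implementation: `deep_health`, `database_down`, and the check names are hypothetical, and a real database check would run something like `SELECT 1` against MySQL instead of the stand-in below.

```python
from typing import Callable, Dict, Tuple

# Hypothetical helper, not part of OnCall: each check is a zero-argument
# callable that raises an exception when its dependency is unavailable.
Check = Callable[[], None]


def deep_health(checks: Dict[str, Check]) -> Tuple[int, Dict[str, str]]:
    """Run every dependency check and return (HTTP status, per-check report)."""
    report: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            check()
            report[name] = "ok"
        except Exception as exc:  # any failure marks the dependency as down
            report[name] = f"failed: {exc}"
    healthy = all(state == "ok" for state in report.values())
    # 503 tells an external monitor that the service cannot do its job.
    return (200 if healthy else 503), report


def database_down() -> None:
    # Stand-in for a real probe such as running "SELECT 1" against MySQL.
    raise ConnectionError("access denied")


status, report = deep_health({"database": database_down, "broker": lambda: None})
print(status, report)
```

With a check like this behind the endpoint, an external monitor such as our CheckMK cross-check would see a non-200 response as soon as the database stops accepting connections.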

Grafana has 2 different health check endpoints, /health and /healthz; the latter covers the case described in https://github.com/grafana/oncall/blob/dev/engine/engine/views.py#L16

Maybe it would be reasonable to apply similar logic to OnCall as well.
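A rough sketch of what such a split could look like, using only the Python standard library. The path names simply mirror the ones mentioned above; `respond`, `HealthHandler`, and the `dependencies_ok` hook are hypothetical names and are not taken from the OnCall or Grafana code.

```python
from http.server import BaseHTTPRequestHandler
from typing import Callable, Tuple


def respond(path: str, dependencies_ok: Callable[[], bool]) -> Tuple[int, str]:
    """Map a request path to (HTTP status, body) for two kinds of probe."""
    path = path.rstrip("/")
    if path == "/healthz":
        # Shallow probe: the process is up; never touches dependencies.
        return 200, "alive"
    if path == "/health":
        # Deep probe: only OK when the dependencies respond.
        return (200, "ok") if dependencies_ok() else (503, "dependency unavailable")
    return 404, "not found"


class HealthHandler(BaseHTTPRequestHandler):
    # Hypothetical hook; replace with real database and broker probes.
    dependencies_ok = staticmethod(lambda: True)

    def do_GET(self) -> None:
        status, body = respond(self.path, self.dependencies_ok)
        payload = body.encode()
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

The shallow path could stay wired to the liveness probe so a database outage does not restart the pod, while the deep path would be the one that external monitoring polls.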

Thanks!