monitoring: Observables with Warnings should require a Critical alert to be defined as well

tyates-indeed commented 3 years ago

Feature request description

Grafana contains many alerts across the various components of Sourcegraph. These alerts come in two levels: warning and critical. Many alerts define only a warning level and no critical level, which makes it hard for teams that manage Sourcegraph to understand the state of their instance. Every alert with a warning level should also have a corresponding critical level.

Is your feature request related to a problem? If so, please describe.

Because alerting is noisy, we only send Slack notifications when a critical alert fires. One alert, "gitserver: 25+ repository clone queue size", has only a warning level and no critical level. As a result, the queue grew unbounded and we were never notified of the problem on Slack.

Describe alternatives you've considered.

For every alert in Grafana, ensure that both a warning and critical level are present.
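
To make the proposed rule concrete, here is a minimal sketch of a check that flags any observable with a warning threshold but no critical one. The `Observable` class and its fields are hypothetical stand-ins, not Sourcegraph's actual monitoring definitions, and the second entry's name and thresholds are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observable:
    """Hypothetical stand-in for a monitoring observable definition."""
    name: str
    warning: Optional[float] = None   # threshold that fires a warning alert
    critical: Optional[float] = None  # threshold that fires a critical alert


def missing_critical(observables):
    """Return names of observables that define a warning but no critical alert."""
    return [
        o.name for o in observables
        if o.warning is not None and o.critical is None
    ]


observables = [
    # Warning at 25+, as in the alert named in this issue; no critical level.
    Observable("gitserver: repository clone queue size", warning=25),
    # Illustrative entry with both levels (values are assumptions).
    Observable("example: observable with both levels", warning=10, critical=50),
]

print(missing_critical(observables))
```

A check like this could run wherever the alert definitions are generated, so that a warning-only observable is caught before it ships.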

Additional context

None

github-actions[bot] commented 3 years ago

Heads up @davejrt @ggilmore @daxmc99 @dan-mckean - the "team/distribution" label was applied to this issue.

github-actions[bot] commented 2 years ago

Heads up @davejrt @ggilmore @dan-mckean @caugustus-sourcegraph @stephanx - the "team/delivery" label was applied to this issue.

bobheadxi commented 2 years ago

My thoughts on this:

Critical alerts are supposed to be an immediate "something must be done right now" signal. This threshold is very difficult to define in a way that works across all deployment types, all deployment sizes, all combinations of repos, etc. Requiring it might just lead to arbitrarily set thresholds that are not very useful either.
https://docs.sourcegraph.com/admin/observability/alerting notes that warning alerts are worth subscribing to as well:

something could be wrong with Sourcegraph. We suggest checking in on these periodically, or using a notification channel that will not bother anyone if it is spammed. Over time, as warning alerts become stable and reliable across many Sourcegraph deployments, they will also be promoted to critical alerts in an update by Sourcegraph.

I think we should decide whether to add critical alerts on a case-by-case basis, and not require them everywhere. For example, a growing gitserver clone queue doesn't strictly stop Sourcegraph from working - is that really a critical alert that needs to be addressed right now? I believe in this case the choice to have just a warning is pretty valid: the queue's size could point to a larger issue and should be investigated if the alerts persist, but it is not causing an immediate problem. It could just as likely mean, for example, that a deployment has a huge number of repositories, which will always leave gitserver with a backlog.
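
One possible way to express a case-by-case policy, sketched under assumptions: a warning-only observable is accepted as long as someone has recorded why it has no critical level. The `no_critical_reason` field is hypothetical and does not exist in Sourcegraph's monitoring definitions; the second entry is made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observable:
    """Hypothetical stand-in for a monitoring observable definition."""
    name: str
    warning: Optional[float] = None
    critical: Optional[float] = None
    # Hypothetical opt-out: why this observable intentionally has no critical alert.
    no_critical_reason: Optional[str] = None


def unreviewed(observables):
    """Return warning-only observables with neither a critical alert nor a reason."""
    return [
        o.name for o in observables
        if o.warning is not None
        and o.critical is None
        and not o.no_critical_reason
    ]


observables = [
    Observable(
        "gitserver: repository clone queue size",
        warning=25,
        no_critical_reason="a backlog does not by itself stop Sourcegraph from working",
    ),
    # Illustrative entry that nobody has reviewed yet (values are assumptions).
    Observable("example: unreviewed warning-only observable", warning=10),
]

print(unreviewed(observables))
```

This keeps the decision explicit without forcing an arbitrary critical threshold onto every observable.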
