sourcegraph / sourcegraph-public-snapshot

Code AI platform with Code Search & Cody
https://sourcegraph.com

monitoring: Observables with Warnings should require a Critical alert to be defined as well #21891

Open tyates-indeed opened 3 years ago

tyates-indeed commented 3 years ago

Feature request description

Grafana contains many alerts across the various components of Sourcegraph. These alerts come in two levels: warning and critical. Many alerts define only a warning level and no critical level, which makes it hard for teams that manage Sourcegraph to understand the state of their instance. Every alert with a warning level should also have a corresponding critical level.

Is your feature request related to a problem? If so, please describe.

Due to noise in alerting, we only sent Slack notifications when a critical alert fired. One alert, gitserver: 25+ repository clone queue size, has only a warning level and no critical level. As a result, the queue grew unbounded and we were not notified of the problem on Slack.

Describe alternatives you've considered.

For every alert in Grafana, ensure that both a warning and critical level are present.
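
As a rough illustration of the check being requested, here is a minimal sketch in Python. It assumes a simplified, hypothetical `Observable` model with optional `warning` and `critical` thresholds; it is not Sourcegraph's actual monitoring definitions or generator API. The observable names and the disk space thresholds are made up for the example; only the 25+ clone queue warning comes from the report above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Observable:
    """Hypothetical, simplified stand-in for a monitoring observable."""
    name: str
    warning: Optional[str] = None   # threshold expression, e.g. ">= 25"
    critical: Optional[str] = None  # threshold expression, or None if undefined


def missing_critical(observables: List[Observable]) -> List[str]:
    """Return names of observables that define a warning but no critical alert."""
    return [
        o.name
        for o in observables
        if o.warning is not None and o.critical is None
    ]


# Illustrative data only: names and thresholds are assumptions.
observables = [
    Observable("repository_clone_queue_size", warning=">= 25"),
    Observable("disk_space_remaining", warning="< 25%", critical="< 15%"),
    Observable("goroutines"),  # no alerts at all, so nothing to enforce
]

for name in missing_critical(observables):
    print(f"{name}: warning alert has no corresponding critical alert")
```

A check like this could run when dashboards are generated, so that a warning-only alert is caught before release rather than discovered during an incident.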

Additional context

None

github-actions[bot] commented 3 years ago

Heads up @davejrt @ggilmore @daxmc99 @dan-mckean - the "team/distribution" label was applied to this issue.

github-actions[bot] commented 2 years ago

Heads up @davejrt @ggilmore @dan-mckean @caugustus-sourcegraph @stephanx - the "team/delivery" label was applied to this issue.

bobheadxi commented 2 years ago

My thoughts on this:

something could be wrong with Sourcegraph. We suggest checking in on these periodically, or using a notification channel that will not bother anyone if it is spammed. Over time, as warning alerts become stable and reliable across many Sourcegraph deployments, they will also be promoted to critical alerts in an update by Sourcegraph.

I think we should address including critical alerts on a case-by-case basis, and not require critical alerts everywhere. For example, gitserver clone queues growing doesn't strictly stop Sourcegraph from working - is that really a critical alert that needs to be addressed right now? I believe in this case the choice to have just a warning is pretty valid: this queue's size could point to a larger issue, and if the alerts persist it should be investigated, but it is not causing an immediate issue. It could just as likely mean, e.g., that one has a deployment with a huge number of repositories that will always cause gitserver to have a backlog.
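
One way to picture the case-by-case approach is a check that allows a warning-only alert as long as the omission is documented. The sketch below is an assumption for illustration, not an existing Sourcegraph feature: the `AlertSpec` type, the `no_critical_reason` field, and the `check_alert` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertSpec:
    """Hypothetical alert definition with an explicit opt-out from critical alerts."""
    name: str
    warning: Optional[str] = None
    critical: Optional[str] = None
    no_critical_reason: Optional[str] = None  # why a critical level is intentionally absent


def check_alert(spec: AlertSpec) -> Optional[str]:
    """Return an error message for an unexplained warning-only alert, else None."""
    if spec.warning is None or spec.critical is not None:
        return None  # nothing to enforce
    if spec.no_critical_reason:
        return None  # warning-only by deliberate, documented choice
    return f"{spec.name}: define a critical alert or document why none is needed"


# Illustrative data: the reason text paraphrases the argument in this comment.
clone_queue = AlertSpec(
    name="repository_clone_queue_size",
    warning=">= 25",
    no_critical_reason="a backlog can be normal for deployments with many repositories",
)
undocumented = AlertSpec(name="some_other_observable", warning=">= 1")

print(check_alert(clone_queue))
print(check_alert(undocumented))
```

This keeps the decision per observable while still making each warning-only alert an explicit choice rather than an oversight.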