Open tyates-indeed opened 3 years ago
Heads up @davejrt @ggilmore @daxmc99 @dan-mckean - the "team/distribution" label was applied to this issue.
Heads up @davejrt @ggilmore @dan-mckean @caugustus-sourcegraph @stephanx - the "team/delivery" label was applied to this issue.
My thoughts on this:
`warning` alerts are worth subscribing to as well: something could be wrong with Sourcegraph. We suggest checking in on these periodically, or using a notification channel that will not bother anyone if it is spammed. Over time, as warning alerts become stable and reliable across many Sourcegraph deployments, they will also be promoted to critical alerts in an update by Sourcegraph.
I think we should decide whether to add critical alerts on a case-by-case basis, and not require critical alerts everywhere. For example, gitserver clone queues growing doesn't strictly stop Sourcegraph from working - is that really a `critical` alert that needs to be addressed right now? I believe in this case the choice to have just a `warning` is pretty valid: the queue's size could point to a larger issue and should be investigated if the alert persists, but it is not causing an immediate problem. It could just as likely mean that the deployment has a huge number of repositories, which will always cause gitserver to have a backlog.
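To make the case-by-case idea concrete, here is a rough sketch of what an optional critical level could look like. Everything in it is hypothetical: the `Alert` class, the `level_for` and `route` helpers, and the channel names are made up for illustration and are not Sourcegraph's actual alert definitions or Grafana's API. The only values taken from this issue are the alert name and its 25+ threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert definition; not Sourcegraph's actual schema."""
    name: str
    warning_at: float                    # value at which the warning level fires
    critical_at: Optional[float] = None  # None means no critical level is defined


def level_for(alert: Alert, value: float) -> Optional[str]:
    """Return the highest level that fires for the observed value, or None."""
    if alert.critical_at is not None and value >= alert.critical_at:
        return "critical"
    if value >= alert.warning_at:
        return "warning"
    return None


# Hypothetical routing: critical goes to on-call, warning to a low-noise channel.
CHANNELS = {"critical": "#oncall", "warning": "#sourcegraph-warnings"}


def route(alert: Alert, value: float) -> Optional[str]:
    """Return the channel to notify, or None when the alert is not firing."""
    level = level_for(alert, value)
    return CHANNELS[level] if level else None


# Warning-only alert, as described in this issue (threshold taken from "25+").
clone_queue = Alert(name="gitserver: 25+ repository clone queue size", warning_at=25)
```

With this shape, a warning-only alert such as `clone_queue` still reaches someone as long as the warning level is routed to a channel of its own, and a critical threshold only gets added to alerts where it is actually meaningful.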
Feature request description
Grafana contains many alerts across the various components of Sourcegraph. These alerts come in two levels: `warning` and `critical`. Many alerts only have a `warning` level but no `critical` level. This makes it confusing for teams that manage Sourcegraph to understand the state of their instance. Every alert with a `warning` level should also have a corresponding `critical` level.
Is your feature request related to a problem? If so, please describe.
Due to noise in alerting, we only alerted on Slack when a `critical` alert fired. One alert, `gitserver: 25+ repository clone queue size`, only has a `warning` level and no `critical` level. This meant that the queue grew unbounded and we were not notified of the problem on Slack.
Describe alternatives you've considered.
For every alert in Grafana, ensure that both a `warning` and a `critical` level are present.
Additional context
None