grafana / oncall

Developer-friendly incident response with brilliant Slack integration
GNU Affero General Public License v3.0
3.54k stars, 294 forks

Silences Not Working For Flapping Alerts #2018

Open kadaj666 opened 1 year ago

kadaj666 commented 1 year ago

I am encountering an issue with the silencing functionality in the oncall application. It seems that the silences do not work effectively for flapping alerts, i.e., the alerts that frequently change state. Despite setting these alerts to "silence", they continue to generate notifications, which leads to unnecessary noise and disturbance.

Steps to reproduce:

  1. Identify a flapping alert (an alert that rapidly switches between states).
  2. Attempt to silence this alert.
  3. The alert state switches to resolved.
  4. The silence is automatically removed for this alert.
  5. A new alert arrives for the same group and labels.

Suggestions:

  1. Once an alert is silenced, keep state until the silence period is over or the silence is manually removed.
  2. Use an external silence. For example, Grafana alert payloads include a silenceURL with all the parameters required to create a silence.

As a workaround I can add this link to the template, but it's not very convenient: you need to open a browser, log in, and then set the silence, which defeats the purpose of the silence buttons in the messenger.

Example of an external silence URL:

"silenceURL": "https://your.grafana/alerting/silence/new?alertmanager=grafana&matcher=alertname%3DBlackboxExporter+availability&matcher=grafana_folder%3DRules&matcher=instance%3Dhttps%3A%2F%2Fsome.domain.com%2Fstatus&matcher=job%3DBlackbox_HTTPS&matcher=project%3DTest&matcher=service%3DBlackbox_HTTPS&matcher=severity%3Dwarning",

You would only need to add the silence duration and submit the silence.
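
To illustrate the idea, the matchers in a silenceURL like the one above can be extracted with the Python standard library and combined with a duration. This is only a minimal sketch: the function names are hypothetical, the URL is a shortened version of the example, the code assumes every matcher is a plain `name=value` equality, and the payload shape (`matchers`, `startsAt`, `endsAt`, `createdBy`, `comment`) is assumed to follow the Alertmanager v2 silence format. Nothing in this thread confirms that OnCall can submit such a payload on the user's behalf.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlsplit, parse_qs

# Shortened version of the silenceURL from the alert payload above.
SILENCE_URL = (
    "https://your.grafana/alerting/silence/new?alertmanager=grafana"
    "&matcher=alertname%3DBlackboxExporter+availability"
    "&matcher=severity%3Dwarning"
)


def matchers_from_silence_url(silence_url):
    """Extract label matchers from the repeated `matcher` query parameter.

    Assumption: every matcher is a plain equality of the form name=value.
    """
    query = parse_qs(urlsplit(silence_url).query)
    matchers = []
    for raw in query.get("matcher", []):
        # Split on the first "=" only, since values may contain "=" themselves.
        name, _, value = raw.partition("=")
        matchers.append({"name": name, "value": value, "isRegex": False})
    return matchers


def build_silence_payload(silence_url, duration, created_by, now=None):
    """Build a silence body (assumed Alertmanager v2 shape) for a duration."""
    now = now or datetime.now(timezone.utc)
    return {
        "matchers": matchers_from_silence_url(silence_url),
        "startsAt": now.isoformat(),
        "endsAt": (now + duration).isoformat(),
        "createdBy": created_by,
        "comment": "Silenced from chat",  # hypothetical comment text
    }


payload = build_silence_payload(SILENCE_URL, timedelta(hours=1), "kadaj666")
```

With something like this, the only extra input a chat button would need to collect is the duration; the matchers already come from the alert payload.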


Matvey-Kuk commented 1 year ago

This is about Grafana Alerting, not Grafana OnCall. I suggest opening an issue in the grafana repo.

kadaj666 commented 1 year ago

@Matvey-Kuk sorry, maybe I misunderstood the problem, but in Grafana alerts silences work correctly. The problem is in Grafana OnCall; here is how to reproduce it:

  1. Autoresolve is enabled
  2. Receive an alert that flaps, let's say on CPU load
  3. Set a silence in OnCall for 1h
  4. After 5 minutes the CPU metric is back to normal, the alert is set as resolved, and the silence in OnCall for some reason disappears
  5. Receive the alert again as firing and get notified again, and so on in a loop until the alert stops flapping

Konstantinov-Innokentii commented 1 year ago

Reopening, because I had the same feedback from @colega. I think it is worth discussing how we can improve this system. In Alertmanager, a silence also silences resolve signals, which prevents such flapping alerts. Should we have the same behaviour for OnCall?
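
The behaviour under discussion can be sketched with a toy model in which a resolve signal leaves the silence untouched, so a re-firing alert stays quiet until the silence window ends. This is a hypothetical illustration only; the class and method names are made up and do not reflect OnCall's actual data model or code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class AlertGroup:
    """Hypothetical toy model of an alert group, not OnCall's real one."""

    state: str = "firing"
    silenced_until: Optional[datetime] = None

    def silence(self, now: datetime, duration: timedelta) -> None:
        self.silenced_until = now + duration

    def is_silenced(self, now: datetime) -> bool:
        return self.silenced_until is not None and now < self.silenced_until

    def on_resolve(self, now: datetime) -> None:
        # Proposed behaviour: resolving does NOT clear the silence.
        self.state = "resolved"

    def on_firing(self, now: datetime) -> bool:
        """Record the firing signal; return True if a notification is due."""
        self.state = "firing"
        return not self.is_silenced(now)


t0 = datetime(2024, 1, 1, 12, 0)
group = AlertGroup()
group.silence(t0, timedelta(hours=1))                        # silence for 1h
group.on_resolve(t0 + timedelta(minutes=5))                  # alert flaps to resolved
notify_early = group.on_firing(t0 + timedelta(minutes=10))   # still inside the silence
notify_late = group.on_firing(t0 + timedelta(hours=2))       # silence has expired
```

In this sketch the silence is tied to the alert group and a time window rather than to a single firing episode, which is the difference from the behaviour reported above.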

Alan01252 commented 1 year ago

I think I'm experiencing the same issue.

image

Each of those "Resolved" alerts should have been silenced for 24 hours rather than resolving themselves and reopening.

FYI, I'm also using the cloud offering. My workflow was to see the alert in Slack, realise it was for something the customer is currently working on, and select 24 hours from the drop-down. I then got surprised when I was woken up during the night ;)