louislam / uptime-kuma

A fancy self-hosted monitoring tool
https://uptime.kuma.pet
MIT License
52.83k stars 4.77k forks

Send ONE alert when many or all fail. or set one alert if many or all fail #4779

Closed DEKELP420 closed 1 month ago

DEKELP420 commented 1 month ago

📑 I have found these related issues/pull requests

this part is good

🏷️ Feature Request Type

Settings

🔖 Feature description

For example, I have hundreds of monitors. Most are set to send an SMS/text on failure to multiple people. So if, say, the internet goes down, almost all of the monitors will fail and try to send SMS alerts to 10 or more people. If 2800 monitors out of 3230 each send SMS alerts to 10 people, that is 28,000 SMS messages at a cost of 2 cents each. It's not just the cost; it's the "cry wolf" issue I want to avoid. If someone gets 2800 SMS alerts all at once, they'll ignore them as "just crying wolf".

✔️ Solution

The ability to set a limit on notifications, or a way to "group" monitors so that if XX fail at the same time, only ONE notification is sent. Also check the internet connection (maybe ping a public address); if the internet is down, only monitors with "internet" enabled will send notifications. So if 2800 monitors all fail AND the internet connection check also fails, those notifications will be sent as one alert or not at all.
Or maybe you all already have a solution on the way?

❓ Alternatives

No response

📝 Additional Context

Basically, when XX or more monitors of a specific group fail within XX seconds of each other, send only one notification. (XX = a user-defined value)
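To make the request concrete, here is a minimal sketch of that rule in Python. It is not how Uptime Kuma works and not an existing setting; the class `AlertCoalescer`, its methods, and the message wording are all hypothetical. It assumes failures are buffered per group and only sent once the window measured from the first failure has elapsed.

```python
from dataclasses import dataclass, field


@dataclass
class AlertCoalescer:
    """Hypothetical helper: buffer failures per group, then send one summary
    if `threshold` or more arrived within `window_seconds` of the first one."""

    threshold: int          # first "XX": failures needed to count as a burst
    window_seconds: float   # second "XX": length of the collection window
    # group name -> (time of first failure, names of failed monitors)
    _pending: dict = field(default_factory=dict)

    def record_failure(self, group: str, monitor: str, now: float) -> None:
        """Remember a failure instead of notifying right away."""
        _first_seen, monitors = self._pending.setdefault(group, (now, []))
        monitors.append(monitor)

    def flush(self, now: float) -> list[str]:
        """Return the messages that are due; groups whose window is still open wait."""
        messages = []
        for group in list(self._pending):
            first_seen, monitors = self._pending[group]
            if now - first_seen < self.window_seconds:
                continue  # still collecting failures for this group
            del self._pending[group]
            if len(monitors) >= self.threshold:
                # Burst: one summary instead of one message per monitor.
                messages.append(
                    f"[{group}] {len(monitors)} monitors went down "
                    f"within {self.window_seconds:g}s"
                )
            else:
                # Below the threshold: notify per monitor as usual.
                messages.extend(f"[{group}] {m} is down" for m in monitors)
        return messages
```

The trade-off in this sketch is latency: every alert is delayed by up to `window_seconds`, in exchange for never sending the first few messages of a burst individually.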

CommanderStorm commented 1 month ago

" Cry wolf" issue

Does not localise well. What do you mean here?

Send alert if ANY fail

Please use a group and only notify if the group goes down.

Send alert if MANY/ALL fail

Tracked in #3387 and related issues. Requires #4395 to be merged first.

DEKELP420 commented 1 month ago

thanks for the reply. I should have been more clear.

  1. "Cry wolf" is a term used to describe constant false alarms, to the point that those tasked with listening for the alarm ignore future alerts. It's based on a kids' story where a kid yells "WOLF!!" because she likes how everybody comes running to save her. So all day she yells out "WOLF!!", and all day the townspeople run to save her. Then, late that night, she is approached by a REAL wolf! A big, scary, hungry wolf!! She yells "WOLF!! WOLF!!", yet nobody shows. Not one person in the town responds. They all figure it's a false alarm, that she is calling out "wolf" to mess with them. Her torn-up body is found the next day.

  2. Please use a group and only notify if the group goes down.

Thanks, again I should have explained it better.

I have 3000 monitors that are each set to notify ~10 people via SMS/TXT. So if, say, the internet goes down where my Kuma is, Kuma will "see" all 3000 monitors as down. It will then generate 30,000 SMS messages and send them all once the internet returns. This is a "false" failure, as none of the monitors actually went down (yelling wolf). And the 10 people assigned to get the alerts now have ~3000 SMS messages pounding their phones (45 to 70 min of non-stop texts). Understandably, this would greatly concern them, as it would appear the whole system just turned into scat. (Wolf scat, to be exact :) ). And now 8 to 10 of these techs are in panic mode: running around, calling around, logging in to systems, and calling the CEO and CFO at 3 AM to let them know it's all messed up, EVERYTHING is messed up, and we have to get to the office ASAP. Then we discover it was just an internet hiccup, and everything is fine!

What do you think will happen the next time the internet goes out, or someone unplugs a switch to the Kuma, or any other local issue causes Kuma to send alerts for everything? Eventually, those tasked with getting the SMS/TXT will ignore it. And then comes that one time everyone elects to ignore it as probably a "cry wolf" thing, while the datacenter is being fanned with wolf scat. The actual wolf scat hitting the fan and spraying all over.

What I'm looking for is kind of the opposite of what you suggested. I'm looking for a solution that will remove false alerts. So, let's say if 10 or more concurrent alerts are detected, it could be a false alarm. Or maybe have a way to group multiple alerts? So instead of sending 10 SMS, group them up and send one master alert? Or perhaps a setting that confirms internet or "local up" with a ping, so that if the internet goes out and 3000 monitors also go out (because they can't reach their targets with no internet), the system will not send SMS alerts. Perhaps a second method would be used instead, like email.
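The "confirm the internet first" idea could look roughly like the Python sketch below. Everything here is an assumption for illustration: `tcp_reachable`, `dispatch`, the probe target `1.1.1.1:53`, and the message text are hypothetical, and a TCP connection attempt stands in for a ping because ICMP usually needs elevated privileges. It is not an Uptime Kuma feature.

```python
import socket
from typing import Callable


def tcp_reachable(host: str = "1.1.1.1", port: int = 53, timeout: float = 3.0) -> bool:
    """Stand-in for a ping: can we open a TCP connection to a public host?
    The default host and port are assumptions, not a recommendation."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def dispatch(
    failed_monitors: list[str],
    uplink_ok: Callable[[], bool],
    send_sms: Callable[[str], None],
    send_email: Callable[[str], None],
) -> str:
    """Send per-monitor SMS only when the local uplink check passes.
    Otherwise send one message on the fallback channel. Returns the channel used."""
    if not failed_monitors:
        return "none"
    if uplink_ok():
        # Uplink is fine, so the failures are probably real: alert as usual.
        for name in failed_monitors:
            send_sms(f"{name} is down")
        return "sms"
    # Uplink is down: the failures are likely a local artifact, so suppress SMS.
    send_email(
        f"Uplink check failed; SMS suppressed for {len(failed_monitors)} monitors"
    )
    return "email"
```

In real use `uplink_ok` would be `tcp_reachable`, and the fallback message would have to be queued until connectivity returns, since email cannot leave the host either while the uplink is down.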

I know this is a long read, sorry. I'm just trying to figure out how to best explain the issue. And I don't have a solid solution. :)

CommanderStorm commented 1 month ago

I think for your use case, putting everything into one group and only notifying for the group is reasonable. This way, you will get a notification if something in the group goes down. I would suggest dialing in the retry settings until the false positives are acceptable.

What you are requesting is tracked in these issues (you can subscribe to them or ):

As said above, work in this area needs previous work to finish

=> closing as a duplicate, let's continue any discussions there ^^

[!NOTE] I am also going to question your core problem (sounds like an XY problem to me).. So you don't have an escalation policy set up and just ping all 10 engineers..

=> would setting up an escalation policy via Grafana OnCall or similar not be a better choice? You could then escalate based on your business needs. Calling the CEO and CFO at 3AM can also be done for critical outages or security incidents as a last step in the policy...

[!TIP] About your 3k monitors: Please be aware of the performance issues in v1. See #4500 for further context. Sharding might be necessary.