moudsen / mailGraph

Zabbix Media module and scripts for sending templated e-mail alerts enriched with multiple configurable graphs and associated event information
MIT License

Aggregation #27

Open splitice opened 2 years ago

splitice commented 2 years ago

Other than graphs there is one other big gap in Zabbix notification capability.

Namely, the ability to aggregate emails when a single issue triggers more than one alert. A common example is the monitoring server having a connectivity interruption and raising ICMP packet loss alerts for every host.

It would be nice if this could also be used to solve that problem, effectively turning it into a full-fledged notification processor :)

moudsen commented 2 years ago

What would aggregation look like? Suppression of mails (by concluding that a major disruptive event has happened)? Zabbix includes logic called "dependencies" that should prevent huge triggering events and only send a simple "Host down" message. Can you give me a better description from a behavioural point of view so I can better understand how that would/should work?

splitice commented 2 years ago

The dependencies feature is well understood by us and by many of those on the forums who have run into issues in this area.

While it's possible a calculated item from the Zabbix server host mirrored to each host could work, it's crude and ugly.

A simple implementation on the mailGraph side would be to define a trigger that, when active, triggers aggregation of events. The trigger that is checked for aggregation could then be on the Zabbix Server host, something like "ZabbixServer: Is healthy".
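To make the proposal concrete, here is a minimal sketch of such a gate in Python. It is not mailGraph code: the `Event` and `AggregationGate` names, and the idea of feeding the gate trigger's state in through `set_gate`, are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    host: str
    trigger_name: str


@dataclass
class AggregationGate:
    """Buffers events while a designated gate trigger is in PROBLEM state.

    Hypothetical helper: the gate trigger would be something like
    "ZabbixServer: Is healthy" on the Zabbix Server host.
    """
    gate_in_problem: bool = False
    buffered: list = field(default_factory=list)

    def handle(self, event):
        """Return the events to mail right now (empty while aggregating)."""
        if self.gate_in_problem:
            self.buffered.append(event)
            return []
        return [event]

    def set_gate(self, in_problem):
        """Update the gate state; on recovery, release the buffer as one batch."""
        self.gate_in_problem = in_problem
        if not in_problem:
            batch, self.buffered = self.buffered, []
            return batch
        return []
```

While the gate trigger is active, every incoming event is held back; when it recovers, the buffered events come out together so they could go into a single mail.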

moudsen commented 2 years ago

Just looking back at your idea on this: I have already added the ability to include a collection of graphs in the same mail ... Take a look at https://github.com/moudsen/mailGraph/wiki/3--Zabbix-Tags-and-Macros, specifically at mailGraph.screen for an Item or its associated host.

The logic behind it is that a Screen is a logical combination of graphs that you would use for troubleshooting. Adding it to the same mail makes troubleshooting easier (at least it does for me ...). Its trigger will look at the item, find the "mailGraph.screen" tag and pick up the graphs associated with screen X.

Is that close to your request or would that need further enhancements? In the automation part I'm looking at "pick your screen" options through the Zabbix API to make configuration easier.

splitice commented 2 years ago

After re-reading the Zabbix roadmap I wonder if Root Cause Analysis could provide something in this area.

What I would suggest is simply $trigger_ids = array('regex' => [triggerId]): if all trigger IDs whose trigger name matches the regex are in a problem state, new emails are not sent. Instead, send an email announcing that "too many" alerts are being generated. When the [triggerIds] go OK, send an email with the aggregated graphs.
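A rough sketch of that rule in Python, under stated assumptions: the function names, the action labels, and the `was_suppressing` flag are all hypothetical, and the choice to keep suppressing until every matched trigger is OK again is one possible reading of the suggestion, not something the thread settles.

```python
import re


def matching_trigger_ids(pattern, trigger_names):
    """Return the ids whose trigger name matches the regex.

    trigger_names: dict mapping trigger id -> trigger name (assumed input).
    """
    rx = re.compile(pattern)
    return [tid for tid, name in trigger_names.items() if rx.search(name)]


def decide(pattern, trigger_names, problem_ids, was_suppressing):
    """Pick an action for the next mail, given which triggers are in PROBLEM."""
    ids = matching_trigger_ids(pattern, trigger_names)
    all_problem = bool(ids) and all(tid in problem_ids for tid in ids)
    all_ok = not any(tid in problem_ids for tid in ids)

    if all_problem:
        # Announce once, then stay quiet while the storm lasts.
        return "suppress" if was_suppressing else "announce_too_many"
    if was_suppressing and all_ok:
        # Every matched trigger recovered: send the aggregated mail.
        return "send_aggregated"
    # Assumption: partial recovery keeps suppression on.
    return "suppress" if was_suppressing else "send_normally"
```

Note that `was_suppressing` has to be remembered between invocations, which is exactly where the lack of a database starts to hurt.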

Unfortunately the more I think about it the more complex it seems (without a database).

moudsen commented 2 years ago

I agree with the "more complex" statement. However, I have a similar need to suppress / collate messages in one of my production-level environments, so I may be able to address this in a different way.

What if you could attach a flag to a trigger that says "Collate next messages after this message into bundles of X minutes/hours" and if the concerning trigger is gone, normal operations continue? I can also do a "Suppress next messages after this message for X minutes/hours".
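The two behaviours could be sketched roughly as follows in Python. Everything here is an assumption for illustration, not a description of how mailGraph works or will work: the `Collator` class, its `mode` values, and the simplification that a bundle is only flushed when the next message arrives (a real implementation would presumably need a timer or scheduled job).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Collator:
    """Collates or suppresses messages while a flagged trigger is active."""
    window_seconds: float
    mode: str = "collate"  # hypothetical modes: "collate" or "suppress"
    window_start: Optional[float] = None
    pending: list = field(default_factory=list)

    def activate(self, now):
        """Call after the flagged trigger's own message has been sent."""
        self.window_start = now

    def deactivate(self):
        """Flagged trigger is gone: flush what is pending, resume normal mail."""
        self.window_start = None
        bundle, self.pending = self.pending, []
        return bundle

    def handle(self, message, now):
        """Return the messages to send now (empty while held back)."""
        if self.window_start is None:
            return [message]  # normal operations
        elapsed = now - self.window_start
        if self.mode == "suppress":
            # Drop messages inside the window, pass them through afterwards.
            return [] if elapsed < self.window_seconds else [message]
        self.pending.append(message)
        if elapsed >= self.window_seconds:
            # Window is over: send the bundle and start the next window.
            bundle, self.pending = self.pending, []
            self.window_start = now
            return bundle
        return []
```

With `window_seconds=600`, messages arriving within ten minutes of the flagged trigger would be held and then sent as one bundle, while `mode="suppress"` would discard them instead.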

Would this work for you?

Next could be if "triggers A, B and C happen, collate or suppress for time X minutes/hours", but that could defeat the enhanced definition of triggers where "dependencies" are used as part of the Zabbix functionality ...

moudsen commented 1 year ago

I am taking the concept of combining messages forward into a future release, since further logic and reasoning is required for this to work properly and a single message could easily grow into a very big one.

The concept of blocking consecutive messages for a certain period of time is something I will take forward on shorter notice.