maxwo / snmp_notifier

A webhook to relay Prometheus alerts as SNMP traps, because sometimes, you have to deal with legacy

alert attached to the resolution trigger #206

Open fabienmagagnosc opened 2 months ago

fabienmagagnosc commented 2 months ago

What did you do?

I use two templates to generate two fields, to allow automatic alarm resolution:

- a default one, to provide a status of FAULT or OK:

{{- if .Alerts -}} FAULT {{ else -}} OK {{- end -}}

- another one to provide the alarm information; like any alarming system (including, for example, the Prometheus Alertmanager), the receiving system requires a unique "ID" to match a fault with its resolution:

{{ range $severity, $alerts := (groupAlertsByLabel .Alerts "severity") -}} {{- range $index, $alert := $alerts }} {{ $alert.Labels.severity }};{{ $alert.Labels.instance }};{{ $alert.Labels.job }};{{ $alert.Labels.alertname }};{{ $alert.Annotations.summary }};{{ $alert.Annotations.description }} {{ end }} {{ end }}

In my object, I get a CSV-formatted string with the alertname, the instance, the job, the description, and the summary, so the SNMP alarm system can use alertname+instance to identify the alarm uniquely.
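For illustration, with a single firing alert and hypothetical label values (severity "warning", instance "server01:9100", job "node-exporter", alertname "HighCPU"), the extra-field template above would render roughly:

    warning;server01:9100;node-exporter;HighCPU;CPU over 80% on server01;CPU usage has been above 80% for more than 5 minutes on server01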

What did you expect to see?

The firing and resolved alarms must be nearly identical, with only the status changing between FAULT and OK. The extra field provides, for a firing alarm, the description, summary, and instance to document it, and that same information allows the firing and resolved alarms to be matched automatically.

What did you see instead? Under which circumstances?

When alarms are firing, there is no issue: everything is filled in. When alarms are resolved, the extra field is empty.

Environment

Note: I tested with a modified version, built locally, with line 69 of alert_parser.go removed (and the syntax corrected), and it then worked properly; logically, this means every alarm is treated equally.

snmp_notifier, version 1.5.0 (branch: main, revision: 934455898d4bc190e65aebc1356451196a6ec983)
build user: tecnotree@centos
build date: 20240913-16:08:54
go version: go1.22.5 (Red Hat 1.22.5-2.el9)
platform: linux/amd64
tags: netgo

Version Information
Branch: HEAD
BuildDate: 20240228-11:51:20
BuildUser: root@22cd11f671e9
GoVersion: go1.21.7
Revision: 0aa3c2aad14cff039931923ab16b26b7481783b5
Version: 0.27.0

Not valid, as the alarms are coming from Grafana here

./snmp_notifier --snmp.trap-description-template=description-template.tpl --snmp.extra-field-template=4=object-template.tpl --snmp.version=V2c --snmp.destination=ss-vip:162 --snmp.community=tecnomen --snmp.timeout=5s --web.listen-address=:9465

maxwo commented 1 month ago

Thanks for your detailed message.

If it worked as you expected after your modification of the parser, I suggest you use the .DeclaredAlerts variable in your template, which includes all the alerts, firing or not.

fabienmagagnosc commented 1 month ago

Hi there,

I'm looking at DeclaredAlerts, since your upstream code matters more than my local change, and I'm still not getting any result. Is there a way to get all the information regardless of whether the alert is firing or resolving?

Right now, your code is clear:

alert_parser.go:

    alertGroups[key].DeclaredAlerts = append(alertGroups[key].DeclaredAlerts, alert)
    if alert.Status == "firing" {
        err = alertParser.addAlertToGroup(alertGroups[key], alert)
        if err != nil {
            return nil, err
        }
    }

Only the firing alerts are parsed and completed with the labels, which can then be passed into the SNMP alerts (via a new OID).
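For reference, a rough sketch of the local modification mentioned earlier (not the project's actual code): the firing-only guard is dropped, so resolved alerts are also added to the group and keep their labels and annotations for the templates.

    // Sketch: append every declared alert, then add it to the group
    // regardless of its status, so resolved alerts also carry their
    // labels and annotations into the templates.
    alertGroups[key].DeclaredAlerts = append(alertGroups[key].DeclaredAlerts, alert)
    err = alertParser.addAlertToGroup(alertGroups[key], alert)
    if err != nil {
        return nil, err
    }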

maxwo commented 1 month ago

I'm gonna do some checks, as the default template seems to work well:

{{ len .Alerts }}/{{ len .DeclaredAlerts }} alerts are firing:

And it always displays, for instance, "2/4 alerts are firing".

How about something like:

{{- range .DeclaredAlerts }}
{{- .Labels.severity }};{{ .Status }};{{ .Labels.instance }};{{ .Labels.job }};{{ .Labels.alertname }};{{ .Annotations.summary }};{{ .Annotations.description }}
{{ end }}

?
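For illustration, with hypothetical label values and assuming one firing and one resolved alert (and that the resolved alert still carries its annotations in the webhook payload), this template would render something like:

    warning;firing;server01:9100;node-exporter;HighCPU;CPU over 80% on server01;CPU usage has been above 80% for more than 5 minutes on server01
    warning;resolved;server01:9100;node-exporter;HighCPU;CPU over 80% on server01;CPU usage has been above 80% for more than 5 minutes on server01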

fabienmagagnosc commented 1 week ago

So sorry for the delay. I have been busy with other tasks.

Basically, I can provide explanations for most of the SNMP systems, but not all.

You prefer to have two SNMP alarms:

The mapping is mostly based on different OIDs and/or fields to provide the matching, in the same way the Prometheus Alertmanager handles alarms (nothing new).

So, when alarms are actually sent, you need consistency in the alarm format to allow the third-party SNMP system to recognize them.

An example:

OID: xxx, status: firing, severity: WARN, server: server01, alarm: CPU over 80% - server01, job: node-exporter-job
OID: xxx, status: resolved, severity: WARN, server: server01, alarm: CPU over 80% - server01, job: node-exporter-job

The SNMP system can then do the mapping and cancel the alarm.

I'm working on more samples now and will send some as soon as possible.