pulibrary / princeton_ansible

Ansible Roles and Playbooks for Princeton University Library

[CheckMK] build and automate notification downtimes for cron reboots #5101

Open acozine opened 2 months ago

acozine commented 2 months ago

User story

Several of our systems have automatic reboots configured as cron jobs. As an operations engineer, I do not want to get notifications from our monitoring service when those systems reboot - it's not an outage, it's expected behavior.

Acceptance criteria

Concrete example

library-staging1 reboots just before midnight, Central time.
library-staging2 reboots an hour later.

checkmk
APP  11:30 PM
Service PROBLEM notification
Host: [library-staging1.princeton.edu](http://library-staging1.princeton.edu/) (IP: 128.112.203.46)
Service: Check_MK
State: CRITICAL
Additional Info
[agent] Communication failed: [Errno 111] Connection refused(!!), [piggyback] Success (but no data found for this host), Missing monitoring data for all plugins(!), execution time 0.0 sec
Please take a look: @tmincher, @ansible
Check_MK notification: Fri Jul 5 00:30:36 EDT 2024

[11:31 PM](https://pulibrary.slack.com/archives/C04V5DARS2E/p1720153902086789)
Service RECOVERY notification
Host: library-staging1.princeton.edu (IP: 128.112.203.46)
Service: Check_MK
State: OK
Additional Info
[agent] Success, [piggyback] Success (but no data found for this host), execution time 3.4 sec
Please take a look: @tmincher, @ansible
Check_MK notification: Fri Jul 5 00:31:40 EDT 2024

acozine commented 1 month ago

Here's a module that looks like the right thing: https://galaxy.ansible.com/ui/repo/published/checkmk/general/content/module/downtime/
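
For discussion, here is a minimal sketch of how that module might be used for the library-staging1 case above. It is not a tested implementation. The `checkmk_*` variable names, the play layout, and the 10-minute window are assumptions for illustration, and the parameter names should be checked against the documentation for the installed version of the `checkmk.general` collection. The module schedules a one-time downtime, so this sketch also assumes something (for example the same cron schedule) runs the play shortly before each reboot.

```yaml
# Sketch only: the checkmk_* variables, the window length, and the idea of
# running this play just before the cron reboot are assumptions.
- name: Schedule a short Checkmk downtime around a planned reboot
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Put library-staging1 into downtime for the reboot window
      checkmk.general.downtime:
        server_url: "{{ checkmk_server_url }}"          # hypothetical variable
        site: "{{ checkmk_site }}"                      # hypothetical variable
        automation_user: "{{ checkmk_automation_user }}"      # hypothetical variable
        automation_secret: "{{ checkmk_automation_secret }}"  # hypothetical variable
        host_name: library-staging1.princeton.edu
        comment: Nightly cron reboot
        # Start one minute from now and end ten minutes after the start.
        start_after:
          minutes: 1
        end_after:
          minutes: 10
        state: present
```

The window only needs to cover the gap between the PROBLEM and RECOVERY notifications, which was about a minute in the example above, so 10 minutes leaves a generous margin without hiding a host that fails to come back.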