Reworked the logic blocks to break out WARNING from CRITICAL

rjr162 commented 7 years ago

Adjusted settings of OK;SOFT to set status back to Operational
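
For anyone following along, the idea behind the two commits above can be sketched roughly as below. This is only an illustration in Python, not the plugin's actual code: the function name is hypothetical, and which Cachet component status WARNING and CRITICAL map to is an assumption. The numeric codes are Cachet's component statuses.

```python
# Cachet component status codes.
OPERATIONAL = 1
PERFORMANCE_ISSUES = 2
PARTIAL_OUTAGE = 3
MAJOR_OUTAGE = 4

# Assumed mapping (not taken from the plugin): WARNING and CRITICAL are
# handled separately instead of being lumped into one "broken" status.
STATE_TO_COMPONENT_STATUS = {
    "OK": OPERATIONAL,
    "WARNING": PARTIAL_OUTAGE,
    "CRITICAL": MAJOR_OUTAGE,
}


def component_status(service_state, state_type):
    """Return the Cachet component status for a Nagios state, or None.

    state_type ("SOFT" or "HARD") is accepted but not consulted, so an
    OK;SOFT recovery resets the component just like OK;HARD does.
    """
    return STATE_TO_COMPONENT_STATUS.get(service_state.upper())
```

States that are not in the table (for example UNKNOWN) return None, which a caller could treat as "leave the component alone".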

rjr162 commented 7 years ago

Tested these changes by having Nagios check a test VM where I started and stopped httpd. I also set up index.php so I could adjust the load time and test the -w and -c thresholds of check_http.

2Belette commented 7 years ago

Hi, thanks for the pull request. I am testing it and I see one improvement and one remaining issue. The improvement is that my component gets refreshed after a partial outage (I still need to test further). The issue is that even when my component is refreshed and goes back to 'operational', the whole Cachet system stays locked on 'Some systems are experiencing issues' and never returns to 'All systems are operational', perhaps because we cleared the entry after the SOFT state and so the new message from Nagios is discarded? Do you have the same issue with the global status in Cachet?

rjr162 commented 7 years ago

2Belette: I did notice that too, but I wasn't sure whether the issue existed before, as I didn't really pay attention to it until after I made the tweaks.

If I find some time over the next couple of days I'll do some more testing to see if I can't figure out why that's not being cleared

2Belette commented 7 years ago

Perfect ! Don't hesitate if you need me to test anything

rjr162 commented 7 years ago

Well, I can tell what the issue is (both from looking at my Cachet board and from glancing over the code of the Python-based Cachet URL monitor). The incident that's created isn't set to "fixed", so Cachet thinks something is still wrong. The Python code updates an existing incident if it becomes healthy again, so if we can get some code into the Nagios plugin to do the same, we should be golden.

2Belette commented 7 years ago

Interesting! So you think there is no way for nagios-eventhandler-cachet to tell Cachet through the API that the incident is fixed?

rjr162 commented 7 years ago

2Belette,

Looking at the API directly, Cachet was getting the whole "things are okay" deal, but wasn't clearing the "not functional" status on the incident.

I reworked some of the code so that it both creates a new incident for the component, recording when things "are OK", and finds the incident of the same name whose status isn't "Fixed" (status 4) and sets it to status 4.

I realized I had to break these out. If you just set the status to 4 with the existing code, it would change the incident to "OK" and make everything report good as expected, including the banner, but it would also overwrite all the incident information, so the incident would show the Nagios "HTTP OK" output instead of the Nagios error message you would want to display. The way I have adjusted it, it updates the "issue/error" incident to "fixed" without touching the information in that incident, and also creates a new incident that just reports things are back up, so folks can see when things broke and when they came back.
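
A rough sketch of the two-step recovery described above, written in Python for illustration only (it is not the plugin's code). It assumes the handler has already fetched the incident list from Cachet's `/api/v1/incidents` endpoint, and that Cachet accepts a PUT carrying only `status`. The function name `plan_recovery` and the returned tuple shape are hypothetical. The function only plans the calls, so nothing is sent.

```python
INCIDENT_FIXED = 4         # Cachet incident status "Fixed"
COMPONENT_OPERATIONAL = 1  # Cachet component status "Operational"


def plan_recovery(incidents, component_id, name, ok_message):
    """Return the (method, path, payload) calls to make when Nagios reports OK."""
    calls = []
    for incident in incidents:
        if incident["name"] == name and int(incident["status"]) != INCIDENT_FIXED:
            # Send only the status, so the original error text stays untouched.
            calls.append(
                ("PUT", f"/api/v1/incidents/{incident['id']}", {"status": INCIDENT_FIXED})
            )
    # Separate incident that records when the service came back.
    calls.append(
        ("POST", "/api/v1/incidents", {
            "name": name,
            "message": ok_message,
            "status": INCIDENT_FIXED,
            "visible": 1,
            "component_id": component_id,
            "component_status": COMPONENT_OPERATIONAL,
        })
    )
    return calls
```

Keeping the PUT payload down to `status` is the point of the split: the error incident keeps its Nagios error message, while the "back up" text lives in its own incident.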

Let me know if it works. I've been working with a crap keyboard since my daughter spilled water on it, and I may have missed a keyboard-induced glitch in the editing and copy/pasting.

rjr162 commented 7 years ago

One thing that needs to be tweaked, in my opinion, is the creation of new incidents. Every failed Nagios check creates a new "error" incident in Cachet, but really you only need the one error until the issue is resolved, instead of filling the screen with a ton of error messages covering the same incident.

That will be the next step if the above changes work for you like they did for me.
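
One possible way to do that de-duplication, sketched in Python as an illustration rather than as the plugin's implementation: before creating an error incident, check whether an unresolved incident of the same name already exists. The helper name is hypothetical, and matching on the incident name alone is an assumption carried over from the recovery logic described earlier.

```python
INCIDENT_FIXED = 4  # Cachet incident status "Fixed"


def needs_new_error_incident(incidents, name):
    """True only when no unresolved incident with this name exists yet."""
    return not any(
        incident["name"] == name and int(incident["status"]) != INCIDENT_FIXED
        for incident in incidents
    )
```

With a check like this, repeated failed checks during one outage would produce a single error incident, and a new one would appear only after the previous one had been marked fixed.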

2Belette commented 7 years ago

rjr162,

Many thanks for your pull request! I pulled it, tested it, and can confirm that it is working :) I have tested from Hard(Critical) -> Operational and from Soft(Warning) -> Operational.

I agree with you that it would be perfect to get only one event when there are multiple error incidents for the same component, rather than a very long list that is difficult to clean up.

So I am 100% with you on your second-step idea of aggregating everything into the same event when there are X errors for the same component.

2Belette commented 7 years ago

Hi rjr162

Just to let you know that your commit has been running on my side for 10 days, with about 40 components and multiple tests, and everything is working as expected.

I can also confirm that it would be much better to tweak it so that only one error is created in Cachet, plus one solved message, instead of a ton of errors in the list. This would be a huge improvement.

Many thanks :)

rjr162 commented 7 years ago

I have another pull request in, and this one should fix the outstanding issues from my prior commit as well as add the option to update metrics if you add a flag in Nagios for the event_handler. I realize that breaking these into two separate requests would have been the better way to do it, rather than submitting the completion of the fix to the existing issue together with a new 'feature'.

I wasn't sure whether that was a bit overkill, but it can be enabled with a -m=True flag. This writes the page load time into the metric for that component (if you have a metric for it; the name has to match, etc.):

event_handler cachet_notify! -m=true

Personally I don't care for the way metrics are auto-updated with a default value in Cachet, but so be it. I thought this might be handy for folks who do want metrics included.
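
As a rough illustration of what the metrics option has to do (a Python sketch under assumptions, not the plugin's code): pull the response time out of the check_http performance data, find the Cachet metric whose name matches, and post a point to it. It assumes the handler receives the plugin output including the perfdata after the `|`, and that the metric list comes from Cachet's `/api/v1/metrics` endpoint. Both function names are hypothetical.

```python
import re

# check_http reports its response time in perfdata as "time=<seconds>s".
_TIME_PERFDATA = re.compile(r"(?:^|\s)time=(\d+(?:\.\d+)?)s")


def response_time_seconds(plugin_output):
    """Return the check_http response time in seconds, or None if absent."""
    _, separator, perfdata = plugin_output.partition("|")
    if not separator:
        return None  # no perfdata section at all
    match = _TIME_PERFDATA.search(perfdata)
    return float(match.group(1)) if match else None


def metric_point_call(metrics, name, value):
    """Return the (method, path, payload) call for the metric named `name`."""
    for metric in metrics:
        if metric["name"] == name:
            return ("POST", f"/api/v1/metrics/{metric['id']}/points", {"value": value})
    return None  # no metric with a matching name, so nothing to update
```

Returning None when no metric name matches mirrors the "name has to match" requirement: the handler can simply skip the metrics update in that case.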

mpellegrin / nagios-eventhandler-cachet

Reworked the logic blocks to break out WARNING from CRITICAL #14