Recurring incidents - Githubissues

prokoba commented 9 years ago

Greetings, you have done a great job here! One issue: when an identical incident repeats more than once, the status change to "Fixed" is applied only to the first occurrence. The subsequent incidents are not updated.

mpellegrin commented 9 years ago

Hello, and thanks for your feedback.

It's true that on a "HARD" state the script creates the incident regardless of whether similar incidents have already been sent.

However, if you have duplicate incidents, it means you have attached the event handler to the same service on multiple hosts, all mapped to a single Cachet component. As I do not collect the host name, it's hard to know which incident has to be updated or closed based only on the name and description of the service that triggers RECOVERY. That's why I assumed that the first incident I find is the right one (the handler does not keep track of open incidents).
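To illustrate the ambiguity described above, here is a minimal sketch of a "first match wins" lookup. It is not the handler's actual code: the function name, the list-of-dicts shape, and the numeric value used for "Fixed" are all assumptions made for the example.

```python
# Assumed numeric value for the "Fixed" status; check it against your Cachet version.
FIXED = 4


def first_open_incident(incidents, name, fixed_status=FIXED):
    """Return the first incident with this name that is not yet fixed, else None."""
    for incident in incidents:
        if incident["name"] == name and int(incident["status"]) != fixed_status:
            return incident
    return None


# Two hosts running the same service produce two indistinguishable incidents.
incidents = [
    {"id": 39, "name": "HTTP", "status": "1"},  # hypothetically opened for host A
    {"id": 43, "name": "HTTP", "status": "1"},  # hypothetically opened for host B
]

# Whichever host recovers, the lookup always lands on incident 39.
match = first_open_incident(incidents, "HTTP")
```

Because the host name is not part of the lookup, a RECOVERY from either host resolves the same incident and the other one stays open.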

If I instead checked whether an incident already exists before creating a new one, a single incident would cover all hosts, and it would be resolved as soon as the service hits RECOVERY on any one of them, even if other hosts are still down on that service.

The only way to resolve this would be to pass the host as a parameter and change the script to store this information in a state file and/or send it to Cachet, so that it can find the right incident to update afterwards.

Can you confirm that you attached the same Nagios service on multiple Nagios hosts?

prokoba commented 9 years ago

Hello, the event handler is attached to a single host and called by one service. I guess I missed the whole idea that the incidents represent a timeline, each one following and updating the status of the first. I thought that the first incident should be updated instead of adding a new one.

Thank you for the quick response!

mpellegrin commented 9 years ago

In that case, it means the event handler didn't fire properly on RECOVERY. With a single-host, single-service setup, the event handler should set the status to "Fixed" upon RECOVERY, before HARD CRIT/WARN is triggered again (it seems logical that the service has to trigger RECOVERY before triggering a CRIT/WARN again).

From https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/eventhandlers.html :

It should be noted that the event handler will only be executed the first time that the service falls into a HARD problem state. This prevents Nagios from continuously executing the script to restart the web server if the service remains in a HARD problem state.

Or your services may be flapping from one state to another, but the event handler should still trigger when flapping.

Did you see the event handler triggering in the Nagios logs? If not, please enable:

log_host_retries=1
log_service_retries=1
log_event_handlers=1

in the Nagios configuration (and restart Nagios), then show us the corresponding lines from the log and a screenshot of the incidents created in Cachet (the bug has to occur again for the logs to be meaningful).

Yes, Cachet shows a timeline, but that doesn't mean you don't have to close the open incidents: if there is still one unresolved incident in the timeline, the status at the top of the page will still be red, so it should be fixed.

I will try to improve the script in the coming days or weeks to accept the host as a parameter, to deal with the multiple-hosts issue. Having the full host+service state will make it easier to find the right incident. Also, because I don't like showing hosts on public pages, I will implement this with a state file that stores the handled incidents. Maybe it will solve your issue by itself; if not, it will give us more clues about what is going on (every created incident will be tracked until resolution).
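As a rough illustration of the state-file idea described above (not the script's actual implementation), the handler could keep a small JSON file mapping each host+service pair to the Cachet incident it opened. The file name, the key format, and every function name below are hypothetical.

```python
import json
import os
import tempfile

STATE_FILE = "cachet_incidents.json"  # hypothetical location of the state file


def _key(host, service):
    """Build the lookup key; the host never leaves the local state file."""
    return f"{host};{service}"


def load_state(path=STATE_FILE):
    """Return the host+service -> incident id mapping, or {} if no file exists yet."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def save_state(state, path=STATE_FILE):
    """Write to a temporary file, then rename, so a crash cannot truncate the state."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)


def remember_incident(host, service, incident_id, path=STATE_FILE):
    """On a HARD problem: record which incident was opened for this host+service."""
    state = load_state(path)
    state[_key(host, service)] = incident_id
    save_state(state, path)


def pop_incident(host, service, path=STATE_FILE):
    """On RECOVERY: return the tracked incident id (or None) and stop tracking it."""
    state = load_state(path)
    incident_id = state.pop(_key(host, service), None)
    save_state(state, path)
    return incident_id
```

With this in place, a RECOVERY for one host only resolves the incident that was opened for that host, and the host name is stored locally rather than published to the status page.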

prokoba commented 9 years ago

I figured it out after playing with POST and PUT requests to Cachet and reading further through the comments on the Cachet issues. Each incident and component gets a unique id when created, and they have to be updated using that id.

Updating the status of incident 39 as an example:

curl -X PUT -H "Content-Type: application/json;" -H "X-Cachet-Token: secret" -d '{"status":"3","visible":"1"}' http://localhost/api/v1/incidents/39

When posting a new incident, the server returns the id in the response (tested in the console), 43 in this case:

"data":{"name":"880","message":"some message","status":"1","visible":1,"updated_at":"2015-08-28 13:21:48","created_at":"2015-08-28 13:21:48","id":43,"human_status":"Investigating","scheduled_at":"2015-08-28 13:21:48"}}
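For anyone scripting this, the two steps above might look roughly like the following Python sketch. The URL, token, and payload mirror the curl example; the function names are made up for illustration, and the parser assumes the complete response body is a JSON object with a top-level "data" key.

```python
import json
import urllib.request

CACHET_URL = "http://localhost/api/v1"  # base URL taken from the curl example
CACHET_TOKEN = "secret"                 # placeholder token, as in the curl example


def incident_id_from_response(body):
    """Extract the incident id from the JSON body returned when creating an incident."""
    return json.loads(body)["data"]["id"]


def build_update_request(incident_id, status, visible=1):
    """Build (but do not send) the PUT request that updates one incident by id."""
    payload = json.dumps({"status": status, "visible": visible}).encode("utf-8")
    return urllib.request.Request(
        f"{CACHET_URL}/incidents/{incident_id}",
        data=payload,
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "X-Cachet-Token": CACHET_TOKEN,
        },
    )


# Same update as the curl example; pass the result to
# urllib.request.urlopen(request) to actually send it.
request = build_update_request(39, 3)
```

Keeping the id returned at creation time is what makes the later update unambiguous, since the PUT addresses the incident directly instead of searching for it by name.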

Hope the information helps.

chooko commented 7 years ago

For people coming to this after the fact, I'd like to note that the following is no longer true in Cachet:

"If there is still one unresolved incident in timeline, the status on the top of the page will still be red, so it should be fixed."

You now have multiple incidents that show as a timeline, and when the final incident is marked as fixed, the status page returns to green.

mpellegrin / nagios-eventhandler-cachet

Recurring incidents #1