netdata / netdata-cloud

The public repository of Netdata Cloud. Contribute with bug reports and feature requests.
GNU General Public License v3.0

[Bug]: Notification integrations that open an incident have incident close issues #820

Closed hugovalente-pm closed 1 year ago

hugovalente-pm commented 1 year ago

Bug description

Notification integrations that integrate with an incident management system, and where Netdata opens an incident using a key like alert+node (TBC), have issues when the incident should be closed but the node could not report the CLEAR state transition. This can happen from:

Expected behavior

When an incident is opened on such integrations we need to make sure we are able to close the incidents when a node isn't able to send the CLEAR state transition. A mechanism needs to be in place to ensure these incidents aren't left hanging on the Incident Management tools.
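
A minimal sketch of the bookkeeping this implies, assuming the incident key really is built from alert and node (the issue marks this as TBC). The names `incident_key` and `IncidentTracker` are hypothetical and are not taken from Netdata Cloud's code. The point is only that if Cloud remembers which keys it has opened, it can find the incidents a silent node left behind and close them itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


def incident_key(alert_name: str, node_id: str) -> str:
    """Build a dedup key from alert and node (assumed format; TBC in the issue)."""
    return f"{alert_name}+{node_id}"


@dataclass
class IncidentTracker:
    """Hypothetical record of incidents Cloud has opened on an external tool."""
    open_keys: Set[str] = field(default_factory=set)

    def on_transition(self, alert_name: str, node_id: str, status: str) -> Optional[str]:
        """Return the action to send to the incident tool, or None."""
        key = incident_key(alert_name, node_id)
        if status in ("WARNING", "CRITICAL"):
            if key in self.open_keys:
                return None  # simplification: an already-open incident is left as is
            self.open_keys.add(key)
            return "open"
        if status == "CLEAR" and key in self.open_keys:
            self.open_keys.remove(key)
            return "close"
        return None

    def dangling_for_node(self, node_id: str) -> List[str]:
        """Keys still open for a node, i.e. candidates for a Cloud-side close."""
        suffix = "+" + node_id
        return sorted(k for k in self.open_keys if k.endswith(suffix))
```

If the node goes away before sending CLEAR, `dangling_for_node` is what a cleanup step would iterate over.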

Steps to reproduce

  1. Set up a PagerDuty or OpsGenie notification integration
  2. Get some alerts flowing
  3. Take the node Offline before the CLEAR status is sent

Screenshots

No response

Error Logs

No response

Additional context

No response

hugovalente-pm commented 1 year ago

placing as medium since we don't have many users experiencing this

hugovalente-pm commented 1 year ago

@car12o @ralphm I've created this bug following up on this Slack thread

hugovalente-pm commented 1 year ago

based on a call today with @hugovalente-pm @car12o @ralphm @sashwathn some direction to this:

what I'm not sure about is whether anything was discussed to address this point: "The chart for which the alert was triggered was removed, e.g. container/cgroup is no longer live, the disk or device was disconnected". do you know, @car12o?

car12o commented 1 year ago

thanks @hugovalente-pm for the summary. there are some points where I had a different understanding, so better to clear things up before development.

we should have some integrations that won't need to be affected by the flood protection, for now this seems to only be applicable to the e-mail

  • I think that all integrations should be affected (email, slack, discord), except PagerDuty, Opsgenie & MobileApp.

we should have some integrations that the silencing won't apply for the CLEAR transition, these are those that integrate with an incident management tool (PagerDuty and Opsgenie) and the Mobile App (@sashwathn please consider this for the requirements)

  • integrations with an incident management tool will never be silenced, regardless of status and not only on CLEAR. MobileApp should be affected by silencing, as it's a requirement to be able to silence alerts through the App.
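
As a sketch of how the two replies above could be encoded, here is a small per-integration policy table. It reflects only the proposal in this comment, not necessarily what was finally decided. The names `Policy`, `POLICIES` and `should_deliver` are hypothetical, and treating email, slack and discord as subject to silencing is an assumption (the comment only discusses them under flood protection).

```python
from typing import NamedTuple


class Policy(NamedTuple):
    flood_protection: bool  # may notifications be dropped by flood protection?
    silencing: bool         # may notifications be dropped by silencing rules?


# Hypothetical table based on the proposal in this comment.
POLICIES = {
    "email": Policy(flood_protection=True, silencing=True),    # silencing assumed
    "slack": Policy(flood_protection=True, silencing=True),    # silencing assumed
    "discord": Policy(flood_protection=True, silencing=True),  # silencing assumed
    "pagerduty": Policy(flood_protection=False, silencing=False),
    "opsgenie": Policy(flood_protection=False, silencing=False),
    "mobile_app": Policy(flood_protection=False, silencing=True),
}


def should_deliver(integration: str, silenced: bool, flood_limited: bool) -> bool:
    """Decide whether a notification goes out for the given integration."""
    policy = POLICIES[integration]
    if silenced and policy.silencing:
        return False
    if flood_limited and policy.flood_protection:
        return False
    return True
```

With this table, PagerDuty and Opsgenie always receive every transition, so an open incident is never left without its CLEAR because of a silencing rule or flood protection.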

when a node becomes Offline, what was discussed is that, for the notification integrations that integrate with an incident management tool, we should clear out the incident (send a CLEAR from Cloud). some approaches discussed were

  • in this case, I would suggest we clear all active incidents and create a new one for the node being offline. when it comes back online, we recreate the incidents based on the active alerts we have in the DB; if we have none (the node's alerts were already pruned, which happens after 2 days), we raise no incidents. in this last scenario, a full alerts re-sync between agent and cloud will happen, so incidents may be recreated if reported back from the agent.
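
The suggestion above can be sketched as two handlers. This is an illustration of the proposal only, not the fix that was deployed. The function names, the `node_offline+<node>` key format and the `("resolve" | "trigger", key)` event tuples are hypothetical, and resolving the offline incident when the node returns is an assumption the comment does not state explicitly.

```python
from typing import Iterable, List, Set, Tuple

Event = Tuple[str, str]  # (action, incident key); hypothetical shape


def on_node_offline(node_id: str, open_incidents: Set[str]) -> Tuple[List[Event], Set[str]]:
    """Resolve every open incident for the node, then open a single 'offline' one."""
    events: List[Event] = [("resolve", key) for key in sorted(open_incidents)]
    offline_key = f"node_offline+{node_id}"  # assumed key format
    events.append(("trigger", offline_key))
    return events, {offline_key}


def on_node_online(node_id: str, active_alerts_in_db: Iterable[str]) -> List[Event]:
    """Close the 'offline' incident (assumption) and recreate incidents from the DB.

    If the alerts were already pruned, the iterable is empty and nothing is
    recreated here; a later agent/cloud re-sync may still raise incidents.
    """
    events: List[Event] = [("resolve", f"node_offline+{node_id}")]
    events += [("trigger", f"{alert}+{node_id}") for alert in sorted(active_alerts_in_db)]
    return events
```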

when a node becomes Offline, what was discussed is that, for the notification integrations that integrate with an incident management tool, we should clear out the incident (send a CLEAR from Cloud). some approaches discussed were

  • I was thinking of processing this status only for integrations with an incident management tool.

pls share your thoughts @hugovalente-pm @ralphm @sashwathn

hugovalente-pm commented 1 year ago

ok, let's speak after the daily

hugovalente-pm commented 1 year ago

confirmation on what was decided:

Discussions on immediately clearing alerts for nodes marked as ephemeral will be handled in https://github.com/netdata/product/issues/2394

car12o commented 1 year ago

all fixes are now deployed to production.