Closed: hugovalente-pm closed this issue 1 year ago.
placing as medium since we don't have many users experiencing this
@car12o @ralphm I've created this bug following up on this Slack thread
based on a call today with @hugovalente-pm @car12o @ralphm @sashwathn, some direction on this:

- we should have some integrations that won't need to be affected by the flood protection; for now this seems to only be applicable to the e-mail
- we should have some integrations for which silencing won't apply to the `CLEAR` transition; these are those that integrate with an incident management tool (PagerDuty and Opsgenie) and the Mobile App (@sashwathn please consider this for the requirements)
- when a node becomes `Offline`, for the notification integrations that integrate with an incident management tool we should clear out the incident (send a `CLEAR` from Cloud); some approaches were discussed

what I'm not sure about is whether something was discussed to address this point: "The chart for which the alert was triggered was removed, e.g. container/cgroup is no longer live, the disk or device was disconnected". do you know @car12o?
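To illustrate the last point, a minimal sketch of what "send a `CLEAR` from Cloud" could look like; every type and function name here is invented for illustration and is not the actual Cloud code:

```go
// Hypothetical sketch (names invented for illustration): when a node goes
// Offline, Cloud itself resolves the incidents it had opened on
// incident-management integrations, since the node can no longer send CLEAR.
package cloudalerts

// Incident is a stand-in for whatever Cloud stores per opened incident.
type Incident struct {
	Key  string // e.g. "<alert>+<node>" (format marked TBC in this issue)
	Open bool
}

// Notifier is an assumed interface over an incident-management client
// (PagerDuty, Opsgenie); Resolve closes the incident identified by the key.
type Notifier interface {
	Resolve(incidentKey string) error
}

// OnNodeOffline emits a synthetic CLEAR (resolve) from Cloud for every open
// incident of the node that went offline.
func OnNodeOffline(openIncidents []Incident, client Notifier) error {
	for _, inc := range openIncidents {
		if !inc.Open {
			continue
		}
		if err := client.Resolve(inc.Key); err != nil {
			return err
		}
	}
	return nil
}
```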
thanks @hugovalente-pm for the summary. there are some points where I had a different understanding, so better to clear things up before development.
> we should have some integrations that won't need to be affected by the flood protection; for now this seems to only be applicable to the e-mail

- I think that all integrations should be affected (email, slack, discord), except PagerDuty, Opsgenie & MobileApp.
> we should have some integrations for which silencing won't apply to the `CLEAR` transition; these are those that integrate with an incident management tool (PagerDuty and Opsgenie) and the Mobile App (@sashwathn please consider this for the requirements)

- integrations with an incident management tool will never be silenced, regardless of status and not only on `CLEAR`. MobileApp should be affected by silencing, as it's a requirement to be able to silence alerts through the App.

> when a node becomes `Offline`, for the notification integrations that integrate with an incident management tool we should clear out the incident (send a `CLEAR` from Cloud); some approaches were discussed
- in this case, I would suggest we clear all active incidents and create a new one for the node being offline. when it becomes online again, we recreate the incidents based on the active alerts we have in the DB; if we have none (node alerts were already pruned, after 2 days), we raise no incidents. in this last scenario, a full alerts re-sync between agent and cloud will happen, so incidents may be recreated if reported back from the agent (see the sketch after this list).

> when a node becomes `Offline`, for the notification integrations that integrate with an incident management tool we should clear out the incident (send a `CLEAR` from Cloud); some approaches were discussed

- I was thinking of processing this status only for integrations with an incident management tool.
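To make sure we are reading these answers the same way, here is a rough sketch of the proposed behaviour; all type and function names are invented for illustration and this is not the Cloud implementation:

```go
package cloudalerts

// Kind enumerates the notification integrations mentioned in this thread.
type Kind int

const (
	Email Kind = iota
	Slack
	Discord
	PagerDuty
	Opsgenie
	MobileApp
)

func isIncidentManagement(k Kind) bool { return k == PagerDuty || k == Opsgenie }

// FloodProtectionApplies: all integrations are affected by flood protection,
// except PagerDuty, Opsgenie and the Mobile App.
func FloodProtectionApplies(k Kind) bool {
	return !isIncidentManagement(k) && k != MobileApp
}

// SilencingApplies: incident-management integrations are never silenced,
// regardless of status (not only CLEAR); the Mobile App does honour silencing
// so alerts can be silenced through the App.
func SilencingApplies(k Kind) bool {
	return !isIncidentManagement(k)
}

// ActiveAlert is a stand-in for the active alerts Cloud keeps in its DB
// (pruned about 2 days after a node goes offline, per the comment above).
type ActiveAlert struct {
	Node  string
	Alert string
}

// IncidentManager is an assumed interface over PagerDuty/Opsgenie clients.
type IncidentManager interface {
	Open(key string) error
	Resolve(key string) error
}

// HandleNodeOffline clears all active incidents of the node and opens a
// single "node offline" incident in their place.
func HandleNodeOffline(node string, active []ActiveAlert, im IncidentManager) error {
	for _, a := range active {
		if err := im.Resolve(a.Alert + "+" + a.Node); err != nil {
			return err
		}
	}
	return im.Open("node-offline+" + node)
}

// HandleNodeOnline resolves the "node offline" incident and recreates
// incidents from the active alerts still in the DB. If they were already
// pruned, nothing is recreated here; the full agent<->cloud alert re-sync may
// recreate them if the agent reports them again.
func HandleNodeOnline(node string, active []ActiveAlert, im IncidentManager) error {
	if err := im.Resolve("node-offline+" + node); err != nil {
		return err
	}
	for _, a := range active {
		if err := im.Open(a.Alert + "+" + a.Node); err != nil {
			return err
		}
	}
	return nil
}
```

The `node-offline+<node>` key above is just a placeholder to show one incident replacing the cleared ones; the real key format is still TBC.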
pls share your thoughts @hugovalente-pm @ralphm @sashwathn
ok, let's speak after the daily
confirmation of what was decided:
Discussions on immediately clearing alerts for nodes marked as ephemeral will be taken care of in https://github.com/netdata/product/issues/2394
all fixes are now deployed to production.
Bug description
For Notification Integrations that integrate with an Incident Management system, where Netdata opens an incident using a key like alert+node (TBC), we have issues when the incident should be closed because the node could not report the `CLEAR` state transition. This can happen when:

- the node becomes `Offline` and the `CLEAR` notification wasn't sent
- silencing prevents the `CLEAR` notification from being sent (this is an issue when silencing is live)
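For illustration only, the alert+node key mentioned above (its exact format is marked TBC) could be composed like this; the `CLEAR`/resolve event has to reuse the same key, otherwise the Incident Management tool cannot match and close the incident:

```go
package example

import "fmt"

// incidentKey is a hypothetical composition of the dedup key discussed above;
// the real format used by Cloud is still to be confirmed (TBC).
func incidentKey(alertName, nodeID string) string {
	return fmt.Sprintf("%s+%s", alertName, nodeID) // e.g. "disk_space_usage+my-node"
}
```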
Expected behavior

When an incident is opened on such integrations, we need to make sure we are able to close the incident when a node isn't able to send the `CLEAR` state transition. A mechanism needs to be in place to ensure these incidents aren't left hanging on the Incident Management tools.
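One possible shape for such a mechanism, purely as a sketch (names invented, not a committed design): a periodic job in Cloud that resolves the open incidents of nodes that have stayed offline past a grace period.

```go
package example

import "time"

// IncidentStore is an assumed view of the incidents Cloud has opened on
// incident-management integrations.
type IncidentStore interface {
	OpenIncidentKeys(node string) []string
	MarkResolved(key string)
}

// Resolver is an assumed client for a PagerDuty/Opsgenie-style integration.
type Resolver interface {
	Resolve(key string) error
}

// ReapStaleIncidents closes incidents for nodes that have been offline longer
// than gracePeriod, so they are not left hanging when the node never sends CLEAR.
func ReapStaleIncidents(offlineSince map[string]time.Time, store IncidentStore, r Resolver, gracePeriod time.Duration) {
	now := time.Now()
	for node, since := range offlineSince {
		if now.Sub(since) < gracePeriod {
			continue
		}
		for _, key := range store.OpenIncidentKeys(node) {
			if err := r.Resolve(key); err != nil {
				continue // leave it for the next run
			}
			store.MarkResolved(key)
		}
	}
}
```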
Steps to reproduce

Node becomes `Offline` before the `CLEAR` status is sent.

Screenshots
No response
Error Logs
No response
Desktop
OS: [e.g. iOS]
Browser [e.g. chrome, safari]
Browser Version [e.g. 22]
Additional context
No response