netdata / netdata-cloud

The public repository of Netdata Cloud. Contribute with bug reports and feature requests.
GNU General Public License v3.0

[Feat]: Cloud based Reachability Alerts - user configurable functionality improvements #1039

Closed dogsbody-josh closed 3 weeks ago

dogsbody-josh commented 2 months ago

Problem

The problem is that Netdata Reachability Alerts are not configurable, which leads to excessive alerts during upgrades and from ephemeral nodes or nodes that are turned off and on on a schedule.

The lack of configurability is also a blocker to getting effective notifications that reach the right teams in the right way.

Finally, because Reachability Alerts are either on or off for all nodes in a room, some ephemeral nodes have to be organised differently just to accommodate this lack of functionality.

Description

Reachability Alerts need the following functionality:

  1. Configurable per individual node, room or custom/selectable set of nodes across rooms. All the points below should be configurable in this way too.
  2. Conditions for triggering the alert should be configurable. Configuration options should include (at least) a customisable delay before triggering. This will solve issues like #858 and is of particular importance to parent/child setups where a restart of the parent agent can cascade hundreds of Reachability Alerts.
  3. Control over the content of the notification, including subject and body.
  4. Control over the way alerts are delivered, including the method (not just email or mobile app) and preferably integrating similar functionality to the Agent Dispatched Notifications (Roles and alternative notification methods).
  5. Ability to silence alerts per node/room/or custom/selectable set of nodes.
  6. Silences should be schedule-able, and have a 'recurring' functionality. This is so that nodes that are switched off overnight, or during some other recurring time period, can be silenced appropriately.
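To illustrate point 2, here is a minimal sketch of a "delay before triggering" check. It is not Netdata's implementation; the function name `should_notify` and its parameters are hypothetical and chosen only for this example.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_notify(unreachable_since: Optional[datetime],
                  now: datetime,
                  delay: timedelta) -> bool:
    """Return True once a node has been unreachable for at least `delay`.

    `unreachable_since` is None while the node is reachable (hypothetical
    convention for this sketch).
    """
    if unreachable_since is None:
        return False
    return now - unreachable_since >= delay


# Example: a node went down at 12:00 and the delay is 5 minutes.
went_down = datetime(2024, 1, 1, 12, 0, 0)
delay = timedelta(minutes=5)

# A 2-minute restart during an upgrade stays below the delay.
short_outage = should_notify(went_down, went_down + timedelta(minutes=2), delay)
# A 6-minute outage exceeds it.
long_outage = should_notify(went_down, went_down + timedelta(minutes=6), delay)
```

With a delay like this, a short restart during an upgrade would recover before the threshold and never produce a notification.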

This functionality should be centrally configurable by the 'main' account for the service. This is particularly important for Business customers, where a single main account will invite team members. Team members shouldn't individually have to configure Reachability Alerts; they should be managed in one secured account location, and point 4 above would be used to control where Reachability Alerts are delivered.

Importance

blocker

Value proposition

Reachability Alerts are a critical component of a monitoring solution and deserve first class status. Knowing that a node is no longer reporting to the monitoring solution is as important as the monitoring itself. A node that is not reporting in has the potential to lose metrics and, more importantly, to not trigger health alerts.

Because of this we believe it is absolutely essential to immediately detect if a non-ephemeral node is no longer reachable so that relevant teams can immediately investigate and rectify the issue. This is even more important in parent/child setups where the parent is responsible for health alerts. Should such a parent node go unreachable it's possible that any subsequent health alerts for all nodes would not trigger.

Our aim is to avoid a situation where a node silently loses monitoring/alerting, while still being able to configure the parameters and notification options for receiving Reachability Alerts.

Proposed implementation

We don't have any specific implementation proposals, other than the one alluded to above: for Business customers with multiple invited accounts under a main Space 'owner' account, the new functionality should be placed under the main owner account.

sashwathn commented 2 months ago

@dogsbody-josh : We are making improvements to the reachability notifications (from the cloud). You may have noticed that we are grouping the reachability notifications if an agent is streaming to a parent to reduce the number of notifications that are sent out. We are also introducing configurable timeouts for reachability notifications (for the space initially) where the user can define a timeout that can cater to upgrades / manual restart of the agents.

  1. Configurable per individual node, room or custom/selectable set of nodes across rooms. All the points below should be configurable in this way too. --> We will make this configurable per space initially and will introduce additional configurations later on.
  2. Conditions for triggering the alert should be configurable. Configuration options should include (at least) a customisable delay before triggering. This will solve issues like [Bug]: Unreachable alerts during upgrades #858 and is of particular importance to parent/child setups where a restart of the parent agent can cascade hundreds of Reachability Alerts. --> I think this already exists for alerts. I assume you are referring to reachability notifications (which are not alerts in Netdata terminology); these will be solved with the configurable timeouts mentioned in 1.
  3. Control over the content of the notification, including subject and body. --> This is not on our roadmap for now and we will try and look at this later on.
  4. Control over the way alerts are delivered, including the method (not just email or mobile app) and preferably integrating similar functionality to the Agent Dispatched Notifications (Roles and alternative notification methods). --> These notifications are already configurable for user-specific or organisation-specific integrations. Reachability notifications are not supported on the agent dispatched notifications.
  5. Ability to silence alerts per node/room/or custom/selectable set of nodes. --> This already exists as part of our silencing feature on Netdata Cloud (only available on paid plans though)
  6. Silences should be schedule-able, and have a 'recurring' functionality. This is so that nodes that are switched off over night or other particular recurring time period can be silenced appropriately. --> We have the scheduling feature already and the recurring functionality will be introduced soon.
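The grouping of reachability notifications for agents streaming to a parent, mentioned at the top of this comment, can be pictured with the sketch below. This is only an illustration of the idea, not how Netdata Cloud does it; `group_by_parent` and the node names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def group_by_parent(
    unreachable: List[Tuple[str, Optional[str]]]
) -> Dict[str, List[str]]:
    """Group unreachable nodes so one notification can cover each group.

    Each entry is (node, parent). A node with no parent (parent is None)
    forms its own group, keyed by its own name.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for node, parent in unreachable:
        groups[parent or node].append(node)
    return dict(groups)


# A parent restart takes its two children offline with it;
# a standalone node is also down.
events = [
    ("parent-1", None),
    ("child-a", "parent-1"),
    ("child-b", "parent-1"),
    ("standalone-1", None),
]
grouped = group_by_parent(events)
```

Here four unreachable nodes collapse into two groups, so only two notifications would be sent instead of four.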

cc: @car12o @juacker

dogsbody commented 2 months ago

Hi @sashwathn this sounds like an awesome start. Just being able to have a delay on the reachability alerts will probably reduce ~90% of our alerts.

I think it's safe to say that when @dogsbody-josh mentions "Alerts" in this ticket he is talking about reachability alerts. As such, I am a little confused by your answers to 4, 5 & 6? We are not aware of any ability to make these changes to reachability alerts?

Thank you

sashwathn commented 2 months ago

@dogsbody : You are right, I was talking about the silencing feature for Alerts in general. For the reachability notifications, we only have a toggle to turn them on / off, and maybe we can introduce something similar to the alerts to schedule a silencing rule etc. @car12o @juacker : Wdyt about the silencing rules for reachability notifications?

car12o commented 1 month ago

  4. Control over the way alerts are delivered, including the method (not just email or mobile app) and preferably integrating similar functionality to the Agent Dispatched Notifications (Roles and alternative notification methods).
  5. Ability to silence alerts per node/room/or custom/selectable set of nodes.

This is already supported. Any notification integration configured on Cloud will deliver alerts & reachability notifications.

Regarding silencing rules (which already support schedule & recurring for alerts), we could extend this functionality for reachability as well. Sounds like a good idea to me.
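As a rough illustration of what a recurring silencing rule for reachability would need to evaluate (for example, nodes switched off overnight), here is a minimal sketch. It is an assumption-laden example, not the Cloud's silencing rule engine; `is_silenced` and its parameters are hypothetical.

```python
from datetime import datetime, time
from typing import Set


def is_silenced(now: datetime, start: time, end: time,
                weekdays: Set[int]) -> bool:
    """Check whether `now` falls inside a recurring silence window.

    `weekdays` holds the days (Monday=0) on which the window *starts*.
    If `start` is later than `end`, the window crosses midnight.
    """
    t = now.time()
    if start <= end:
        # Same-day window, e.g. 09:00-17:00.
        return now.weekday() in weekdays and start <= t < end
    if t >= start:
        # Evening part of an overnight window.
        return now.weekday() in weekdays
    if t < end:
        # Morning part: the window started on the previous day.
        return (now.weekday() - 1) % 7 in weekdays
    return False


# Nodes are switched off 22:00-06:00 on weeknights (Mon-Fri).
start, end = time(22, 0), time(6, 0)
weeknights = {0, 1, 2, 3, 4}

monday_night = is_silenced(datetime(2024, 1, 1, 23, 0), start, end, weeknights)
tuesday_noon = is_silenced(datetime(2024, 1, 2, 12, 0), start, end, weeknights)
```

The overnight case is the one that needs care: the early-morning hours belong to the window that started the previous evening.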

car12o commented 1 month ago

Configurable reachability delay is now released. You can find it in your space settings under the Alerts & Notifications menu, Reachability tab.

sashwathn commented 3 weeks ago

@dogsbody-josh : We have now introduced configurable timeouts for reachability notifications for the space and per room. You can access this under Space Settings --> Alerts and Notifications --> Reachability.

Hope this helps.