dogsbody-josh closed this issue 3 weeks ago
@dogsbody-josh : We are making improvements to the reachability notifications (from the cloud). You may have noticed that, when an agent is streaming to a parent, we now group the reachability notifications to reduce the number that are sent out. We are also introducing configurable timeouts for reachability notifications (initially at the space level), so that users can define a timeout that accommodates upgrades and manual restarts of the agents.
cc: @car12o @juacker
Hi @sashwathn this sounds like an awesome start. Just being able to have a delay on the reachability alerts will probably reduce ~90% of our alerts.
I think it's safe to say that when @dogsbody-josh mentions "Alerts" in this ticket he is talking about reachability alerts. As such, I am a little confused by your answers to 4, 5 & 6? We are not aware of any ability to make these changes to reachability alerts?
Thank you
@dogsbody : You are right, I was talking about the silencing feature for Alerts in general. For the reachability notifications, we only have a toggle to turn them on / off, and maybe we can introduce something similar to what alerts have, such as scheduling a silencing rule. @car12o @juacker : Wdyt about silencing rules for reachability notifications?
- Control over the way alerts are delivered, including the method (not just email or mobile app) and preferably integrating similar functionality to the Agent Dispatched Notifications (Roles and alternative notification methods). Ability to silence alerts per node/room/or custom/selectable set of nodes.
This is already supported. Any notification integration configured on Cloud will deliver alerts & reachability notifications.
Regarding silencing rules (which already support schedule & recurring for alerts), we could extend this functionality for reachability as well. Sounds like a good idea to me.
Configurable reachability delay is now released. You can find it in your space settings under the Alerts & Notifications menu, Reachability tab.
@dogsbody-josh : We have now introduced configurable timeouts for reachability notifications for the space and per room. You can access this under Space Settings --> Alerts and Notifications --> Reachability.
Hope this helps.
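For readers who want to reason about how a space-level timeout with per-room settings might behave, here is a minimal sketch. It is not Netdata Cloud's implementation: the class and function names are hypothetical, and the rule that a room-level value takes precedence over the space-level value is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ReachabilitySettings:
    """Hypothetical settings: a space-wide delay plus optional per-room values."""
    space_timeout_s: int = 60
    room_timeout_s: Dict[str, int] = field(default_factory=dict)

    def timeout_for(self, room: Optional[str]) -> int:
        # Assumption: a room-level value, when present, overrides the space default.
        if room is not None and room in self.room_timeout_s:
            return self.room_timeout_s[room]
        return self.space_timeout_s


def should_notify(settings: ReachabilitySettings, room: Optional[str],
                  unreachable_since_s: float, now_s: float) -> bool:
    """Return True once the node has been unreachable for at least the timeout."""
    return (now_s - unreachable_since_s) >= settings.timeout_for(room)


# Illustrative values: 5 minutes for the space, 30 minutes for an "ephemeral" room.
settings = ReachabilitySettings(space_timeout_s=300,
                                room_timeout_s={"ephemeral": 1800})
```

With these illustrative values, a node that restarts within two minutes during an upgrade would not produce a notification, while one that stays unreachable past the configured delay would.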
Problem
Netdata Reachability Alerts are not configurable, which leads to excessive alerts during upgrades and from ephemeral nodes or nodes that are turned off and on on a schedule.
The lack of configurability is also a blocker to getting effective notifications that reach the right teams in the right way.
Finally, because Reachability Alerts are either on or off for all nodes in a room, some ephemeral nodes have to be organised differently just to work around this lack of functionality.
Description
Reachability Alerts need the following functionality:
This functionality should be centrally configurable by the 'main' account for the service. This is particularly important for Business customers, where a single main account will invite team members. Team members shouldn't have to configure Reachability Alerts individually; the alerts should be managed in one secured account location, and point 4 above would be used to control where Reachability Alerts are delivered.
Importance
blocker
Value proposition
Reachability Alerts are a critical component of a monitoring solution and deserve first class status. Knowing that a node is no longer reporting to the monitoring solution is as important as the monitoring itself. A node that is not reporting in has the potential to lose metrics and, more importantly, to not trigger health alerts.
Because of this we believe it is absolutely essential to immediately detect if a non-ephemeral node is no longer reachable so that relevant teams can immediately investigate and rectify the issue. This is even more important in parent/child setups where the parent is responsible for health alerts. Should such a parent node go unreachable it's possible that any subsequent health alerts for all nodes would not trigger.
Our aim is to avoid a situation where a node silently loses monitoring/alerting, while still being able to configure the parameters and notification options for receiving Reachability Alerts.
Proposed implementation
We don't have any specific implementation proposals other than the one alluded to above: for Business customers with multiple invited accounts under a main Space 'owner' account, the new functionality should be placed under the main owner account.