Open hugovalente-pm opened 1 year ago
@netdata/cloud-be we will need to review this delay but need to agree on the suitable time to set it, there is currently a suggestion from @ilyam8 to set this up to 90 seconds. cc/ @ralphm
Please see if you can make it a configurable delay as some platforms can take a lot longer than others to get things done. Ideally a v2 iteration would be to hook into some sort of netdata agent health status so that we know when the agent really is back and ready to go :-)
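The configurable-delay-plus-readiness-check idea could be sketched roughly like this (a sketch only: the `wait_for_agent` helper and its defaults are hypothetical, and the default probe merely assumes the local Agent API answers on `/api/v1/info` once the Agent is up):

```python
import time
import urllib.request

def wait_for_agent(url="http://localhost:19999/api/v1/info",
                   timeout=90, probe=None, interval=1):
    """Poll until the Agent answers or a configurable deadline passes.

    `probe` defaults to an HTTP GET against the local Agent API, but can
    be swapped for any readiness check (a v2 could query a real Agent
    health status). Returns True once ready, False on timeout.
    """
    if probe is None:
        def probe():
            try:
                urllib.request.urlopen(url, timeout=2)
                return True
            except OSError:
                return False
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False
```

The point of the `timeout` parameter is exactly the configurability asked for above: slower platforms simply pass a larger value.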
Unfortunately, startup time of an Agent is correlated with the amount of nodes and data retention. I know that the Agent team has been working hard on reducing this, and I think a delay is a reasonable short term solution.
A possible future approach is if the Agent could actively let Cloud know that it is going down for a restart or explicit shutdown. This allows Cloud to distinguish this from unexpected disconnects and have different notification behavior.
Also, ephemeral nodes should probably not yield notifications in most cases.
Yes, and it is correlated with the system resources (e.g. slow storage can significantly delay parent instances with a lot of historical data).
OK, we probably need to see whether to tackle this in two steps:
to try to assess the urgency of this: I don't think I've seen this being reported very often - in Discord it was @luisj1983 and one other user
@hugovalente-pm IMO, people probably aren't bothering to report it. After all, you can reproduce the behaviour any time you upgrade an agent, so it must be happening.
I think that there are two distinct but related issues here.
I think that the agent-reachability issue is a sub-issue of the noisy alarms, because both are predicated on the way Netdata Cloud or the Agent handles alerts related to agent actions (I include packaging in that, fair or not).
I'd say this is important but non-urgent. Non-urgent because nothing is exactly breaking (although the unnecessary nature of the alert noise could be debated) and because there are workarounds such as maintenance windows, silencing alarms, etc.; important precisely because it introduces noise into the alerts.
Netdata works well to make infra and apps more visible with less noise; having activities of the agent contribute to the noise contravenes that axiom.
We are generating alerts which are completely foreseeable and avoidable in the scenario I'm talking about because they are generated by the management of the agent itself (whose packaging Netdata controls).
Now, of course, that's why in Ops we have things like maintenance windows and alarm profiles etc but none of that is very dynamic. I see this as a good way to differentiate Netdata from other solutions too and would encourage customers to keep up-to-date.
I was looking briefly at dpkg-hooks to see if I could make some changes there on my test system, and then that could be a documented workaround. I'm also going to start work in the next 1-2 weeks on an Ansible role to handle maintenance scenarios. However, the problem, as I see it, is that (as far as I know) the Netdata agent has no queryable concept of being ready, so we're forced to use arbitrary tokens of readiness...
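For reference, a dpkg-level workaround could be hooked in via apt's `DPkg::Pre-Invoke`/`Post-Invoke` options, calling Netdata's health management API to silence alerts around package operations. This is only a sketch: it assumes the health management API is enabled and a token (`TOKEN` below) has been configured, and it silences around *all* dpkg runs, not just Netdata upgrades.

```
# /etc/apt/apt.conf.d/99netdata-silence (sketch)
DPkg::Pre-Invoke  { "curl -s -H 'X-Auth-Token: TOKEN' 'http://localhost:19999/api/v1/manage/health?cmd=SILENCE ALL' || true"; };
DPkg::Post-Invoke { "curl -s -H 'X-Auth-Token: TOKEN' 'http://localhost:19999/api/v1/manage/health?cmd=RESET' || true"; };
```

Note this only quiets Agent-side alerts; it does nothing for Cloud-side unreachable notifications, which is the readiness gap described above.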
Sorry for the long reply :-)
@hugovalente-pm I still think that a good step 0, until steps 1, 2, etc. are discussed/implemented, is increasing the timeout.
@luisj1983 thanks for the detailed comment. I agree this is an important fix but not urgent (nothing is really breaking), but we definitely shouldn't spam users with alerts triggered by agent updates.
The best solution really seems to be 2., ensuring that Cloud and the Agent agree on when an agent is supposed to go down.
if nobody opposes we can increase it to 90 seconds, as you had suggested, @ilyam8 @car12o @ralphm any concerns?
@hugovalente-pm I'm OK with the change, although we need to bear in mind it will delay all kinds of reachability notifications, even the ones some users may want to be paged about ASAP.
We understand that increasing the timeout to 90 seconds will delay all reachability notifications accordingly.
I'm fine with a delay, since I'd rather get alerts that are meaningful. If this is a delay added at agent startup, it's potentially quite useful too, since we know what happens when you restart a server: you get lots of alerts because things may still be spinning up. What I would say is that I definitely wouldn't want a delay to the monitoring itself, as having that data is crucial, especially on startup.
One thing to note is that I'd strongly recommend that this is not a default but a timeout configurable in the netdata.conf.
This is something that needs to be controlled from Cloud, since it is Cloud that identifies the unreachable status. It would therefore need to be set per Space, which is rather more effort than changing our current setting.
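The Cloud-side grace window being discussed could be sketched like this (a sketch only: the `UnreachableNotifier` class and its names are hypothetical, with the proposed 90 seconds as the default and the delay configurable, e.g. per Space):

```python
import time

class UnreachableNotifier:
    """Grace window for unreachable notifications: a disconnect only
    produces a notification if the node has not reconnected within
    `delay` seconds (e.g. because its agent was just updating)."""

    def __init__(self, delay=90, clock=time.monotonic):
        self.delay = delay
        self.clock = clock
        self.disconnected_at = {}  # node_id -> time of disconnect

    def on_disconnect(self, node_id):
        # Start (or restart) the grace window for this node.
        self.disconnected_at[node_id] = self.clock()

    def on_reconnect(self, node_id):
        # Reconnected in time: cancel the pending notification.
        self.disconnected_at.pop(node_id, None)

    def due_notifications(self):
        """Return node ids whose grace window has expired."""
        now = self.clock()
        due = [n for n, t in self.disconnected_at.items()
               if now - t >= self.delay]
        for n in due:
            del self.disconnected_at[n]
        return due
```

An Agent that announces an intentional shutdown (as suggested earlier in the thread) would simply extend or skip the window for that node instead of using the fixed default.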
@hugovalente-pm OK, but doesn't the agent have to tell Cloud "Hey, I'm going sleepy-time now, don't go nuts and generate alerts"? If so, then the agent can tell Cloud how long it's going down for (the configurable value), right? Not saying it has to be in the first iteration, ofc :-)
@luisj1983 : We are working on this feature to be able to configure reachability notifications (at space level). We also have this feature to identify agent upgrades, intentional restarts etc - so we treat them differently from standard reachability notifications.
cc: @car12o @stelfrag
Bug description
A user has reported on discord (thread) that:
Seems to be related to auto-updates
Expected behavior
Unreachable alerts should be fired only after a delay that accounts for the most common time it takes an agent to update and restart. The current delay seems to be set to 30 seconds (TBC).
Steps to reproduce
1.
Screenshots
No response
Error Logs
No response
Desktop
OS: [e.g. iOS]
Browser [e.g. chrome, safari]
Browser Version [e.g. 22]
Additional context
No response