netdata / netdata-cloud


[Bug]: Unreachable alerts during upgrades #858

Open · hugovalente-pm opened this issue 1 year ago

hugovalente-pm commented 1 year ago

Bug description

A user has reported on Discord (thread) that:

I frequently get false unreachable alerts about servers which are not offline. After 15 minutes the alert is gone without any intervention. It happens once a day, and sometimes there are several days between those alerts. Anyone else with the same problem?

Seems to be related to auto-updates

most of the alerts occur at the same times each day, e.g. 07:00, 15:00 ...

Expected behavior

Unreachable alerts should only fire after a delay that accounts for the typical time an agent needs to update and restart. The current delay appears to be set to 30 seconds (TBC).
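
For illustration, here is a minimal sketch of how such a grace period could work on the Cloud side (not Netdata Cloud's actual implementation; all names are hypothetical): the disconnect is recorded immediately, but the unreachable notification is only dispatched if the node has not reconnected before the timer expires.

```python
import threading

# Hypothetical grace period (seconds) before an "unreachable" notification fires;
# this issue proposes raising it from ~30s to 90s.
UNREACHABLE_GRACE_PERIOD = 90

_pending = {}  # node_id -> threading.Timer

def on_node_disconnected(node_id, notify):
    """Record the disconnect, but only notify if the node stays gone."""
    timer = threading.Timer(UNREACHABLE_GRACE_PERIOD, notify, args=(node_id,))
    _pending[node_id] = timer
    timer.start()

def on_node_reconnected(node_id):
    """Node came back within the grace period: cancel the pending notification."""
    timer = _pending.pop(node_id, None)
    if timer is not None:
        timer.cancel()
```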


hugovalente-pm commented 1 year ago

@netdata/cloud-be we will need to review this delay, but we need to agree on a suitable value to set it to. There is currently a suggestion from @ilyam8 to increase it to 90 seconds. cc/ @ralphm

luisj1983 commented 1 year ago

Please see if you can make it a configurable delay, as some platforms can take a lot longer than others to get things done. Ideally, a v2 iteration would hook into some sort of Netdata agent health status so that we know when the agent really is back and ready to go :-)

ralphm commented 1 year ago

Unfortunately, the startup time of an Agent is correlated with the number of nodes and the data retention. I know that the Agent team has been working hard on reducing this, and I think a delay is a reasonable short-term solution.

A possible future approach is for the Agent to actively let Cloud know that it is going down for a restart or an explicit shutdown. This would allow Cloud to distinguish such cases from unexpected disconnects and apply different notification behavior.

Also, ephemeral nodes should probably not yield notifications in most cases.
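
To make the proposed distinction concrete, here is a rough cloud-side sketch, assuming a hypothetical "going down" message from the Agent; the names are illustrative only and do not reflect actual Netdata Cloud internals.

```python
import time

# Hypothetical: seconds an announced restart may take before we alert anyway.
EXPECTED_DOWNTIME_DEFAULT = 300

_announced = {}  # node_id -> (announced_at, expected_downtime)

def on_agent_going_down(node_id, expected_downtime=EXPECTED_DOWNTIME_DEFAULT):
    """Handle an Agent announcing an intentional restart or shutdown."""
    _announced[node_id] = (time.time(), expected_downtime)

def should_notify_unreachable(node_id):
    """Suppress (or downgrade) the notification if the disconnect was announced."""
    entry = _announced.get(node_id)
    if entry is not None:
        announced_at, expected_downtime = entry
        if time.time() - announced_at < expected_downtime:
            return False  # expected disconnect (upgrade/restart): stay quiet
        _announced.pop(node_id, None)  # announcement expired; alert as usual
    return True
```

The same mechanism could also let ephemeral nodes be flagged so that their disconnects never trigger notifications at all.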

ilyam8 commented 1 year ago

Yes, and it is also correlated with system resources (e.g. slow storage can significantly delay parent instances with a lot of historical data).

hugovalente-pm commented 1 year ago

ok, we probably need to see if we should tackle this in two steps:

  1. More immediate solution: add an unreachable timeout configuration per Space (?)
  2. Have proper messaging from the Agent to Cloud saying that it is going down for a restart

to try to assess the urgency of this: I don't think I've seen it being reported very often - in Discord it was @luisj1983 and another user

luisj1983 commented 1 year ago

@hugovalente-pm People probably aren't bothering to report it, IMO. After all, you can reproduce the behaviour any time you upgrade an agent, so it must be happening.

Issues

I think that there are two distinct but related issues here.

  1. The agent-reachability issue: this occurs when Netdata Cloud thinks that the agent/node is unreachable. This is the issue reported by the chap on Discord.
  2. The noisy-alarms issue: this occurs when some foreseeable action on the agent causes a flurry of spurious alarms. Foreseeable scenarios would be things like agent upgrades and, in future, any maintenance actions the agent takes (e.g. db house-keeping) which are reported to cause unnecessary alerts.

I think that the agent-reachability issue is a sub-issue of the noisy-alarms issue, because both are predicated on the way Netdata Cloud or the agent handles alerts related to agent actions (I include packaging in that, fair or not).

Priority

I'd say this is important and non-urgent. That's because it's not exactly breaking anything (although the unnecessary nature of the alert noise could be debated) and thus non-urgent, but important precisely because it introduces noise into the alerts. It's also non-urgent because there are workarounds such as maintenance windows, silencing alarms, etc.

Netdata works well to make infra and apps more visible with less noise; having activities of the agent contribute to the noise contravenes that axiom.

We are generating alerts which are completely foreseeable and avoidable in the scenario I'm talking about because they are generated by the management of the agent itself (whose packaging Netdata controls).

Now, of course, that's why in Ops we have things like maintenance windows, alarm profiles, etc., but none of that is very dynamic. I see this as a good way to differentiate Netdata from other solutions, too, and it would encourage customers to keep up to date.

Workarounds

I was looking briefly at dpkg hooks to see if I could make some changes there on my test system, which could then become a documented workaround. I'm also going to start work in the next 1-2 weeks on an Ansible role to handle maintenance scenarios. However, the problem as I see it is that, as far as I know, the Netdata agent has no queryable concept of being ready. So the problem is the result of having to use arbitrary tokens of readiness...
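
As an interim readiness proxy, one option is to poll the local agent's HTTP API until it responds again after an upgrade, e.g. from a dpkg post-invoke hook or an Ansible handler. A rough sketch, assuming the agent listens on the default port 19999 and exposes the `/api/v1/info` endpoint:

```python
import time
import urllib.request

def wait_for_agent_ready(url="http://localhost:19999/api/v1/info", timeout=300):
    """Poll the local agent API until it answers, as a rough readiness proxy."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # agent still starting (connection refused, timeout, ...)
        time.sleep(5)
    return False
```

Of course, this only proves the API is answering, not that collectors and retention are fully loaded, so it is still one of those arbitrary tokens of readiness.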

Sorry for the long reply :-)

ilyam8 commented 1 year ago

@hugovalente-pm I still think that a good step 0, until 1, 2, etc. are discussed/implemented, is increasing the timeout.

hugovalente-pm commented 1 year ago

@luisj1983 thanks for the detailed comment. I agree this is an important fix but not urgent (nothing is really breaking), but we certainly shouldn't spam users with alerts that are triggered by agent updates.

The best solution really seems to be 2.: ensuring that Cloud and the Agent have a shared understanding of when an agent is supposed to go down.

If nobody opposes, we can increase it to 90 seconds, as you had suggested, @ilyam8. @car12o @ralphm any concerns?

car12o commented 1 year ago

@hugovalente-pm I'm ok with the change, although we need to bear in mind it will delay all kinds of reachability notifications, even the ones some users may want to get paged for ASAP.

ilyam8 commented 1 year ago

We understand that increasing the timeout to 90 seconds will increase the delay of reachability notifications.

luisj1983 commented 1 year ago

I'm fine with a delay, since I'd rather get alerts that are meaningful. If this is a delay added to agent startup then it's potentially quite useful too, since we know what happens when you restart a server: you get lots of alerts because things may still be spinning up. What I would say is that I definitely wouldn't want a delay to the monitoring itself, as it's crucial to have data, especially on startup.

One thing to note is that I'd strongly recommend that this is not a default but a timeout configurable in the netdata.conf.

hugovalente-pm commented 1 year ago

One thing to note is that I'd strongly recommend that this is not a default but a timeout configurable in the netdata.conf.

This is something that needs to be controlled from Cloud, since it is Cloud that identifies the unreachable status, so it would need to be set per Space - which is somewhat more effort than changing our current setting.

luisj1983 commented 1 year ago

@hugovalente-pm OK, but doesn't the agent have to tell the cloud "Hey, I'm going sleepy-time now, don't go nuts and generate alerts"? If so, then the agent can tell the cloud how long it's going down for (the configurable value), right? Not saying it has to be in the first iteration ofc :-)

sashwathn commented 6 days ago

@luisj1983: We are working on a feature to make reachability notifications configurable (at the Space level). We also have a feature to identify agent upgrades, intentional restarts, etc., so we treat them differently from standard reachability notifications.

cc: @car12o @stelfrag