openwisp / openwisp-monitoring

Network monitoring system written in Python and Django, designed to be extensible, programmable, scalable and easy to use by end users: once the system is configured, monitoring checks, alerts and metric collection happen automatically.
https://openwisp.io/docs/dev/monitoring/

[feature] Handle cases in which devices are receiving metrics but not reachable via ping #566

Open nemesifier opened 5 months ago

nemesifier commented 5 months ago

There can be a conflicting situation in which a device is not reachable on the management IP but is successfully sending metrics to the server.

Because of the recovery detection feature, this generates additional load on the server: as soon as metrics or checksum requests are received, the system schedules a ping, because it believes it will be able to reach the device and set the status back to OK. That won't happen, since the device is still unreachable on the management IP.

If many devices are in this situation, the monitoring queue can grow indefinitely until it consumes all the available memory, at which point the server will crash.

We need to devise a way to spot these situations and set the status to "PROBLEM".

In this case, the ping check should not set the status to CRITICAL even if the ping fails, unless no metrics have been received for more than 10 minutes.

The device recovery mechanism should not be triggered if the status of the device is not critical.
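A minimal sketch of that guard, assuming a hypothetical hook that runs whenever metrics or a checksum request arrive. The function name, the `schedule_ping` callback, and the lowercase status strings are illustrative assumptions, not the project's actual API:

```python
def on_metrics_received(device_status, schedule_ping):
    """Hypothetical hook: decide whether to enqueue a recovery ping.

    `device_status` is assumed to be a plain string such as "ok",
    "problem" or "critical"; `schedule_ping` is any callable that
    enqueues the ping check (for example a task's delay method).
    Returns True if a ping was scheduled, False otherwise.
    """
    # Only a CRITICAL device can "recover"; for any other status a
    # ping would just add work to the monitoring queue.
    if device_status != "critical":
        return False
    schedule_ping()
    return True
```

With a guard like this, a device parked in PROBLEM keeps reporting metrics without adding a new ping to the queue on every request.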

Maybe we could solve this by simply modifying the ping check so that it looks at whether the device has recently been sending monitoring metrics before deciding whether to set the status to CRITICAL or PROBLEM.
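A rough sketch of what that decision could look like, written as a standalone function rather than against the real check class. The name `decide_status`, its parameters, and the lowercase status strings are assumptions made for illustration; only the 10 minute threshold comes from the description above:

```python
from datetime import datetime, timedelta, timezone

# Threshold from the description above: after 10 minutes without
# metrics, a failed ping is treated as a real outage.
METRICS_GRACE = timedelta(minutes=10)


def decide_status(ping_ok, last_metrics_at, now, grace=METRICS_GRACE):
    """Return the status a ping check would set (hypothetical helper).

    ping_ok: whether the ping to the management IP succeeded.
    last_metrics_at: datetime of the last metrics received, or None.
    now: current datetime, passed in to keep the function testable.
    """
    if ping_ok:
        return "ok"
    # Ping failed, but the device is still talking to the server:
    # flag it as PROBLEM instead of CRITICAL.
    if last_metrics_at is not None and now - last_metrics_at <= grace:
        return "problem"
    # Ping failed and no recent metrics: the device is really down.
    return "critical"


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(decide_status(False, now - timedelta(minutes=5), now))   # problem
print(decide_status(False, now - timedelta(minutes=11), now))  # critical
```

How "last metrics received" would be looked up in the real code base (a model field, a cache key, a time series query) is left open here.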

SanjayKumar-M commented 4 months ago

Hey @nemesifier, I would like to work on this!