openwisp / openwisp-monitoring

Network monitoring system written in Python and Django, designed to be extensible, programmable, scalable and easy to use by end users: once the system is configured, monitoring checks, alerts and metric collection happens automatically.
https://openwisp.io/docs/dev/monitoring/

Uptime, Packet loss and Round Trip Time charts stop being added on devices. #555

Closed momothefox closed 9 months ago

momothefox commented 9 months ago

I am using the latest stable release. On a fresh installation the charts were working, but after a while they stopped working.

After a server migration I had to reinstall and restore the database backup. After a while the charts stopped working again.

Could you help investigate this?

https://github.com/openwisp/openwisp-monitoring/assets/25464943/55604df5-2a42-40e6-8d36-3731e8b576bd

Sat Nov 25 08:53:44 2023 daemon.err openwisp-monitoring[4901]: > POST /api/v1/monitoring/device/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/?key=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx&time=25-11-2023_08:53:43.000000&current=true HTTP/1.1
Sat Nov 25 08:53:44 2023 daemon.err openwisp-monitoring[4901]: > Host: xxx.xxx.com
Sat Nov 25 08:53:44 2023 daemon.err openwisp-monitoring[4901]: > User-Agent: curl/8.4.0
Sat Nov 25 08:53:44 2023 daemon.err openwisp-monitoring[4901]: > Accept: */*
Sat Nov 25 08:53:44 2023 daemon.err openwisp-monitoring[4901]: > Content-Type: application/json
Sat Nov 25 08:53:44 2023 daemon.err openwisp-monitoring[4901]: > Content-Length: 8381
Sat Nov 25 08:53:44 2023 daemon.err openwisp-monitoring[4901]: >
Sat Nov 25 08:53:44 2023 daemon.err openwisp-monitoring[4901]: } [8381 bytes data]
Sat Nov 25 08:53:44 2023 daemon.err openwisp-monitoring[4901]: < HTTP/1.1 200 OK
Sat Nov 25 08:53:44 2023 daemon.err openwisp-monitoring[4901]: < Server: nginx
Sat Nov 25 08:53:44 2023 daemon.err openwisp-monitoring[4901]: < Date: Sat, 25 Nov 2023 08:53:44 GMT
Sat Nov 25 08:53:44 2023 daemon.err openwisp-monitoring[4901]: < Content-Length: 0
Sat Nov 25 08:53:44 2023 daemon.err openwisp-monitoring[4901]: < Connection: keep-alive
Sat Nov 25 08:53:44 2023 daemon.err openwisp-monitoring[4901]: < Vary: Accept, Cookie
Sat Nov 25 08:53:44 2023 daemon.err openwisp-monitoring[4901]: < Allow: GET, POST, HEAD, OPTIONS
Sat Nov 25 08:53:44 2023 daemon.err openwisp-monitoring[4901]: < X-Frame-Options: DENY
Sat Nov 25 08:53:44 2023 daemon.err openwisp-monitoring[4901]: < X-Content-Type-Options: nosniff
Sat Nov 25 08:53:44 2023 daemon.err openwisp-monitoring[4901]: < Referrer-Policy: same-origin
Sat Nov 25 08:53:44 2023 daemon.err openwisp-monitoring[4901]: < X-XSS-Protection: 1; mode=block
Sat Nov 25 08:53:44 2023 daemon.err openwisp-monitoring[4901]: < X-Content-Type-Options: nosniff
Sat Nov 25 08:53:44 2023 daemon.err openwisp-monitoring[4901]: < Referrer-Policy: same-site
Sat Nov 25 08:53:44 2023 daemon.err openwisp-monitoring[4901]: < Content-Security-Policy: default-src http: https: data: blob: 'unsafe-inline'; script-src 'unsafe-eval' https: 'unsafe-inline' 'self'; frame-ancestors 'self'; connect-src https://xxxx.xxxxx.com wss: 'self'; worker-src https://xxxx.xxxxx.com blob: 'self';
Sat Nov 25 08:53:44 2023 daemon.err openwisp-monitoring[4901]: < Permissions-Policy: interest-cohort=()
Sat Nov 25 08:53:44 2023 daemon.err openwisp-monitoring[4901]: < Strict-Transport-Security: max-age=31536000
Sat Nov 25 08:53:44 2023 daemon.err openwisp-monitoring[4901]: <
Sat Nov 25 08:53:44 2023 daemon.info root: Data sent successfully.
Sat Nov 25 08:53:44 2023 daemon.err openwisp-monitoring[4901]: root: Data sent successfully.
Sat Nov 25 08:53:44 2023 daemon.info root: No data file found to send.
Sat Nov 25 08:53:44 2023 daemon.err openwisp-monitoring[4901]: root: No data file found to send.
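To confirm at a glance that the agent's uploads are being accepted, the response status can be pulled out of this kind of verbose curl output. The following is only a minimal sketch: the helper name `response_statuses` is hypothetical, and the sample is an abbreviated copy of the log lines above.

```python
import re

# curl's verbose mode prefixes response header lines with "<",
# so only the response status line matches (not the "> POST ..." request line).
STATUS_RE = re.compile(r"<\s+HTTP/\d(?:\.\d)?\s+(\d{3})")


def response_statuses(log_text):
    """Return the HTTP response status codes found in the log, in order."""
    return [int(match.group(1)) for match in STATUS_RE.finditer(log_text)]


# Abbreviated sample taken from the agent log above.
sample = """\
Sat Nov 25 08:53:44 2023 daemon.err openwisp-monitoring[4901]: > POST /api/v1/monitoring/device/xxx/ HTTP/1.1
Sat Nov 25 08:53:44 2023 daemon.err openwisp-monitoring[4901]: < HTTP/1.1 200 OK
Sat Nov 25 08:53:44 2023 daemon.info root: Data sent successfully.
"""

statuses = response_statuses(sample)
print(statuses)
```

A list containing only 200 would suggest the agent side is healthy, which matches the "Data sent successfully." message in the log.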
nemesifier commented 9 months ago

@momothefox are the monitoring workers running? The ping checks are performed by the server; the agent is not relevant here.

momothefox commented 9 months ago

Yes, it is running. The problem is that older devices, which were added earlier and already have the graphs, are still being updated, but newly added devices no longer get these graphs. Meanwhile the status stays OK all the time, whether the device is reachable or not.

# supervisorctl status
celery                           RUNNING   pid 413835, uptime 5:31:52
celery_firmware_upgrader         RUNNING   pid 413836, uptime 5:31:52
celery_monitoring                RUNNING   pid 413837, uptime 5:31:52
celery_network                   RUNNING   pid 413838, uptime 5:31:52
celerybeat                       RUNNING   pid 413839, uptime 5:31:52
daphne:asgi0                     RUNNING   pid 413952, uptime 5:31:45
daphne:asgi1                     RUNNING   pid 413841, uptime 5:31:52
daphne:asgi2                     RUNNING   pid 413842, uptime 5:31:52
daphne:asgi3                     RUNNING   pid 413843, uptime 5:31:52
daphne:asgi4                     RUNNING   pid 413844, uptime 5:31:52
daphne:asgi5                     RUNNING   pid 413845, uptime 5:31:52
openwisp2                        RUNNING   pid 413846, uptime 5:31:52
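With this many processes, a small script can flag anything that is not in the RUNNING state. This is a rough sketch that only parses text in the format shown above; the function name `not_running` is hypothetical, and the STOPPED line in the sample is invented to show what a flagged entry looks like.

```python
def not_running(status_output):
    """Return the names of supervisor programs whose state is not RUNNING."""
    flagged = []
    for line in status_output.splitlines():
        # Skip blank lines and the shell prompt line ("# supervisorctl status").
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        name, state = parts[0], parts[1]
        if state != "RUNNING":
            flagged.append(name)
    return flagged


# The first two program lines are copied from the output above;
# the STOPPED line is a made-up example.
sample = """\
# supervisorctl status
celery                           RUNNING   pid 413835, uptime 5:31:52
celery_monitoring                RUNNING   pid 413837, uptime 5:31:52
example_worker                   STOPPED   Nov 25 08:00 AM
"""

print(not_running(sample))
```

For the output pasted in this comment the result would be empty, since every process reports RUNNING.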
nemesifier commented 9 months ago

Go to the devices which do not have the ping charts and check the "Checks" tab. Look for the ping check; if it's not there, create it.

This is the code which creates the ping checks when new devices are created:

https://github.com/openwisp/openwisp-monitoring/blob/19163bf6adb2f4d47c7cecf248f74fae90218241/openwisp_monitoring/check/tasks.py#L70-L92

If that code above fails for any reason, the check will not be created.
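Checking the "Checks" tab device by device gets tedious with hundreds of devices, so the comparison could be scripted. The sketch below is deliberately framework-free: `devices_missing_check` is a hypothetical helper that works on plain pairs, and the commented Django shell calls that would feed it assume the `Check` model exposes `object_id` and `check_type` fields and that a `Device` model is available, which may differ between versions.

```python
# Check type string for ping checks; treat it as an assumption and
# verify it against the installed version.
PING_PATH = "openwisp_monitoring.check.classes.Ping"


def devices_missing_check(device_ids, check_rows, check_type=PING_PATH):
    """Return the ids (as strings) of devices that have no check of check_type.

    device_ids: iterable of device primary keys
    check_rows: iterable of (object_id, check_type) pairs
    """
    covered = {str(obj_id) for obj_id, ctype in check_rows if ctype == check_type}
    return sorted(str(dev) for dev in device_ids if str(dev) not in covered)


# In a Django shell the inputs could come from something like (assumed API):
#   check_rows = Check.objects.values_list("object_id", "check_type")
#   device_ids = Device.objects.values_list("pk", flat=True)
# Here plain sample data is used instead.
sample_devices = ["a", "b", "c"]
sample_checks = [
    ("a", PING_PATH),
    ("b", "openwisp_monitoring.check.classes.ConfigApplied"),
]

print(devices_missing_check(sample_devices, sample_checks))
```

Any device id in the result has no ping check and would need one created manually.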

momothefox commented 9 months ago

The checks are there:

[Screenshot: bandicam 2023-11-25 16-22-44-897]

However, the code on my installation differs from the one you referred to:

@shared_task
def auto_create_ping(
    model, app_label, object_id, check_model=None, content_type_model=None
):
    """
    Called by django signal (dispatch_uid: auto_ping)
    registered in check app's apps.py file.
    """
    Check = check_model or get_check_model()
    ping_path = 'openwisp_monitoring.check.classes.Ping'
    has_check = Check.objects.filter(
        object_id=object_id, content_type__model='device', check_type=ping_path
    ).exists()
    # create new check only if necessary
    if has_check:
        return
    content_type_model = content_type_model or ContentType
    ct = content_type_model.objects.get(app_label=app_label, model=model)
    check = Check(
        name='Ping', check_type=ping_path, content_type=ct, object_id=object_id
    )
    check.full_clean()
    check.save()

@shared_task
def auto_create_config_check(
    model, app_label, object_id, check_model=None, content_type_model=None
):
    """
    Called by openwisp_monitoring.check.models.auto_config_check_receiver
    """
    Check = check_model or get_check_model()
    config_check_path = 'openwisp_monitoring.check.classes.ConfigApplied'
    has_check = Check.objects.filter(
        object_id=object_id, content_type__model='device', check_type=config_check_path
    ).exists()
    # create new check only if necessary
    if has_check:
        return
    content_type_model = content_type_model or ContentType
    ct = content_type_model.objects.get(app_label=app_label, model=model)
    check = Check(
        name='Configuration Applied',
        check_type=config_check_path,
        content_type=ct,
        object_id=object_id,
    )
    check.full_clean()
    check.save()

I am using the Ansible role installation for production.

nemesifier commented 9 months ago

The difference in the code is probably due to a different version.

Check the monitoring log in /opt/openwisp2 and ensure it's doing something.

Is the ping not working for all the devices or only some of them?

momothefox commented 9 months ago

The celery-monitoring.log shows all the time that everything is fine:

INFO/MainProcess] Task openwisp_monitoring.check.tasks.perform_check[xxxxx] received
INFO/ForkPoolWorker-2] Task openwisp_monitoring.check.tasks.perform_check[xxxxx] succeeded in 0.024974617990665138s: None
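"Succeeded" only means the task returned without raising, so it may be worth looking at how long each `perform_check` run took. The sketch below extracts task names, ids and durations from lines in the format shown above; the helper names and the 0.1 second threshold are hypothetical, and whether a given duration is plausible for a real ping is a judgment call rather than something the log states.

```python
import re

# Matches Celery's "Task <name>[<id>] succeeded in <seconds>s" log message.
SUCCEEDED_RE = re.compile(
    r"Task (?P<name>[\w.]+)\[(?P<id>[^\]]+)\] succeeded in (?P<secs>[0-9.eE+-]+)s"
)


def task_durations(log_text):
    """Return (task_name, task_id, seconds) for every 'succeeded' line."""
    return [
        (m["name"], m["id"], float(m["secs"]))
        for m in SUCCEEDED_RE.finditer(log_text)
    ]


def faster_than(durations, threshold_seconds):
    """Return the entries that finished in less than threshold_seconds."""
    return [entry for entry in durations if entry[2] < threshold_seconds]


# Sample lines copied from the log excerpt above.
sample = """\
INFO/MainProcess] Task openwisp_monitoring.check.tasks.perform_check[xxxxx] received
INFO/ForkPoolWorker-2] Task openwisp_monitoring.check.tasks.perform_check[xxxxx] succeeded in 0.024974617990665138s: None
"""

durations = task_durations(sample)
# 0.1 s is an arbitrary illustrative threshold, not a documented value.
print(faster_than(durations, 0.1))
```

Comparing the durations for a device that has charts against one that does not could show whether the check is doing the same amount of work in both cases.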

Some devices are working as expected, others are not.

momothefox commented 9 months ago

@nemesifier Should I suspect hardware performance?

nemesifier commented 9 months ago

@momothefox I suspect some DB inconsistency rather than that. You could try deleting and recreating one of the devices that aren't being pinged to see if anything changes.

momothefox commented 9 months ago

> You could try deleting and recreating one of these devices which aren't pinged to see if anything changes.

I did already. No matter how many times I delete and recreate the device, it never gets these graphs, while other devices keep drawing them. Are there any limits? I have 500+ devices; 360 of them are monitored, while the rest have no monitoring agent.

nemesifier commented 9 months ago

There aren't any limits at application level. I am not sure what is going on in your case.

momothefox commented 9 months ago

Nothing like this?

https://github.com/openwisp/ansible-openwisp2/issues/431#issuecomment-1504078406

Also, I am using SQLite3, not PostgreSQL or MySQL. Is that related to monitoring in any way?
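Since a database inconsistency was suggested earlier in the thread, one cheap thing to rule out on an SQLite setup is file-level corruption, using SQLite's built-in `PRAGMA integrity_check`. This is only a sketch: the function name is hypothetical, the real database path depends on the installation, and the pragma checks the file's structure, not application-level consistency, so a clean result does not rule out the problem discussed here.

```python
import sqlite3


def sqlite_integrity(db_path):
    """Run SQLite's built-in integrity check and return the reported messages."""
    conn = sqlite3.connect(db_path)
    try:
        return [row[0] for row in conn.execute("PRAGMA integrity_check")]
    finally:
        conn.close()


# Demonstrated on an in-memory database; replace ":memory:" with the
# path to the actual database file (installation-specific).
result = sqlite_integrity(":memory:")
print(result)
```

A healthy database reports a single "ok" row; anything else lists the problems SQLite found.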

momothefox commented 9 months ago

Also, the Health Status remains OK whatever the condition of the device is.

nemesifier commented 9 months ago

I am sorry, I do not know what is wrong with your system, and I have no way to verify or replicate this problem. If you think it's a bug, please provide instructions on how to replicate it, even if only tentative. We use GitHub issues for bug tracking only, not for support requests. Please use the support chat if you have further questions.