shinken-solutions / shinken

Flexible and scalable monitoring framework
http://www.shinken-monitoring.org
GNU Affero General Public License v3.0
1.13k stars 336 forks source link

Poller is not picking up the Queued tasks, the Host and Service checks are getting timed out. #2032

Open sjose1x opened 6 months ago

sjose1x commented 6 months ago

Hi All, I'm blocked due to failure with Shinken monitoring.., someone please help me on this... I can provide all information's and logs for sorting out this issue.

with OS upgrade, we are planning to run container based Shinken components for monitoring Windows 2008, 2012 and 2019 nodes. individual containers are been created for all the components say Arbiter, Broker, Poller, Scheduler...etc

Configured Webui and Thruk for UI

All the configurations are good, and connectivity is there, but I could notice that the Hosts and Service checks are getting timed out.., after been in Pending state for quite some time... moreover these checks are not executing on time as well...

Enabled Debug log level and I could see queued up tasks are not getting picked by Poller.. Could not understand the reason behind it..

But when I logged in to the Poller-container and executed the command manually it got executed instantly..

System Details

OS - Alma 9.3 CPU - 4 RAM - 8GB Shinken Version - 2.4.3

Even though the system has sufficient resources and only a few simple services are there to check, still the Poller service is consuming most of the CPU but the tasks are not getting done.

In the Poller log Queued up tasks are not getting picked up

[1710406970] DEBUG: [Shinken] Debug perf: ping [args:5.00679016113e-06] [aqu_lock:0.0][calling:3.09944152832e-06] [json:8.10623168945e-06] [global:1.62124633789e-05]
[1710406970] DEBUG: [Shinken] HTTP: calling lock for get_broks
[1710406970] DEBUG: [Shinken] Debug perf: get_broks [args:4.00543212891e-05] [aqu_lock:0.000130891799927][calling:5.50746917725e-05] [json:7.86781311035e-06] [global:0.000233888626099]
[1710406971] DEBUG: [Shinken] Debug perf: ping [args:4.05311584473e-06] [aqu_lock:9.53674316406e-07][calling:2.14576721191e-06] [json:1.00135803223e-05] [global:1.71661376953e-05]
[1710406971] DEBUG: [Shinken] Debug perf: get_external_commands [args:5.00679016113e-06] [aqu_lock:9.53674316406e-07][calling:2.40802764893e-05] [json:9.77516174316e-06] [global:3.981590271e-05]
[1710406971] DEBUG: [Shinken] Debug perf: ping [args:3.09944152832e-06] [aqu_lock:9.53674316406e-07][calling:1.90734863281e-06] [json:8.10623168945e-06] [global:1.4066696167e-05]
[1710406971] DEBUG: [Shinken] HTTP: calling lock for get_broks
[1710406971] DEBUG: [Shinken] Debug perf: get_broks [args:5.48362731934e-05] [aqu_lock:0.0001540184021][calling:5.50746917725e-05] [json:1.00135803223e-05] [global:0.000273942947388]
[1710406972] DEBUG: [Shinken]  ========================
[1710406972] DEBUG: [Shinken] [0][shinken-scheduler-1][fork] Stats: Workers:0 (Queued:154 TotalReturnWait:0)
[1710406972] DEBUG: [Shinken] [0][shinken-scheduler-1][fork] Stats: Workers:1 (Queued:0 TotalReturnWait:0)
[1710406972] DEBUG: [Shinken] [0][shinken-scheduler-1][fork] Stats: Workers:2 (Queued:4 TotalReturnWait:0)
[1710406972] DEBUG: [Shinken] [0][shinken-scheduler-1][fork] Stats: Workers:3 (Queued:16 TotalReturnWait:0)
[1710406972] DEBUG: [Shinken] I decide to up wait ratio
[1710406972] DEBUG: [Shinken] Wait ratio: 2.040918
[1710406972] DEBUG: [Shinken] [shinken-poller-1] Trying to adjust worker number. Actual number : 4, min per module : 4, max per module : 4
[1710406972] DEBUG: [Shinken] Ask actions to 0, got 0
[1710406972] DEBUG: [Shinken] Loop turn
[1710406972] DEBUG: [Shinken] Debug perf: ping [args:5.00679016113e-06] [aqu_lock:9.53674316406e-07][calling:3.09944152832e-06] [json:8.82148742676e-06] [global:1.78813934326e-05]
[1710406972] DEBUG: [Shinken] Debug perf: get_external_commands [args:4.05311584473e-06] [aqu_lock:9.53674316406e-07][calling:2.59876251221e-05] [json:9.05990600586e-06] [global:4.00543212891e-05]
[1710406972] DEBUG: [Shinken] Debug perf: ping [args:3.81469726562e-06] [aqu_lock:0.0][calling:2.14576721191e-06] [json:1.78813934326e-05] [global:2.38418579102e-05]
[1710406972] DEBUG: [Shinken] HTTP: calling lock for get_broks
[1710406972] DEBUG: [Shinken] Debug perf: get_broks [args:3.91006469727e-05] [aqu_lock:0.000120878219604][calling:5.50746917725e-05] [json:8.10623168945e-06] [global:0.000223159790039]
[1710406973] DEBUG: [Shinken] Debug perf: ping [args:2.86102294922e-06] [aqu_lock:7.15255737305e-06][calling:1.90734863281e-06] [json:1.00135803223e-05] [global:2.19345092773e-05]
[1710406973] DEBUG: [Shinken] Debug perf: get_external_commands [args:5.96046447754e-06] [aqu_lock:9.53674316406e-07][calling:3.09944152832e-05] [json:1.19209289551e-05] [global:4.98294830322e-05]
[1710406973] DEBUG: [Shinken] Debug perf: ping [args:3.81469726562e-06] [aqu_lock:9.53674316406e-07][calling:2.14576721191e-06] [json:1.09672546387e-05] [global:1.78813934326e-05]
[1710406973] DEBUG: [Shinken] HTTP: calling lock for get_broks
[1710406973] DEBUG: [Shinken] Debug perf: get_broks [args:5.07831573486e-05] [aqu_lock:0.000138998031616][calling:5.50746917725e-05] [json:1.00135803223e-05] [global:0.00025486946106]
[1710406974] DEBUG: [Shinken]  ========================

Scheduler DEBUG logs

[1710407544] DEBUG: [Shinken] Debug perf: ping [args:5.00679016113e-06] [aqu_lock:9.53674316406e-07][calling:2.14576721191e-06] [json:1.09672546387e-05] [global:1.90734863281e-05]
[1710407544] DEBUG: [Shinken] HTTP: calling lock for get_broks
[1710407544] DEBUG: [Shinken] Debug perf: get_broks [args:3.88622283936e-05] [aqu_lock:0.0001380443573][calling:5.19752502441e-05] [json:7.15255737305e-06] [global:0.000236034393311]
[1710407545] DEBUG: [Shinken] Debug perf: ping [args:4.05311584473e-06] [aqu_lock:0.0
2024-03-14 05:12:27,044 DEBG 'shinken-scheduler' stdout output:
][calling:2.86102294922e-06] [json:9.05990600586e-06] [global:1.59740447998e-05]
[1710407545] DEBUG: [Shinken] HTTP: calling lock for get_checks
[1710407545] DEBUG: [Shinken] Debug perf: get_checks [args:8.58306884766e-05] [aqu_lock:0.00012993812561][calling:4.50611114502e-05] [json:5.96046447754e-06] [global:0.000266790390015]
[1710407545] DEBUG: [Shinken] Debug perf: put_results [args:0.000179052352905] [aqu_lock:0.0][calling:1.4066696167e-05] [json:1.50203704834e-05] [global:0.000208139419556]
[1710407545] DEBUG: [Shinken] Load: (sleep) 1.00 (average: 1.00) -> 0%
[1710407545] DEBUG: [Shinken] Time to send 0 broks (after 0 secs)
[1710407545] DEBUG: [Shinken] Checks: total 7, scheduled 0,inpoller 7, zombies 0, notifications 0
[1710407545] DEBUG: [Shinken] Latency (avg/min/max): 1435.98/1.88/2870.07
[1710407545] DEBUG: [Shinken] Check average = 0 checks/s
[1710407546] DEBUG: [Shinken] Debug perf: ping [args:4.05311584473e-06] [aqu_lock:9.53674316406e-07][calling:2.86102294922e-06] [json:1.00135803223e-05] [global:1.78813934326e-05]
[1710407546] DEBUG: [Shinken] HTTP: calling lock for get_broks
[1710407546] DEBUG: [Shinken] Debug perf: get_broks [args:3.69548797607e-05] [aqu_lock:0.000128984451294][calling:4.81605529785e-05] [json:5.96046447754e-06] [global:0.000220060348511]
[1710407546] DEBUG: [Shinken] Debug perf: ping [args:4.05311584473e-06] [aqu_lock:9.53674316406e-07][calling:2.14576721191e-06] [json:1.00135803223e-05] [global:1.71661376953e-05]
[1710407546] DEBUG: [Shinken] HTTP: calling lock for get_checks
[1710407546] DEBUG: [Shinken] Debug perf: get_checks [args:7.89165496826e-05] [aqu_lock:0.000149011611938][calling:4.81605529785e-05] [json:6.91413879395e-06] [global:0.000283002853394]
[1710407546] DEBUG: [Shinken] Debug perf: put_results [args:0.000194072723389] [aqu_lock:9.53674316406e-07][calling:1.50203704834e-05] [json:1.59740447998e-05] [global:0.000226020812988]
[1710407546] DEBUG: [Shinken] Debug perf: ping [args:4.05311584473e-06] [aqu_lock:9.53674316406e-07][calling:1.90734863281e-06] [json:1.09672546387e-05] [global:1.78813934326e-05]
[1710407546] DEBUG: [Shinken] HTTP: calling lock for get_checks
[1710407546] DEBUG: [Shinken] Debug perf: get_checks [args:9.67979431152e-05] [aqu_lock:0.0001380443573][calling:6.60419464111e-05] [json:8.10623168945e-06] [global:0.000308990478516]
[1710407546] DEBUG: [Shinken] Debug perf: put_results [args:0.000180959701538] [aqu_lock:0.0][calling:1.4066696167e-05] [json:1.50203704834e-05] [global:0.000210046768188]
[1710407546] DEBUG: [Shinken] Load: (sleep) 1.00 (average: 1.00) -> 0%
[1710407546] DEBUG: [Shinken] Time to send 0 broks (after 0 secs)
[1710407546] DEBUG: [Shinken] Checks: total 7, scheduled 0,inpoller 7, zombies 0, notifications 0
[1710407546] DEBUG: [Shinken] Latency (avg/min/max): 1435.98/1.88/2870.07
[1710407546] DEBUG: [Shinken] Check average = 0 checks/s
[1710407546] DEBUG: [Shinken] Debug perf: ping [args:2.86102294922e-06] [aqu_lock:1.19209289551e-06][calling:2.86102294922e-06] [json:1.09672546387e-05] [global:1.78813934326e-05]
[1710407546] DEBUG: [Shinken] The arbiter asked me what I manage. It's {0: 128736}
[1710407546] DEBUG: [Shinken] Debug perf: what_i_managed [args:3.81469726562e-06] [aqu_lock:1.19209289551e-06][calling:0.000140905380249] [json:2.09808349609e-05] [global:0.000166893005371]
[1710407547] DEBUG: [Shinken] Debug perf: ping [args:6.91413879395e-06] [aqu_lock:1.90734863281e-06][calling:3.09944152832e-06] [json:1.78813934326e-05] [global:2.98023223877e-05]
[1710407547] DEBUG: [Shinken] HTTP: calling lock for get_checks
[1710407547] DEBUG: [Shinken] Debug perf: get_checks [args:8.70227813721e-05] [aqu_lock:0.000164985656738][calling:4.81605529785e-05] [json:5.96046447754e-06] [global:0.000306129455566]
[1710407547] DEBUG: [Shinken] Debug perf: ping [args:4.

In the Scheduler log I could see these WARNING

docker logs shinken-scheduler-1 |grep WARNING

[1710405731] WARNING: [Shinken] 4 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710405791] WARNING: [Shinken] 4 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710405851] WARNING: [Shinken] 4 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710405911] WARNING: [Shinken] 6 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710405971] WARNING: [Shinken] 6 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710406031] WARNING: [Shinken] 6 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710406091] WARNING: [Shinken] 6 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710406151] WARNING: [Shinken] 6 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710406211] WARNING: [Shinken] 6 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710406932] WARNING: [Shinken] 6 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710406992] WARNING: [Shinken] 5 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710407052] WARNING: [Shinken] 5 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710407112] WARNING: [Shinken] 5 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling
[1710407173] WARNING: [Shinken] 5 actions never came back for the satellite 'shinken-poller-1'.I reenable them for polling

Scheduler log

docker logs shinken-scheduler-1 |grep 'Checks: total'

[1710407422] DEBUG: [Shinken] Checks: total 7, scheduled 0,inpoller 7, zombies 0, notifications 0
[1710407423] DEBUG: [Shinken] Checks: total 7, scheduled 0,inpoller 7, zombies 0, notifications 0
[1710407424] DEBUG: [Shinken] Checks: total 7, scheduled 0,inpoller 7, zombies 0, notifications 0
[1710407425] DEBUG: [Shinken] Checks: total 7, scheduled 0,inpoller 7, zombies 0, notifications 0
[1710407426] DEBUG: [Shinken] Checks: total 7, scheduled 0,inpoller 7, zombies 0, notifications 0
[1710407427] DEBUG: [Shinken] Checks: total 7, scheduled 0,inpoller 7, zombies 0, notifications 0
[root@localhost docker-shinken]# docker ps
CONTAINER ID   IMAGE                              COMMAND                  CREATED       STATUS       PORTS                                                            NAMES
3ab294d19b53   thews/shinken-receiver:latest      "/bin/sh -c 'supervi…"   2 hours ago   Up 2 hours   7773/tcp                                                         shinken-receiver-1
cd1b61bf032e   thews/shinken-thruk:latest         "/bin/sh -c 'supervi…"   2 hours ago   Up 2 hours   0.0.0.0:8000->80/tcp, :::8000->80/tcp                            docker-shinken-shinken-thruk-1
42b05cd36e91   thews/shinken-poller:latest        "/bin/sh -c 'supervi…"   2 hours ago   Up 2 hours   7771/tcp                                                         shinken-poller-1
cbf7ae4065a8   thews/shinken-arbiter:latest       "/bin/sh -c 'supervi…"   2 hours ago   Up 2 hours   7770/tcp                                                         shinken-arbiter-1
33d919544257   thews/shinken-reactionner:latest   "/bin/sh -c 'supervi…"   2 hours ago   Up 2 hours   7769/tcp                                                         shinken-reactionner-1
1b4e9f25fcec   thews/shinken-scheduler:latest     "/bin/sh -c 'supervi…"   2 hours ago   Up 2 hours   7768/tcp                                                         shinken-scheduler-1
b91b9d835c0a   thews/shinken-broker:latest        "/bin/sh -c 'supervi…"   2 hours ago   Up 2 hours   7772/tcp, 50000/tcp, 0.0.0.0:8001->7767/tcp, :::8001->7767/tcp   shinken-broker-1
docker stats

CONTAINER ID   NAME                             CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O     PIDS
3ab294d19b53   shinken-receiver-1               0.80%     42.84MiB / 7.493GiB   0.56%     46.4MB / 42.4MB   0B / 62.7MB   6
cd1b61bf032e   docker-shinken-shinken-thruk-1   0.02%     133.7MiB / 7.493GiB   1.74%     301kB / 839kB     0B / 1.2MB    16
42b05cd36e91   shinken-poller-1                 390.03%   75.05MiB / 7.493GiB   0.98%     52.3MB / 50.8MB   0B / 66.4MB   51
cbf7ae4065a8   shinken-arbiter-1                0.51%     45.05MiB / 7.493GiB   0.59%     94.3MB / 107MB    0B / 1.77MB   8
33d919544257   shinken-reactionner-1            0.96%     49.39MiB / 7.493GiB   0.64%     64.4MB / 68MB     0B / 71.3MB   10
1b4e9f25fcec   shinken-scheduler-1              0.74%     44.74MiB / 7.493GiB   0.58%     70.4MB / 59.4MB   0B / 86.3MB   6
b91b9d835c0a   shinken-broker-1                 1.19%     77.4MiB / 7.493GiB    1.01%     110MB / 109MB     0B / 76.7MB   13

@geektophe Could you please help me on this.. ;)

sjose1x commented 5 months ago

I could locate the issue as in, the docker is not support multiprocessing... Poller uses multiprocessing for running worker.. need to figure out how to mitigate this

Workers are running, but work.start() is getting stuck.. from there on no progress