naemon / naemon-core

Networks, Applications and Event Monitor
http://www.naemon.io/
GNU General Public License v2.0
151 stars 63 forks source link

External Command RESTART_PROCESS not working #133

Closed sni closed 8 years ago

sni commented 8 years ago

On some machines the RESTART_PROCESS external command just does nothing.

The debug log does not contain any useful hints:

[1464018735.672096] [128.2] [pid=8021] Raw command entry: [1464018735] RESTART_PROCESS
[1464018735.672397] [064.1] [pid=8021] Making callbacks (type 2)...
[1464018735.672758] [064.2] [pid=8021] Callback #1 (type 2) return code = 0
[1464018735.672871] [064.1] [pid=8021] Making callbacks (type 17)...
[1464018735.673011] [064.2] [pid=8021] Callback #1 (type 17) return code = 0
[1464018735.673120] [064.1] [pid=8021] Making callbacks (type 17)...
[1464018735.673246] [064.2] [pid=8021] Callback #1 (type 17) return code = 0
[1464018735.673430] [4096.2] [pid=8021] ## 1 descriptors had input
[1464018735.673527] [008.0] [pid=8021] Event was cancelled by iobroker input
[1464018735.686729] [4096.2] [pid=8021] ## 1 descriptors had input
[1464018735.686909] [008.0] [pid=8021] Event was cancelled by iobroker input
[1464018735.688423] [4096.2] [pid=8021] ## 1 descriptors had input
[1464018735.688690] [008.0] [pid=8021] Event was cancelled by iobroker input
[1464018735.690305] [4096.2] [pid=8021] ## 1 descriptors had input
[1464018735.690477] [008.0] [pid=8021] Event was cancelled by iobroker input
[1464018737.193456] [4096.2] [pid=8021] ## 0 descriptors had input
[1464018738.695988] [4096.2] [pid=8021] ## 0 descriptors had input
[1464018740.198133] [4096.2] [pid=8021] ## 0 descriptors had input
[1464018741.700197] [4096.2] [pid=8021] ## 0 descriptors had input
[1464018743.202594] [4096.2] [pid=8021] ## 0 descriptors had input
[1464018744.704749] [4096.2] [pid=8021] ## 0 descriptors had input

It basically looks the same on a machine where it is working. The error is reproducable on these machines, altought they do not seem to have any relation. The failed test vms are Fedora 23 (32bit) Sles 11SP2 (64bit) and Ubuntu16.04 (32bit). All other test VMs are ok.

catharsis commented 8 years ago

That is peculiar... How is the command submitted? Via the queryhandler or the command pipe? If the former, do you get any failure results back? If the latter, could you try with the former and report back what results you get?

catharsis commented 8 years ago

For what it's worth, I'm not able to reproduce this on Sles 11SP4 (I know, not the same), running in a docker:

sh-3.2# tail -f /var/log/naemon/naemon.log &
[1] 11385
sh-3.2# echo "[0] RESTART_PROGRAM" > /var/lib/naemon/naemon.cmd 
sh-3.2# [1464064529] EXTERNAL COMMAND: RESTART_PROGRAM;
[1464064529] Event broker module 'NERD' deinitialized successfully.
[1464064529] Warning: enable_environment_macros is deprecated and will be removed.
[1464064529] Naemon 1.0.4-source starting... (PID=11331)
[1464064529] Local time is Tue May 24 04:35:29 UTC 2016
[1464064529] LOG VERSION: 2.0
[1464064529] qh: Socket '/var/lib/naemon/naemon.qh' successfully initialized
[1464064529] nerd: Channel hostchecks registered successfully
[1464064529] nerd: Channel servicechecks registered successfully
[1464064529] nerd: Fully initialized and ready to rock!
[1464064529] wproc: Successfully registered manager as @wproc with query handler
[1464064529] wproc: Registry request: name=Core Worker 11387;pid=11387
[1464064529] wproc: Registry request: name=Core Worker 11388;pid=11388
[1464064529] wproc: Registry request: name=Core Worker 11390;pid=11390
[1464064529] wproc: Registry request: name=Core Worker 11389;pid=11389
[1464064529] wproc: Registry request: name=Core Worker 11391;pid=11391
[1464064529] wproc: Registry request: name=Core Worker 11392;pid=11392
sni commented 8 years ago

The command is submitted by Thruk via Livestatus which ends in the queryhandler. I also tried the command file or the query handler directly. All with the same result.

qh> #command run [0] RESTART_PROCESS
200: OK
qh>
catharsis commented 8 years ago

@sni Just to clarify, neither of them causes Naemon to actually restart?

sni commented 8 years ago

exactly

catharsis commented 8 years ago

I wish I was able to reproduce this. I'll have a look at the code to see if I can get any hints, but I really doubt that will get me anywhere... Maybe I'll find some log level that could be enabled that might provide a clue.

Have you tried any of the previous builds? It would be very useful to get at least a hint of when the problem was introduced... If we could pinpoint an exact commit by doing a git bisect or something, that would be even better ;)

edit: Oh, also - log_external_commands is on right? But you're not seeing a [1464090888] EXTERNAL COMMAND: RESTART_PROGRAM; in your naemon.log?

sni commented 8 years ago

for the record... log_external_commands is enabled and i can see the [1464095320] EXTERNAL COMMAND: RESTART_PROGRAM; in the logs

catharsis commented 8 years ago

This also affects the SHUTDOWN_PROGRAM command, which makes me believe that this (somewhat surprisingly) has something to do with the event scheduler. I'll keep looking.

catharsis commented 8 years ago

This is fixed by the inclusion of f10ee7b653d7d81cac1231856de3643effaeda05 and 21c9601390a2a029b7bc4c2e48416c7eaf1960ff.