Closed — desertwitch closed this issue 10 months ago
As far as I know, with at least classic UDP-based SNMP, the lack of a server and a wrong password are indistinguishable: you fire a best-effort UDP packet, no reply comes in, that's it. After an arbitrary timeout you decide to give up, and do not know if the server is unavailable, or chose not to reply (bad auth, stressed, etc.), or either packet got lost in transit... There is no "session" whose state you can track to gauge the success of the attempt, like with TCP. More or less the same can be seen with standard tools like `snmpwalk`.
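This ambiguity can be demonstrated with a minimal UDP probe (a sketch in Python, not real SNMP encoding; the TEST-NET address is just an example of a host that will never answer):

```python
import socket

def udp_probe(host, port=161, timeout=1.0):
    """Fire one datagram and wait for any reply. A None result could mean a
    dead host, a dropped packet, or a server that silently discarded the
    request (e.g. a wrong SNMP community) -- the caller cannot tell which."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(b"\x00", (host, port))  # placeholder payload, not a real SNMP PDU
        return sock.recvfrom(512)
    except (socket.timeout, OSError):
        return None                         # silence, whatever the reason
    finally:
        sock.close()

# 192.0.2.1 is TEST-NET-1 (RFC 5737), guaranteed not to answer:
print(udp_probe("192.0.2.1", timeout=0.5))  # None -- same result as a bad password
```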
Just to clarify: in the environments you work with, are init-scripts still the "king of the day" (so e.g. NUT drivers' long init times can be blocking the appearance of interactive console login), or are service frameworks like systemd or SMF available? In particular, what I'd expect of "frameworks" includes:
So how much are you limited by requiring `upsdrvctl` as a single program to start all drivers successfully and in one shot (or fail on any hiccup) as a unitary step in the system boot-up?
Regarding configurability for this, check https://github.com/networkupstools/nut/blob/master/conf/ups.conf.sample - in particular: `maxretry`, `retrydelay` and the overall `maxstartdelay`.
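For illustration, a hypothetical `ups.conf` fragment combining these options might look like this (values and the device address are made-up examples, not recommendations):

```
maxretry = 3
retrydelay = 5
maxstartdelay = 45

[myups]
	driver = snmp-ups
	port = 192.0.2.10
	desc = "Example SNMP-managed UPS"
```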
In the trench experience of the 42ity project for massive-monitoring appliances, the burst of driver inits could overwhelm the appliance CPU to the point that it timed out every driver across the 3-minute boot allowance. We ended up making a special systemd unit that wrapped the `nut-driver@instancename.service` logic to keep those per-driver units off by default and to "manually" start them in batches of 30 or so.
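The batching idea can be sketched roughly like this (an illustrative Python helper, not the actual 42ity unit; the unit-name template and batch size follow the description above, the function itself is hypothetical):

```python
import subprocess
import time

def start_in_batches(driver_names, batch_size=30, settle=10, runner=subprocess.run):
    """Start nut-driver@<name>.service units in batches so a burst of driver
    inits does not overwhelm the CPU; 'runner' is injectable for testing."""
    batches = [driver_names[i:i + batch_size]
               for i in range(0, len(driver_names), batch_size)]
    for batch in batches:
        units = ["nut-driver@%s.service" % name for name in batch]
        runner(["systemctl", "start", "--no-block", *units], check=True)
        time.sleep(settle)  # let one batch settle before launching the next
    return batches
```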
Thanks for getting back to me in light speed (again), Jim.
> As far as I know, with at least classic UDP-based SNMP, the lack of a server and a wrong password are indistinguishable: you fire a best-effort UDP packet, no reply comes in, that's it. After an arbitrary timeout you decide to give up, and do not know if the server is unavailable, or chose not to reply (bad auth, stressed, etc.), or either packet got lost in transit... There is no "session" whose state you can track to gauge the success of the attempt, like with TCP. More or less the same can be seen with standard tools like `snmpwalk`.
This is also what I've gathered from my investigations into this - so it seems to be a cul-de-sac (protocol-wise).
> Just to clarify: in the environments you work with, are init-scripts still the "king of the day" (so e.g. NUT drivers' long init times can be blocking the appearance of interactive console login), or are service frameworks like systemd or SMF available? In particular, what I'd expect of "frameworks" includes:
Init-scripts - which makes one get creative with such things... :-)
> So how much are you limited by requiring `upsdrvctl` as a single program to start all drivers successfully and in one shot (or fail on any hiccup) as a unitary step in the system boot-up?
Not massively, but it's certainly a convenience not having to evaluate which drivers my end-users have configured. In any case I'd say most have max. 2 configured, if not just 1, so that's pretty much a non-issue (starting up too many at once).
> Regarding configurability for this, check https://github.com/networkupstools/nut/blob/master/conf/ups.conf.sample - in particular: `maxretry`, `retrydelay` and the overall `maxstartdelay`.
I had the feeling I missed something, sorry about that... But reading up on it does make me wonder about that timeout.
```
# maxstartdelay: OPTIONAL. This can be set as a global variable
#                above your first UPS definition and it can also be
#                set in a UPS section. This value controls how long
#                upsdrvctl will wait for the driver to finish starting.
#                This keeps your system from getting stuck due to a
#                broken driver or UPS.
#                The default is 45 seconds.
```
The default (when not explicitly set) seems to be 45 seconds, resulting (probably) in this log line (45 seconds after start):
Oct 28 02:47:16 Tower root: Startup timer elapsed, continuing...
Following that logic, shouldn't the SNMP driver have been killed (here) after these 45 seconds elapsed?
Looking at the code, it seems it would retry (3 times by default in the `ups.conf.sample`, 1 in code), so forking off new drivers. In fact, there does not seem to be any reaping for drivers that just timed out but did not die during `forkexec()` - though usually they should avoid stepping on each other's toes by killing off an earlier instance themselves.
At least, if the timeouts you see are like 3*45 sec - that may be it.
> Looking at the code, it seems it would retry (3 times by default in the `ups.conf.sample`, 1 in code), so forking off new drivers. In fact, there does not seem to be any reaping for drivers that just timed out but did not die during `forkexec()` - though usually they should avoid stepping on each other's toes by killing off an earlier instance themselves. At least, if the timeouts you see are like 3*45 sec - that may be it.
I'll investigate some more into this, because I don't have `maxretry` set in my `ups.conf`.
So it should default to 1 (default `maxretry`) * 45 (default `maxstartdelay`) = 45 seconds.
Those 45 seconds being the first and only time that log line appears:
Oct 28 02:47:16 Tower root: Startup timer elapsed, continuing...
But even with 3 (`maxretry`) * 45 (`maxstartdelay`) = ~2.25 minutes (2.5 minutes with 3x the default `retrydelay` of 5s)... something doesn't add up (if my logic is not flawed somehow here... which is always possible 😆)
```
# maxretry: OPTIONAL. Specify the number of attempts to start the driver(s),
#           in case of failure, before giving up. A delay of 'retrydelay' is
#           inserted between each attempt. Caution should be taken when using
#           this option, since it can impact the time taken by your system to
#           start.
#
#           The built-in default is 1 attempt.
```
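Under this reading (each attempt may block for up to `maxstartdelay`, with `retrydelay` inserted between attempts), the worst-case time spent on one driver can be estimated like this (my own back-of-envelope sketch, not code from NUT):

```python
def worst_case_start(maxretry=1, maxstartdelay=45, retrydelay=5):
    """Rough upper bound in seconds: every attempt blocks up to maxstartdelay,
    with retrydelay inserted between consecutive attempts."""
    return maxretry * maxstartdelay + (maxretry - 1) * retrydelay

print(worst_case_start())            # 45 -- the defaults discussed here
print(worst_case_start(maxretry=3))  # 145 -- about 2.4 minutes
```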
In my logs it seems the driver is only started once, and kills itself after the 4 minutes, rather than being killed by any `upsdrvctl` timeouts - so perhaps `upsdrvctl` is already considering the driver started at that point, but more likely it doesn't kill it as it should after this log line: `Oct 28 02:47:16 Tower root: Startup timer elapsed, continuing...`. That is under the assumption that "continuing..." in this case means continuing to the next defined UPS (none, in my case) rather than continuing with the already running driver despite the "timer elapsed".
Diving into the code, all these cases (except for the first) already have a dead fork at that point (and no need to terminate it).
Upstream, the function evaluating the error counter doesn't terminate anything; it just checks whether to spawn new forks based on the options and whether an error was encountered (which it was, just not one resulting in fork termination by its own fatality): https://github.com/networkupstools/nut/blob/858db30fef2e6854b114eeb1306381b59c05f848/drivers/upsdrvctl.c#L893-L903
So it seems the fork never gets killed (because in all other cases it's already dead), so perhaps needs explicit killing here? https://github.com/networkupstools/nut/blob/858db30fef2e6854b114eeb1306381b59c05f848/drivers/upsdrvctl.c#L668-L672
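The behaviour in question could be sketched like this (illustrative Python, not upsdrvctl's actual C code; parameter names mirror the ups.conf options):

```python
import subprocess

def start_driver(cmd, maxstartdelay=45, kill_on_timeout=False):
    """Wait up to maxstartdelay seconds for a child; on timeout either abandon
    it (the current behaviour, as read from the code above) or terminate it
    (the explicit killing suggested here)."""
    child = subprocess.Popen(cmd)
    try:
        child.wait(timeout=maxstartdelay)
        return child.returncode          # the driver finished (or died) in time
    except subprocess.TimeoutExpired:
        if kill_on_timeout:
            child.kill()                 # SIGKILL -- SIGTERM may not suffice here
            child.wait()
        return None                      # "Startup timer elapsed, continuing..."
```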
I've tested this some more and it seems my assumption is right that an elapsed maxstartdelay does report an error to the upstream fork-spawning function (resulting in spawning of additional forks when maxretry > 1) but not in termination of the current time-elapsed fork. With the default setting of maxretry=1 the maxstartdelay basically does elapse (and does log it), no more forks get spawned (as intended), but that driver instance (fork) is never effectively terminated and runs forever until self-termination (as seen in the logs).
Usually not a problem, because I'd assume most other drivers would exit of their own accord way before the timer elapses (and not need killing in the process), but with SNMP that seems to be the case (it needs killing, which it doesn't seem to get).
But `maxstartdelay` seems to have no killing effect (only the fork-spawning one) in the present implementation - not as (probably) intended, at least.
P.S. For the SNMP driver a SIGTERM didn't seem to suffice at that stage (I patched one in for test purposes); it needed a SIGKILL to effectively kill the fork after the timeout elapsed.
My expectation was that the newly forked+execed driver would find (e.g. by PID file) that a predecessor exists, and would kill it then. On the `upsdrvctl` side that previous attempt (if it did not fork off into the background again - something the debug flags can prevent by default) would probably become a zombie child process to be reaped explicitly (or eventually as `upsdrvctl` itself ends and the OS reaps its resources).
As for a SIGTERM - I think it is evaluated between loops in `main.c`, so if the driver is stuck in some time-consuming method for initialization, it might at best raise the "I'm terminated" flag with the signal handler, but never get to check/process it...
> My expectation was that the newly forked+execed driver would find (e.g. by PID file) that a predecessor exists, and would kill it then. On the `upsdrvctl` side that previous attempt (if it did not fork off into the background again - something the debug flags can prevent by default) would probably become a zombie child process to be reaped explicitly (or eventually as `upsdrvctl` itself ends and the OS reaps its resources).
> As for a SIGTERM - I think it is evaluated between loops in `main.c`, so if the driver is stuck in some time-consuming method for initialization, it might at best raise the "I'm terminated" flag with the signal handler, but never get to check/process it...
That it does, but with a default maxretry=1 a second fork (killing the first one by detection) is never spawned (as intended), and the first fork just keeps running way past our maxstartdelay until (possible) self-termination (as it's never killed by a successive fork). The same applies to the last of multiple maxretry attempts, as there's no successive fork killing it - the last one keeps running.
So methinks every such timeout error (whether or not it results in further fork-spawning) should finish with a termination of the current (departing but possibly still running) fork, and not rely on a successive fork to terminate it. Otherwise the option won't have its intended effect, as the last fork (of maxretry value X) is never terminated by the timeout - just the intermediate ones.
I think the idea is to try and start the driver after all :) So if it takes too long, `upsdrvctl` takes note and will probably return an error exit-code eventually, but marches on to start another driver (if there are many), while the attempted driver's forked process is on its own and no longer monitored to succeed or fail the start-up.
It would be counter-productive here to be pedantic and kill drivers just because they did not talk to some remote honeypot quickly enough.
BTW if wrapping into systemd/SMF units, keep in mind the framework settings for a process to start up (90 sec by default in systemd) and possibly tune that up for the slower remote devices. Especially the SNMP NMCs tended to time out or reboot under stress for us (when e.g. 30 clients bombard the same poor card with a full walk as they boot up).
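For instance, a hypothetical systemd drop-in raising the start timeout for a slow SNMP device could look like this (path and value are examples only; 90 seconds is systemd's default mentioned above):

```
# /etc/systemd/system/nut-driver@.service.d/timeout.conf
[Service]
TimeoutStartSec=300
```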
> I think the idea is to try and start the driver after all :) So if it takes too long, `upsdrvctl` takes note and will probably return an error exit-code eventually, but marches on to start another driver (if there are many), while the attempted driver's forked process is on its own and no longer monitored to succeed or fail the start-up. It would be counter-productive here to be pedantic and kill drivers just because they did not talk to some remote honeypot quickly enough.
> BTW if wrapping into systemd/SMF units, keep in mind the framework settings for a process to start up (90 sec by default in systemd) and possibly tune that up for the slower remote devices. Especially the SNMP NMCs tended to time out or reboot under stress for us (when e.g. 30 clients bombard the same poor card with a full walk as they boot up).
Makes sense when you put it like that. I'm good with keeping this as-is, with likely only one not-immediately-error-conscious driver affected. I wonder if it'd make sense to put a side-note in the `maxstartdelay` setting description that the timeout is intended more towards ensuring successive driver startup than towards cleaning up stale driver instances. But I'm good as far as this issue is concerned... thanks for the time as usual, Jim, and sorry if it caused any inconvenience.
A docs update may be worthwhile. After all, it is from outside that the missing bits are best visible :)
Thanks for the commit, one last thing I was wondering about when reading your side-note:
> Note that after this time upsdrvctl would just move along with its business (whether retrying the same driver if `maxretry>1`, or trying another driver if starting them all, or just exiting); however the most recently started "stuck" driver process may be further initializing in the background, and might even succeed eventually.
`upsdrvctl` doesn't actually exit until the SNMP driver self-terminates after the 4 minutes.
It doesn't let the driver do its hopeless thing in the background either, as I understand from the above paragraph.
That is with 1 driver configured and default settings of maxretry=1, maxstartdelay=45 seconds - this is still "normal", right?
Not trying to pick words or be overly pedantic, just sanity-checking we were talking about the same thing. :-)
Well, normally the drivers initialize, fork to background (unless told not to), and begin the infinite polling loop. So if the driver would succeed eventually - its thing should not be hopeless :)
Then it depends on the system around it. If e.g. `upsdrvctl` returns a non-zero (error) exit-code because it had some hiccups, some systemd might restart the unit anyway. If the timings are wrong for the practical situation, I guess it might even "boot-loop" the nut-driver(s) unit this way...
Not sure what to state about upsdrvctl not exiting until child processes actually finish (fail or fork)... I suppose, that is correct (and notes can be further reworded about "eventual" exit). For a start of single driver might as well wait and pass its result code.
> Not sure what to state about upsdrvctl not exiting until child processes actually finish (fail or fork)... I suppose, that is correct (and notes can be further reworded about "eventual" exit). For a start of single driver might as well wait and pass its result code.
All good then, many thanks for going down that rabbit hole with me. As far as I'm concerned, issue can be closed since there seems little room for improvement SNMP-wise.
... and again, thanks for keeping the NUT-wheel spinning for us! ;-)
Update about not blocking the system startup due to `upsdrvctl` - check the `nowait` flag too.
@desertwitch : do you think solvable issues here have been solved?
I guess that for SNMP over UDP the inability to discern un-handled requests from dead hosts is more or less fundamental, so we have to try all detection OIDs etc. and come to conclusions after receiving no replies.
Looking at `nut_snmp_init()`, we do not currently customize ports (non-161) nor protocol types (TCP); perhaps lack-of-host could be more visible with TCP - if the device talks that, and more so if it does not :) But I guess someone with actual devices to tap would have to implement this, as well as the config file syntax changes.
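To illustrate why TCP could help: a connect attempt has three distinguishable outcomes, unlike a UDP send (a hypothetical sketch; the real `snmp-ups` would need net-snmp transport-spec support instead):

```python
import socket

def tcp_probe(host, port=161, timeout=1.0):
    """Unlike UDP, a TCP connect distinguishes an answering service from a
    live-but-closed port -- only the 'unreachable' case stays ambiguous."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"            # something is listening
    except ConnectionRefusedError:
        return "refused"             # host alive, no listener on that port
    except OSError:
        return "unreachable"         # dead host, filter, or lost packets
```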
Yep, I'm good here - thanks - feel free to close!
CC @arnaudquette-eaton @dzomaya : got any interest in enhancing SNMP low-level support per comments above?
I guess I'll offload the specific idea into the issue tracker as a separate thing.
Reading yesterday's e-mail in the mailing list regarding a more or less imminent 2.8.1. release, I've decided to do some more testing. I've actually run into this problem (with a stuck system boot) before but haven't been able to trace it back to NUT and the SNMP driver.
At present the SNMP driver will start time-intensive, futile attempts to walk all kinds of MIBs on target hosts which are non-existent or do not even have an SNMP server listening. It will spend almost 4 minutes attempting SNMP operations on a dead/unresponsive target host before determining that no connection is possible or no readable SNMP is available, and exiting.
This can (especially at system boot and with home-brewed startup scripts or badly implemented service environments) cause an unwanted delay in the whole boot routine, made much harder to trace by the fact that the `snmp-ups` driver itself does not present any log messages (only debug > 1) during these futile attempts until it eventually gives up after the ~4 minutes (in my case).
Regarding `upsdrvctl`, I've - in particular - noticed this line in the log file (see below for full log):
Is it intended that `upsdrvctl` lets the driver continue (which at present it does) after this point, or should it kill it? I have RTFM but I didn't find a startup timer option after which (I presume) `upsdrvctl` should give up on a driver. If there isn't one - would this be a useful addition worth looking into for `upsdrvctl`?
Regarding `snmp-ups`: not sure how, but possibly we can (sanity-)check if the host and/or SNMP ports are connectable on driver launch? If they aren't, then it seems it doesn't make much sense to run all these time- and network-intensive SNMP operations.
These are (debug level 4) log messages from the driver attempting to connect to a non-existent IP on my network: