Closed jimklimov closed 3 months ago
# gdb -p 2232101
GNU gdb (Debian 10.1-1.7) 10.1.90.20210103-git
...
Attaching to process 2232101
[New LWP 2232132]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f5ce9f07e23 in __GI___select (nfds=nfds@entry=16,
readfds=readfds@entry=0x7fff66ef2ca0, writefds=writefds@entry=0x0,
exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x7fff66ef2c70)
at ../sysdeps/unix/sysv/linux/select.c:41
41 ../sysdeps/unix/sysv/linux/select.c: No such file or directory.
(gdb) bt
#0 0x00007f5ce9f07e23 in __GI___select (nfds=nfds@entry=16,
readfds=readfds@entry=0x7fff66ef2ca0, writefds=writefds@entry=0x0,
exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x7fff66ef2c70)
at ../sysdeps/unix/sysv/linux/select.c:41
#1 0x00005576e42132af in dstate_poll_fds (timeout=..., arg_extrafd=-1)
at dstate.c:1065
#2 0x00005576e4203700 in main (argc=<optimized out>, argv=<optimized out>)
at main.c:2697
(gdb) q
I could hear the debugger attach: the traced process stalled for a minute, and the CPU fans went quiet...
Seems indeed like a socket closed on the other side.
Checking whether this can be a case of a writer having prepared its data but the (kernel?) buffer not yet being ready to deliver it, especially when the CPU is too busy to make progress due to this tight cycle (per one of those discussions).
Commit 3d3f41bbe is from March, and there have been no changes to dstate.c since, so the bug is present in at least the NUT v2.8.2 release.
Got a rough guess then that this is fallout from the work on extending the Unix socket protocol to send signals (reload etc.): if a sibling driver process (or upsdrvctl) connects to the socket file, spews "go reload" and exits, and the running daemon does not close such sockets, it would accumulate such "is ready/got nothing" sockets and tight polling loops. If this is the case, the bug has probably been there since forever, and restarting upsd while a driver is alive would reproduce it too.
The new feature #1903 (et al.) just exposes it more (the old lifecycle of the daemon did not deal with many socket connections/disconnections), and upsd probably has ways around this, since it deals with a population of drivers coming and going (or does it also have the bug?..)
Indeed, sending a reload command via socket increases the open file descriptor count on the driver daemon:
# ls -la /proc/2232101/fd
total 0
dr-x------ 2 root root 0 Jun 19 23:15 .
dr-xr-xr-x 9 nut nut 0 Apr 19 16:13 ..
lr-x------ 1 root root 64 Jun 19 23:15 0 -> /dev/null
lrwx------ 1 root root 64 Jun 19 23:15 1 -> 'socket:[608865694]'
lrwx------ 1 root root 64 Jun 19 23:15 10 -> 'socket:[608870464]'
lrwx------ 1 root root 64 Jun 19 23:15 11 -> 'socket:[608877633]'
lrwx------ 1 root root 64 Jun 19 23:15 12 -> 'socket:[608880779]'
lrwx------ 1 root root 64 Jun 19 23:15 13 -> 'socket:[608880826]'
lrwx------ 1 root root 64 Jun 19 23:15 14 -> 'socket:[608985678]'
lrwx------ 1 root root 64 Jun 19 23:15 15 -> 'socket:[608985715]'
lrwx------ 1 root root 64 Jun 20 09:29 16 -> 'socket:[2054060831]'
lrwx------ 1 root root 64 Jun 19 23:15 2 -> 'socket:[608865694]'
lrwx------ 1 root root 64 Jun 19 23:15 3 -> 'socket:[608864985]'
lrwx------ 1 root root 64 Jun 19 23:15 4 -> 'socket:[608864991]'
lrwx------ 1 root root 64 Jun 19 23:15 5 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jun 19 23:15 6 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jun 19 23:15 7 -> 'anon_inode:[timerfd]'
lrwx------ 1 root root 64 Jun 19 23:15 8 -> /dev/bus/usb/003/002
lrwx------ 1 root root 64 Jun 19 23:15 9 -> 'socket:[608864012]'
# /usr/local/ups/bin/usbhid-ups -c reload -a eco650
Network UPS Tools - Generic HID driver 0.53 (2.8.2-113-g3d3f41bbe)
USB communication driver (libusb 1.0) 0.48
Request for driver to reload-or-error returned code 0
upsnotify: notify about state 4 with libsystemd: was requested, but not running as a service unit now, will not spam more about it
upsnotify: failed to notify about state 4: no notification tech defined, will not spam more about it
# ls -la /proc/2232101/fd
total 0
dr-x------ 2 root root 0 Jun 19 23:15 .
dr-xr-xr-x 9 nut nut 0 Apr 19 16:13 ..
lr-x------ 1 root root 64 Jun 19 23:15 0 -> /dev/null
lrwx------ 1 root root 64 Jun 19 23:15 1 -> 'socket:[608865694]'
lrwx------ 1 root root 64 Jun 19 23:15 10 -> 'socket:[608870464]'
lrwx------ 1 root root 64 Jun 19 23:15 11 -> 'socket:[608877633]'
lrwx------ 1 root root 64 Jun 19 23:15 12 -> 'socket:[608880779]'
lrwx------ 1 root root 64 Jun 19 23:15 13 -> 'socket:[608880826]'
lrwx------ 1 root root 64 Jun 19 23:15 14 -> 'socket:[608985678]'
lrwx------ 1 root root 64 Jun 19 23:15 15 -> 'socket:[608985715]'
lrwx------ 1 root root 64 Jun 20 09:29 16 -> 'socket:[2054060831]'
lrwx------ 1 root root 64 Jun 20 09:30 17 -> 'socket:[2054065463]'
lrwx------ 1 root root 64 Jun 19 23:15 2 -> 'socket:[608865694]'
lrwx------ 1 root root 64 Jun 19 23:15 3 -> 'socket:[608864985]'
lrwx------ 1 root root 64 Jun 19 23:15 4 -> 'socket:[608864991]'
lrwx------ 1 root root 64 Jun 19 23:15 5 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jun 19 23:15 6 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jun 19 23:15 7 -> 'anon_inode:[timerfd]'
lrwx------ 1 root root 64 Jun 19 23:15 8 -> /dev/bus/usb/003/002
lrwx------ 1 root root 64 Jun 19 23:15 9 -> 'socket:[608864012]'
And upsd does not suffer from this situation: killing a driver closes the server-side file descriptor:
# ps -ef | grep '/ups/'
nut 2232101 1 54 Apr19 ? 33-21:03:41 /usr/local/ups/bin/usbhid-ups -FF -a eco650
nut 2232567 1 0 Apr19 ? 00:06:36 /usr/local/ups/sbin/upsd -FF
root 2232570 1 0 Apr19 ? 00:00:00 /usr/local/ups/sbin/upsmon -F
nut 2232571 2232570 0 Apr19 ? 00:01:41 /usr/local/ups/sbin/upsmon -F
nut 3096367 3003888 44 09:34 pts/9 00:00:57 /usr/local/ups/bin/dummy-ups -a dummy -D
root 3104521 3003888 0 09:36 pts/9 00:00:00 grep /ups/
# ls -la /proc/2232567/fd
total 0
dr-x------ 2 root root 0 Jun 20 02:10 .
dr-xr-xr-x 9 nut nut 0 Apr 19 16:13 ..
lr-x------ 1 root root 64 Jun 20 02:10 0 -> /dev/null
lrwx------ 1 root root 64 Jun 20 02:10 1 -> 'socket:[608868642]'
lrwx------ 1 root root 64 Jun 20 02:10 2 -> 'socket:[608868642]'
lrwx------ 1 root root 64 Jun 20 02:10 3 -> 'socket:[608868643]'
lrwx------ 1 root root 64 Jun 20 02:10 4 -> 'socket:[608868646]'
lrwx------ 1 root root 64 Jun 20 02:10 5 -> 'socket:[608868647]'
lrwx------ 1 root root 64 Jun 20 02:10 6 -> 'socket:[608868651]'
lrwx------ 1 root root 64 Jun 20 02:10 7 -> 'socket:[608868657]'
lrwx------ 1 root root 64 Jun 20 09:35 8 -> 'socket:[2054123072]'
# kill 3096367
Signal 15: exiting
[1]+ Done /usr/local/ups/bin/dummy-ups -a dummy -D
# ls -la /proc/2232567/fd
total 0
dr-x------ 2 root root 0 Jun 20 02:10 .
dr-xr-xr-x 9 nut nut 0 Apr 19 16:13 ..
lr-x------ 1 root root 64 Jun 20 02:10 0 -> /dev/null
lrwx------ 1 root root 64 Jun 20 02:10 1 -> 'socket:[608868642]'
lrwx------ 1 root root 64 Jun 20 02:10 2 -> 'socket:[608868642]'
lrwx------ 1 root root 64 Jun 20 02:10 3 -> 'socket:[608868643]'
lrwx------ 1 root root 64 Jun 20 02:10 4 -> 'socket:[608868646]'
lrwx------ 1 root root 64 Jun 20 02:10 5 -> 'socket:[608868647]'
lrwx------ 1 root root 64 Jun 20 02:10 6 -> 'socket:[608868651]'
lrwx------ 1 root root 64 Jun 20 02:10 7 -> 'socket:[608868657]'
UPDATE: also checked with kill -9, so the driver has no chance to announce it is exiting.
For a driver this is easy to reproduce, using the -c reload feature from #1906:
# ls -la /proc/3117986/fd
total 0
dr-x------ 2 root root 0 Jun 20 09:42 .
dr-xr-xr-x 9 nut nut 0 Jun 20 09:41 ..
lrwx------ 1 root root 64 Jun 20 09:42 0 -> /dev/null
lrwx------ 1 root root 64 Jun 20 09:42 1 -> /dev/null
lrwx------ 1 root root 64 Jun 20 09:42 2 -> /dev/null
lrwx------ 1 root root 64 Jun 20 09:42 3 -> 'socket:[2054195331]'
lr-x------ 1 root root 64 Jun 20 09:42 4 -> /etc/nut/dummy.dev
lrwx------ 1 root root 64 Jun 20 09:42 5 -> 'socket:[2054195349]'
lrwx------ 1 root root 64 Jun 20 09:42 6 -> 'socket:[2054195372]'
# /usr/local/ups/bin/dummy-ups -a dummy -c reload
Network UPS Tools - Device simulation and repeater driver 0.19 (2.8.2-113-g3d3f41bbe)
Request for driver to reload-or-error returned code 0
upsnotify: notify about state 4 with libsystemd: was requested, but not running as a service unit now, will not spam more about it
upsnotify: failed to notify about state 4: no notification tech defined, will not spam more about it
# ls -la /proc/3117986/fd
total 0
dr-x------ 2 root root 0 Jun 20 09:42 .
dr-xr-xr-x 9 nut nut 0 Jun 20 09:41 ..
lrwx------ 1 root root 64 Jun 20 09:42 0 -> /dev/null
lrwx------ 1 root root 64 Jun 20 09:42 1 -> /dev/null
lrwx------ 1 root root 64 Jun 20 09:42 2 -> /dev/null
lrwx------ 1 root root 64 Jun 20 09:42 3 -> 'socket:[2054195331]'
lr-x------ 1 root root 64 Jun 20 09:42 4 -> /etc/nut/dummy.dev
lrwx------ 1 root root 64 Jun 20 09:42 5 -> 'socket:[2054195349]'
lrwx------ 1 root root 64 Jun 20 09:42 6 -> 'socket:[2054195372]'
lrwx------ 1 root root 64 Jun 20 09:42 7 -> 'socket:[2054195476]'
...so an eventual fix would be easy to check.
Hm, O_NDELAY vs. O_NONBLOCK: gotta check which one we use (the former is per older standards (pre-1988!) and causes read() to return 0, indiscernible from EOF; the latter is more informative): https://comp.unix.programmer.narkive.com/cFN0PRBQ/select-rtns-ready-read-returns-0#post12
Oh...
$ git grep O_NDELAY
clients/upssched.c: ret = fcntl(acc, F_SETFL, ret | O_NDELAY);
clients/upssched.c: upslog_with_errno(LOG_ERR, "fcntl set O_NDELAY on unix fd failed");
clients/upssched.c: /* O_NDELAY with zero bytes means nothing to read but
drivers/clone-outlet.c: ret = fcntl(fd, F_SETFL, ret | O_NDELAY);
drivers/clone-outlet.c: upslog_with_errno(LOG_ERR, "fcntl set O_NDELAY on UPS [%s] failed", device_path);
drivers/clone.c: ret = fcntl(fd, F_SETFL, ret | O_NDELAY);
drivers/clone.c: upslog_with_errno(LOG_ERR, "fcntl set O_NDELAY on UPS [%s] failed", device_path);
drivers/dstate.c: ret = fcntl(fd, F_SETFL, ret | O_NDELAY);
drivers/dstate.c: upslog_with_errno(LOG_ERR, "fcntl set O_NDELAY on unix fd failed");
drivers/upsdrvquery.c: if (fcntl(conn->sockfd, F_SETFL, ret | O_NDELAY) < 0) {
drivers/upsdrvquery.c: upslog_with_errno(LOG_ERR, "fcntl set O_NDELAY on driver socket %s failed", sockfn);
server/sockdebug.c: ret = fcntl(fd, F_SETFL, ret | O_NDELAY);
server/sstate.c: ret = fcntl(fd, F_SETFL, ret | O_NDELAY);
server/sstate.c: upslog_with_errno(LOG_ERR, "fcntl set O_NDELAY on UPS [%s] failed", ups->name);
server/upsd.c: if (fcntl(sock_fd, F_SETFL, v | O_NDELAY) == -1) {
$ git grep O_NONBLOCK
NEWS.adoc: - The liebert driver now uses O_NONBLOCK, and should now work
clients/nutclient.cpp: fd_flags |= O_NONBLOCK;
clients/nutclient.cpp: fd_flags &= ~O_NONBLOCK;
clients/upsclient.c: fd_flags |= O_NONBLOCK;
clients/upsclient.c: fd_flags &= ~O_NONBLOCK;
drivers/apcupsd-ups.c: fd_flags |= O_NONBLOCK;
drivers/apcupsd-ups.c: upsdebugx(1,"unexpected fcntl(fd, F_SETFL, fd_flags|O_NONBLOCK) failure");
drivers/belkinunv.c: fd = open(device, O_RDWR | O_NONBLOCK);
drivers/serial.c: fd = open(port, O_RDWR | O_NOCTTY | O_EXCL | O_NONBLOCK);
# Hack for WIN32: gotta have something
drivers/serial.h:# define O_NONBLOCK 0
As some posts say, on many systems the two definitions are identical. In practice, a quick replacement in dstate.c and upsdrvquery.c did not help (the latter actually checks for if (upsdrvquery_read_timeout(conn, tv) < 1) goto socket_error; cases).
The upssched.c code offers an example of dealing with this already:
The room was noisy and warm. PC investigation led to some discoveries:
strace shows intensive looping (about 15k lines per second) of what seem to be empty reads of a bit too many sockets (e.g. some detached wrongly, so zombying around?), while the log bumped to verbosity 6 does not show anything about these polls and is not that intensive (just detailed USB data updates, a few screens per second).
top excerpt:
CPU load over the two months it has been running consistently shows about one core always busy.
Debugging is not enabled (it was the prime suspect):