networkupstools / nut

The Network UPS Tools repository. UPS management protocol Informational RFC 9271 published by IETF at https://www.rfc-editor.org/info/rfc9271 Please star NUT on GitHub, this helps with sponsorships!
https://networkupstools.org/
Other
1.99k stars 349 forks source link

`usbhid-ups` sucks CPU in a tight `select` loop (2.8.2+ later branch build) #2484

Closed jimklimov closed 3 months ago

jimklimov commented 3 months ago

The room was noisy and warm. PC investigation led to some discoveries:

strace shows intensive looping (about 15k lines per second) for what seems to be empty reads of a bit too many sockets (e.g. some detached wrongly so zombying around?) while log bumped to verbosity 6 does not show anything about these polls and is not that intensive (just detailed usb data updates, a few screens per sec)

select(16, [9 10 11 12 13 14 15], NULL, NULL, {tv_sec=0, tv_usec=783537}) = 5 (in [11 12 13 14 15], left {tv_sec=0, tv_usec=783536})
read(15, "", 512)                       = 0
read(14, "", 512)                       = 0
read(13, "", 512)                       = 0
read(12, "", 512)                       = 0
read(11, "", 512)                       = 0
select(16, [9 10 11 12 13 14 15], NULL, NULL, {tv_sec=0, tv_usec=783468}) = 5 (in [11 12 13 14 15], left {tv_sec=0, tv_usec=783467})
read(15, "", 512)                       = 0
read(14, "", 512)                       = 0
read(13, "", 512)                       = 0
read(12, "", 512)                       = 0
read(11, "", 512)                       = 0
select(16, [9 10 11 12 13 14 15], NULL, NULL, {tv_sec=0, tv_usec=783399}) = 5 (in [11 12 13 14 15], left {tv_sec=0, tv_usec=783398})
read(15, "", 512)                       = 0
read(14, "", 512)                       = 0
^Cstrace: Process 2232101 detached
 <detached ...>

root@pve:~# ls -la /proc/2232101/fd
total 0
dr-x------ 2 root root  0 Jun 19 23:15 .
dr-xr-xr-x 9 nut  nut   0 Apr 19 16:13 ..
lr-x------ 1 root root 64 Jun 19 23:15 0 -> /dev/null
lrwx------ 1 root root 64 Jun 19 23:15 1 -> 'socket:[608865694]'
lrwx------ 1 root root 64 Jun 19 23:15 10 -> 'socket:[608870464]'
lrwx------ 1 root root 64 Jun 19 23:15 11 -> 'socket:[608877633]'
lrwx------ 1 root root 64 Jun 19 23:15 12 -> 'socket:[608880779]'
lrwx------ 1 root root 64 Jun 19 23:15 13 -> 'socket:[608880826]'
lrwx------ 1 root root 64 Jun 19 23:15 14 -> 'socket:[608985678]'
lrwx------ 1 root root 64 Jun 19 23:15 15 -> 'socket:[608985715]'
lrwx------ 1 root root 64 Jun 19 23:15 2 -> 'socket:[608865694]'
lrwx------ 1 root root 64 Jun 19 23:15 3 -> 'socket:[608864985]'
lrwx------ 1 root root 64 Jun 19 23:15 4 -> 'socket:[608864991]'
lrwx------ 1 root root 64 Jun 19 23:15 5 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jun 19 23:15 6 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jun 19 23:15 7 -> 'anon_inode:[timerfd]'
lrwx------ 1 root root 64 Jun 19 23:15 8 -> /dev/bus/usb/003/002
lrwx------ 1 root root 64 Jun 19 23:15 9 -> 'socket:[608864012]'

top excerpt:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
     2232101 nut       20   0   16412   4108   3536 R  93.8   0.0  48437:18 usbhid-ups

CPU Load over the 2 months it runs is consistently around 1 core always busy. Temperature CPU Load

# systemctl status nut-driver@eco650 -l | cat
● nut-driver@eco650.service - Network UPS Tools - device driver for NUT device 'eco650'
     Loaded: loaded (/lib/systemd/system/nut-driver@.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/nut-driver@eco650.service.d
             └─nut-driver-enumerator-generated-checksum.conf, nut-driver-enumerator-generated-devicename.conf, nut-driver-enumerator-generated.conf
     Active: active (running) since Fri 2024-04-19 16:13:53 CEST; 2 months 0 days ago
   Main PID: 2232101 (usbhid-ups)
      Tasks: 2 (limit: 154419)
     Memory: 1.1M
        CPU: 1month 3d 4h 48min 41.608s
     CGroup: /system.slice/system-nut\x2ddriver.slice/nut-driver@eco650.service
             └─2232101 /usr/local/ups/bin/usbhid-ups -FF -a eco650

Warning: journal has been rotated since unit was started, output may be incomplete.

Debugging is not enabled (was prime suspect):

:; upsc eco650
battery.charge: 100
battery.charge.low: 20
battery.runtime: 1268
battery.type: PbAc
device.mfr: EATON
device.model: Ellipse ECO 650
device.serial: 000000000
device.type: ups
driver.debug: 0
driver.flag.allow_killpower: 0
driver.name: usbhid-ups
driver.parameter.bus: 003
driver.parameter.pollfreq: 30
driver.parameter.pollinterval: 2
driver.parameter.port: auto
driver.parameter.product: Ellipse ECO
driver.parameter.productid: FFFF
driver.parameter.serial: 000000000
driver.parameter.synchronous: auto
driver.parameter.vendor: EATON
driver.parameter.vendorid: 0463
driver.state: quiet
driver.version: 2.8.2-113-g3d3f41bbe
driver.version.data: MGE HID 1.46
driver.version.internal: 0.53
driver.version.usb: libusb-1.0.24 (API: 0x1000108)
input.transfer.high: 264
input.transfer.low: 161
outlet.1.desc: PowerShare Outlet 1
outlet.1.id: 2
outlet.1.status: on
outlet.1.switchable: yes
outlet.2.desc: PowerShare Outlet 2
outlet.2.id: 3
outlet.2.status: on
outlet.2.switchable: yes
outlet.desc: Main Outlet
outlet.id: 1
outlet.power: 25
outlet.switchable: no
output.frequency.nominal: 50
output.voltage: 230.0
output.voltage.nominal: 230
ups.beeper.status: disabled
ups.delay.shutdown: 20
ups.delay.start: 30
ups.firmware: 02
ups.load: 19
ups.mfr: EATON
ups.model: Ellipse ECO 650
ups.power.nominal: 650
ups.productid: ffff
ups.realpower: 99
ups.serial: 000000000
ups.status: OL
ups.timer.shutdown: -1
ups.timer.start: -1
ups.vendorid: 0463
jimklimov commented 3 months ago
# gdb -p 2232101
GNU gdb (Debian 10.1-1.7) 10.1.90.20210103-git
...
Attaching to process 2232101
[New LWP 2232132]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f5ce9f07e23 in __GI___select (nfds=nfds@entry=16,
    readfds=readfds@entry=0x7fff66ef2ca0, writefds=writefds@entry=0x0,
    exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x7fff66ef2c70)
    at ../sysdeps/unix/sysv/linux/select.c:41
41      ../sysdeps/unix/sysv/linux/select.c: No such file or directory.
(gdb) bt
#0  0x00007f5ce9f07e23 in __GI___select (nfds=nfds@entry=16,
    readfds=readfds@entry=0x7fff66ef2ca0, writefds=writefds@entry=0x0,
    exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x7fff66ef2c70)
    at ../sysdeps/unix/sysv/linux/select.c:41
#1  0x00005576e42132af in dstate_poll_fds (timeout=..., arg_extrafd=-1)
    at dstate.c:1065
#2  0x00005576e4203700 in main (argc=<optimized out>, argv=<optimized out>)
    at main.c:2697
(gdb) q

I've heard the debugger start! The traced process was stalled for a minute, and CPU fans went off...

jimklimov commented 3 months ago

Seems indeed like a socket closet on the other side:

Checking if this can be a case of a writer preparing their data but the buffer (kernel?) not yet ready to spit it out, especially when CPU is too busy to work due to this tight cycle (per one of those discussions).

Commit 3d3f41bbe is from March and there were no changes to dstate.c since. So the bug is part of at least NUT v2.8.2 release.

Got a rough guess then that this is a fallout of work on extending Unix socket protocol to send signals (reload etc.) - if a sibling driver process (or upsdrvctl) connects to the socket file, spews "go reload" and exits - and the running daemon does not close such sockets, it would add such "is ready/got nothing" sockets and tight polling loops. If this is the case, probably the bug is there since forever, and restarting upsd while a driver is alive would reproduce it too.

The new feature #1903 (et al) just exposes it more (old lifecycle of the daemon did not deal with many socket connections/disconnections), and probably upsd does have ways around that since it deals with a population of drivers coming and going (or also has the bug?..)

jimklimov commented 3 months ago

Indeed, sending a reload command via socket increases the open file descriptor count on the driver daemon:

# ls -la /proc/2232101/fd
total 0
dr-x------ 2 root root  0 Jun 19 23:15 .
dr-xr-xr-x 9 nut  nut   0 Apr 19 16:13 ..
lr-x------ 1 root root 64 Jun 19 23:15 0 -> /dev/null
lrwx------ 1 root root 64 Jun 19 23:15 1 -> 'socket:[608865694]'
lrwx------ 1 root root 64 Jun 19 23:15 10 -> 'socket:[608870464]'
lrwx------ 1 root root 64 Jun 19 23:15 11 -> 'socket:[608877633]'
lrwx------ 1 root root 64 Jun 19 23:15 12 -> 'socket:[608880779]'
lrwx------ 1 root root 64 Jun 19 23:15 13 -> 'socket:[608880826]'
lrwx------ 1 root root 64 Jun 19 23:15 14 -> 'socket:[608985678]'
lrwx------ 1 root root 64 Jun 19 23:15 15 -> 'socket:[608985715]'
lrwx------ 1 root root 64 Jun 20 09:29 16 -> 'socket:[2054060831]'
lrwx------ 1 root root 64 Jun 19 23:15 2 -> 'socket:[608865694]'
lrwx------ 1 root root 64 Jun 19 23:15 3 -> 'socket:[608864985]'
lrwx------ 1 root root 64 Jun 19 23:15 4 -> 'socket:[608864991]'
lrwx------ 1 root root 64 Jun 19 23:15 5 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jun 19 23:15 6 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jun 19 23:15 7 -> 'anon_inode:[timerfd]'
lrwx------ 1 root root 64 Jun 19 23:15 8 -> /dev/bus/usb/003/002
lrwx------ 1 root root 64 Jun 19 23:15 9 -> 'socket:[608864012]'

# /usr/local/ups/bin/usbhid-ups -c reload -a eco650
Network UPS Tools - Generic HID driver 0.53 (2.8.2-113-g3d3f41bbe)
USB communication driver (libusb 1.0) 0.48
Request for driver to reload-or-error returned code 0
upsnotify: notify about state 4 with libsystemd: was requested, but not running as a service unit now, will not spam more about it
upsnotify: failed to notify about state 4: no notification tech defined, will not spam more about it

# ls -la /proc/2232101/fd
total 0
dr-x------ 2 root root  0 Jun 19 23:15 .
dr-xr-xr-x 9 nut  nut   0 Apr 19 16:13 ..
lr-x------ 1 root root 64 Jun 19 23:15 0 -> /dev/null
lrwx------ 1 root root 64 Jun 19 23:15 1 -> 'socket:[608865694]'
lrwx------ 1 root root 64 Jun 19 23:15 10 -> 'socket:[608870464]'
lrwx------ 1 root root 64 Jun 19 23:15 11 -> 'socket:[608877633]'
lrwx------ 1 root root 64 Jun 19 23:15 12 -> 'socket:[608880779]'
lrwx------ 1 root root 64 Jun 19 23:15 13 -> 'socket:[608880826]'
lrwx------ 1 root root 64 Jun 19 23:15 14 -> 'socket:[608985678]'
lrwx------ 1 root root 64 Jun 19 23:15 15 -> 'socket:[608985715]'
lrwx------ 1 root root 64 Jun 20 09:29 16 -> 'socket:[2054060831]'
lrwx------ 1 root root 64 Jun 20 09:30 17 -> 'socket:[2054065463]'
lrwx------ 1 root root 64 Jun 19 23:15 2 -> 'socket:[608865694]'
lrwx------ 1 root root 64 Jun 19 23:15 3 -> 'socket:[608864985]'
lrwx------ 1 root root 64 Jun 19 23:15 4 -> 'socket:[608864991]'
lrwx------ 1 root root 64 Jun 19 23:15 5 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jun 19 23:15 6 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jun 19 23:15 7 -> 'anon_inode:[timerfd]'
lrwx------ 1 root root 64 Jun 19 23:15 8 -> /dev/bus/usb/003/002
lrwx------ 1 root root 64 Jun 19 23:15 9 -> 'socket:[608864012]'
jimklimov commented 3 months ago

And upsd does not suffer this situation, killing a driver closes the server-side file descriptor:

# ps -ef | grep '/ups/'
nut      2232101       1 54 Apr19 ?        33-21:03:41 /usr/local/ups/bin/usbhid-ups -FF -a eco650
nut      2232567       1  0 Apr19 ?        00:06:36 /usr/local/ups/sbin/upsd -FF
root     2232570       1  0 Apr19 ?        00:00:00 /usr/local/ups/sbin/upsmon -F
nut      2232571 2232570  0 Apr19 ?        00:01:41 /usr/local/ups/sbin/upsmon -F
nut      3096367 3003888 44 09:34 pts/9    00:00:57 /usr/local/ups/bin/dummy-ups -a dummy -D
root     3104521 3003888  0 09:36 pts/9    00:00:00 grep /ups/

# ls -la /proc/2232567/fd
total 0
dr-x------ 2 root root  0 Jun 20 02:10 .
dr-xr-xr-x 9 nut  nut   0 Apr 19 16:13 ..
lr-x------ 1 root root 64 Jun 20 02:10 0 -> /dev/null
lrwx------ 1 root root 64 Jun 20 02:10 1 -> 'socket:[608868642]'
lrwx------ 1 root root 64 Jun 20 02:10 2 -> 'socket:[608868642]'
lrwx------ 1 root root 64 Jun 20 02:10 3 -> 'socket:[608868643]'
lrwx------ 1 root root 64 Jun 20 02:10 4 -> 'socket:[608868646]'
lrwx------ 1 root root 64 Jun 20 02:10 5 -> 'socket:[608868647]'
lrwx------ 1 root root 64 Jun 20 02:10 6 -> 'socket:[608868651]'
lrwx------ 1 root root 64 Jun 20 02:10 7 -> 'socket:[608868657]'
lrwx------ 1 root root 64 Jun 20 09:35 8 -> 'socket:[2054123072]'

# kill 3096367
Signal 15: exiting
[1]+  Done                    /usr/local/ups/bin/dummy-ups -a dummy -D

# ls -la /proc/2232567/fd
total 0
dr-x------ 2 root root  0 Jun 20 02:10 .
dr-xr-xr-x 9 nut  nut   0 Apr 19 16:13 ..
lr-x------ 1 root root 64 Jun 20 02:10 0 -> /dev/null
lrwx------ 1 root root 64 Jun 20 02:10 1 -> 'socket:[608868642]'
lrwx------ 1 root root 64 Jun 20 02:10 2 -> 'socket:[608868642]'
lrwx------ 1 root root 64 Jun 20 02:10 3 -> 'socket:[608868643]'
lrwx------ 1 root root 64 Jun 20 02:10 4 -> 'socket:[608868646]'
lrwx------ 1 root root 64 Jun 20 02:10 5 -> 'socket:[608868647]'
lrwx------ 1 root root 64 Jun 20 02:10 6 -> 'socket:[608868651]'
lrwx------ 1 root root 64 Jun 20 02:10 7 -> 'socket:[608868657]'

UPDATE: also checked with kill -9 so the driver has no chance to say it is exiting.

jimklimov commented 3 months ago

For a driver is easy to reproduce, using the -c reload feature from #1906 :

# ls -la /proc/3117986/fd
total 0
dr-x------ 2 root root  0 Jun 20 09:42 .
dr-xr-xr-x 9 nut  nut   0 Jun 20 09:41 ..
lrwx------ 1 root root 64 Jun 20 09:42 0 -> /dev/null
lrwx------ 1 root root 64 Jun 20 09:42 1 -> /dev/null
lrwx------ 1 root root 64 Jun 20 09:42 2 -> /dev/null
lrwx------ 1 root root 64 Jun 20 09:42 3 -> 'socket:[2054195331]'
lr-x------ 1 root root 64 Jun 20 09:42 4 -> /etc/nut/dummy.dev
lrwx------ 1 root root 64 Jun 20 09:42 5 -> 'socket:[2054195349]'
lrwx------ 1 root root 64 Jun 20 09:42 6 -> 'socket:[2054195372]'

# /usr/local/ups/bin/dummy-ups -a dummy -c reload
Network UPS Tools - Device simulation and repeater driver 0.19 (2.8.2-113-g3d3f41bbe)
Request for driver to reload-or-error returned code 0
upsnotify: notify about state 4 with libsystemd: was requested, but not running as a service unit now, will not spam more about it
upsnotify: failed to notify about state 4: no notification tech defined, will not spam more about it

# ls -la /proc/3117986/fd
total 0
dr-x------ 2 root root  0 Jun 20 09:42 .
dr-xr-xr-x 9 nut  nut   0 Jun 20 09:41 ..
lrwx------ 1 root root 64 Jun 20 09:42 0 -> /dev/null
lrwx------ 1 root root 64 Jun 20 09:42 1 -> /dev/null
lrwx------ 1 root root 64 Jun 20 09:42 2 -> /dev/null
lrwx------ 1 root root 64 Jun 20 09:42 3 -> 'socket:[2054195331]'
lr-x------ 1 root root 64 Jun 20 09:42 4 -> /etc/nut/dummy.dev
lrwx------ 1 root root 64 Jun 20 09:42 5 -> 'socket:[2054195349]'
lrwx------ 1 root root 64 Jun 20 09:42 6 -> 'socket:[2054195372]'
lrwx------ 1 root root 64 Jun 20 09:42 7 -> 'socket:[2054195476]'

...so an eventual fix would be easy to check.

jimklimov commented 3 months ago

Hm, O_NDELAY vs. O_NONBLOCK: gotta check which one we use (the former is per older standards (pre-1988!) and causes read() to return 0, indiscernible from EOF; the latter is more informative): https://comp.unix.programmer.narkive.com/cFN0PRBQ/select-rtns-ready-read-returns-0#post12

jimklimov commented 3 months ago

Oh...

$ git grep O_NDELAY
clients/upssched.c:     ret = fcntl(acc, F_SETFL, ret | O_NDELAY);
clients/upssched.c:             upslog_with_errno(LOG_ERR, "fcntl set O_NDELAY on unix fd failed");
clients/upssched.c:                     /* O_NDELAY with zero bytes means nothing to read but
drivers/clone-outlet.c: ret = fcntl(fd, F_SETFL, ret | O_NDELAY);
drivers/clone-outlet.c:         upslog_with_errno(LOG_ERR, "fcntl set O_NDELAY on UPS [%s] failed", device_path);
drivers/clone.c:        ret = fcntl(fd, F_SETFL, ret | O_NDELAY);
drivers/clone.c:                upslog_with_errno(LOG_ERR, "fcntl set O_NDELAY on UPS [%s] failed", device_path);
drivers/dstate.c:               ret = fcntl(fd, F_SETFL, ret | O_NDELAY);
drivers/dstate.c:                       upslog_with_errno(LOG_ERR, "fcntl set O_NDELAY on unix fd failed");
drivers/upsdrvquery.c:  if (fcntl(conn->sockfd, F_SETFL, ret | O_NDELAY) < 0) {
drivers/upsdrvquery.c:                  upslog_with_errno(LOG_ERR, "fcntl set O_NDELAY on driver socket %s failed", sockfn);
server/sockdebug.c:     ret = fcntl(fd, F_SETFL, ret | O_NDELAY);
server/sstate.c:        ret = fcntl(fd, F_SETFL, ret | O_NDELAY);
server/sstate.c:                upslog_with_errno(LOG_ERR, "fcntl set O_NDELAY on UPS [%s] failed", ups->name);
server/upsd.c:          if (fcntl(sock_fd, F_SETFL, v | O_NDELAY) == -1) {

$ git grep O_NONBLOCK
NEWS.adoc: - The liebert driver now uses O_NONBLOCK, and should now work
clients/nutclient.cpp:                  fd_flags |= O_NONBLOCK;
clients/nutclient.cpp:                  fd_flags &= ~O_NONBLOCK;
clients/upsclient.c:                    fd_flags |= O_NONBLOCK;
clients/upsclient.c:                    fd_flags &= ~O_NONBLOCK;
drivers/apcupsd-ups.c:  fd_flags |= O_NONBLOCK;
drivers/apcupsd-ups.c:          upsdebugx(1,"unexpected fcntl(fd, F_SETFL, fd_flags|O_NONBLOCK) failure");
drivers/belkinunv.c:    fd = open(device, O_RDWR | O_NONBLOCK);
drivers/serial.c:       fd = open(port, O_RDWR | O_NOCTTY | O_EXCL | O_NONBLOCK);

# Hack for WIN32: gotta have something
drivers/serial.h:# define O_NONBLOCK 0
jimklimov commented 3 months ago

As some posts say, on many systems the two definitions are identical. In practice, quick replacement in dstate.c and upsdrvquery.c did not help (the latter actually checks for if (upsdrvquery_read_timeout(conn, tv) < 1) goto socket_error; cases.

The upssched.c offers an example of dealing with this already: