networkupstools / nut

The Network UPS Tools repository. UPS management protocol Informational RFC 9271 published by IETF at https://www.rfc-editor.org/info/rfc9271 Please star NUT on GitHub, this helps with sponsorships!
https://networkupstools.org/
Other
2.07k stars 350 forks source link

dummy-ups consumes much CPU #567

Open jimklimov opened 6 years ago

jimklimov commented 6 years ago

We have a system that acts as a sort of NUT proxy, to alleviate actual power devices during testing, and to provide some virtual devices - both roles implemented by dummy-ups from 42ity fork of NUT (should be on par with github master). The system was restarted about a week ago, and today I've noticed that the dummy-ups drivers that serve "dev" files consume a lot of CPU (50-70% of a core each) despite the files ending with a TIMER 60 line.

Restarting the driver in debug mode did not help - it behaves as it should, reading the file and then quickly and quietly refreshing the state ("I'm not stale") with no load on the system.

While straceing the driver with high load, which ran for a week now, I see it doing the following loop:

...
10:48:36 select(8, [5 6 7], NULL, NULL, {0, 352}) = 1 (in [6], left {0, 350})
10:48:36 read(6, "", 512)               = 0
10:48:36 gettimeofday({1530262116, 859989}, NULL) = 0
10:48:36 select(8, [5 6 7], NULL, NULL, {0, 246}) = 1 (in [6], left {0, 243})
10:48:36 read(6, "", 512)               = 0
10:48:36 gettimeofday({1530262116, 860095}, NULL) = 0
10:48:36 select(8, [5 6 7], NULL, NULL, {0, 140}) = 1 (in [6], left {0, 138})
10:48:36 read(6, "", 512)               = 0
10:48:36 gettimeofday({1530262116, 860208}, NULL) = 0
10:48:36 select(8, [5 6 7], NULL, NULL, {0, 27}) = 1 (in [6], left {0, 25})
10:48:36 read(6, "", 512)               = 0
10:48:36 gettimeofday({1530262116, 860312}, NULL) = 0
10:48:36 select(8, [5 6 7], NULL, NULL, {0, 0}) = 1 (in [6], left {0, 0})
10:48:36 read(6, "", 512)               = 0
10:48:36 gettimeofday({1530262116, 860416}, NULL) = 0

(many screenfuls of this spit out quickly)

then a sleep:

10:48:36 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
10:48:36 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
10:48:36 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
10:48:36 nanosleep({1, 0},

followed by another burst like above

with occasional occurrences of

11:10:31 gettimeofday({1530263431, 513516}, NULL) = 0
11:10:31 select(8, [5 6 7], NULL, NULL, {0, 577422}) = 1 (in [6], left {0, 577419})
11:10:31 read(6, "", 512)               = 0
11:10:31 gettimeofday({1530263498, 611545}, NULL) = 0
11:11:38 select(8, [5 6 7], NULL, NULL, {0, 0}) = 2 (in [6 7], left {0, 0})
11:11:38 read(7, "PING\nPING\nPING\nPING\nPING\nPING\nPI"..., 512) = 55
11:11:38 write(7, "PONG\n", 5)          = 5
11:11:38 write(7, "PONG\n", 5)          = 5
11:11:38 write(7, "PONG\n", 5)          = 5
11:11:38 write(7, "PONG\n", 5)          = 5
11:11:38 write(7, "PONG\n", 5)          = 5
11:11:38 write(7, "PONG\n", 5)          = 5
11:11:38 write(7, "PONG\n", 5)          = 5
11:11:38 write(7, "PONG\n", 5)          = 5
11:11:38 write(7, "PONG\n", 5)          = 5
11:11:38 write(7, "PONG\n", 5)          = 5
11:11:38 write(7, "PONG\n", 5)          = 5
11:11:38 read(6, "", 512)               = 0
11:11:38 gettimeofday({1530263498, 612809}, NULL) = 0
11:11:38 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
11:11:38 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
11:11:38 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
11:11:38 nanosleep({1, 0},

For the load, top says this:

top - 11:13:37 up 556 days, 22:39,  2 users,  load average: 1.62, 1.70, 1.73
Tasks: 431 total,   4 running, 427 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.3 us,  1.0 sy,  0.0 ni, 96.6 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  32239872 total, 24631604 used,  7608268 free,   261504 buffers
KiB Swap: 10790911+total,  9342884 used, 98566224 free. 20079612 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  605 nut       20   0   25688    600    408 R 100.0  0.0   2796:38 dummy-ups
  606 nut       20   0   25736    576    408 R  97.6  0.0   2797:12 dummy-ups
  603 nut       20   0   25556    528    408 S  73.2  0.0   2795:55 dummy-ups
  604 nut       20   0   25556    476    404 S  73.2  0.0   2795:48 dummy-ups
  607 nut       20   0   25736    596    408 R  18.3  0.0   2795:21 dummy-ups
   13 root      20   0       0      0      0 S  12.2  0.0   5555:48 ksoftirqd/1
23149 root      20   0   25876   3112   2456 R  12.2  0.0   0:00.02 top
clepple commented 6 years ago

If you still have the strace output, or if you can grab this from /proc/PID/fd, what is fd#6 connected to?

If it is the dev file, then maybe the code is not handling EOF correctly?

Those listed in readfds will be watched to see if characters become available for reading (more precisely, to see if a read will not block; in particular, a file descriptor is also ready on end-of-file), ...

I'm not sure why the PINGs are not getting flushed.

jimklimov commented 6 years ago

Some unspecified socket; the dev-file is fd#4:

# ls -la /proc/60?/fd/*

lrwx------ 1 root root 64 Jun 29 14:37 /proc/602/fd/0 -> /dev/null
lrwx------ 1 root root 64 Jun 29 14:37 /proc/602/fd/1 -> /dev/null
lrwx------ 1 root root 64 Jun 29 14:37 /proc/602/fd/2 -> /dev/null
lrwx------ 1 root root 64 Jun 29 14:37 /proc/602/fd/3 -> socket:[2029779593]
lr-x------ 1 root root 64 Jun 29 14:37 /proc/602/fd/4 -> /etc/nut/Eaton__UPS_9PX_snmp_pw.dev
lrwx------ 1 root root 64 Jun 29 14:37 /proc/602/fd/5 -> socket:[2029779623]
lrwx------ 1 root root 64 Jun 29 14:37 /proc/602/fd/6 -> socket:[2029812746]
lrwx------ 1 root root 64 Jun 29 14:37 /proc/602/fd/7 -> socket:[2041957512]
lrwx------ 1 root root 64 Jun 29 14:37 /proc/603/fd/0 -> /dev/null
lrwx------ 1 root root 64 Jun 29 14:37 /proc/603/fd/1 -> /dev/null
lrwx------ 1 root root 64 Jun 29 14:37 /proc/603/fd/2 -> /dev/null
lrwx------ 1 root root 64 Jun 29 14:37 /proc/603/fd/3 -> socket:[2029787142]
lr-x------ 1 root root 64 Jun 29 14:38 /proc/603/fd/4 -> /etc/nut/Eaton__UPS_9PX_netxml.dev
lrwx------ 1 root root 64 Jun 29 14:37 /proc/603/fd/5 -> socket:[2029787169]
lrwx------ 1 root root 64 Jun 29 14:37 /proc/603/fd/6 -> socket:[2029812750]
lrwx------ 1 root root 64 Jun 29 14:37 /proc/603/fd/7 -> socket:[2041962657]
lrwx------ 1 root root 64 Jun 29 14:37 /proc/604/fd/0 -> /dev/null
lrwx------ 1 root root 64 Jun 29 14:37 /proc/604/fd/1 -> /dev/null
lrwx------ 1 root root 64 Jun 29 14:37 /proc/604/fd/2 -> /dev/null
lrwx------ 1 root root 64 Jun 29 14:37 /proc/604/fd/3 -> socket:[2029779599]
lr-x------ 1 root root 64 Jun 29 14:37 /proc/604/fd/4 -> /etc/nut/Eaton__UPS_9PX_snmp_mge.dev
lrwx------ 1 root root 64 Jun 29 14:37 /proc/604/fd/5 -> socket:[2029779625]
lrwx------ 1 root root 64 Jun 29 14:37 /proc/604/fd/6 -> socket:[2029812748]
lrwx------ 1 root root 64 Jun 29 14:37 /proc/604/fd/7 -> socket:[2041937804]
lrwx------ 1 root root 64 Jun 29 14:37 /proc/605/fd/0 -> /dev/null
lrwx------ 1 root root 64 Jun 29 14:37 /proc/605/fd/1 -> /dev/null
lrwx------ 1 root root 64 Jun 29 14:37 /proc/605/fd/2 -> /dev/null
lrwx------ 1 root root 64 Jun 29 14:37 /proc/605/fd/3 -> socket:[2029782520]
lr-x------ 1 root root 64 Jun 29 14:37 /proc/605/fd/4 -> /etc/nut/Eaton__EPDU_MI_0U_(309W_32A_3P)18XC13_9XC19_combo_MI_38U-A__snmp-ups__2.7.4.1__01.dev
lrwx------ 1 root root 64 Jun 29 14:37 /proc/605/fd/5 -> socket:[2029782543]
lrwx------ 1 root root 64 Jun 29 14:37 /proc/605/fd/6 -> socket:[2029812756]
lrwx------ 1 root root 64 Jun 29 14:37 /proc/605/fd/7 -> socket:[2041964523]
lrwx------ 1 root root 64 Jun 29 14:37 /proc/606/fd/0 -> /dev/null
lrwx------ 1 root root 64 Jun 29 14:37 /proc/606/fd/1 -> /dev/null
lrwx------ 1 root root 64 Jun 29 14:37 /proc/606/fd/2 -> /dev/null
lrwx------ 1 root root 64 Jun 29 14:37 /proc/606/fd/3 -> socket:[2029786199]
lr-x------ 1 root root 64 Jun 29 14:37 /proc/606/fd/4 -> /etc/nut/Eaton__EPDU_MI_38U-A__snmp-ups__2.7.4__01.dev
lrwx------ 1 root root 64 Jun 29 14:37 /proc/606/fd/5 -> socket:[2029786228]
lrwx------ 1 root root 64 Jun 29 14:37 /proc/606/fd/6 -> socket:[2029812754]
lrwx------ 1 root root 64 Jun 29 14:37 /proc/606/fd/7 -> socket:[2041964527]
lrwx------ 1 root root 64 Jun 29 14:37 /proc/607/fd/0 -> /dev/null
lrwx------ 1 root root 64 Jun 29 14:37 /proc/607/fd/1 -> /dev/null
lrwx------ 1 root root 64 Jun 29 14:37 /proc/607/fd/2 -> /dev/null
lrwx------ 1 root root 64 Jun 29 14:37 /proc/607/fd/3 -> socket:[2029784179]
lr-x------ 1 root root 64 Jun 29 14:37 /proc/607/fd/4 -> /etc/nut/Eaton__EPDU_MI_0U_(309W_32A_3P)18XC13_9XC19__snmp-ups__2.7.4.1__01.dev
lrwx------ 1 root root 64 Jun 29 14:37 /proc/607/fd/5 -> socket:[2029784200]
lrwx------ 1 root root 64 Jun 29 14:37 /proc/607/fd/6 -> socket:[2029812752]
lrwx------ 1 root root 64 Jun 29 14:37 /proc/607/fd/7 -> socket:[2041964592]
jimklimov commented 6 years ago

As another data point: despite all this, upsc DEVICENAME passes well, quickly and producing some reasonably looking key: value results.