arlake228 opened 7 months ago
pSConfig would run again and notice we had these background tasks that weren't producing any results (based on the run count in pscheduler api), send a cancel and then create a new task.
The run count is the number of runs completed, so it would make sense that a long-running latencybg would show zero until it completes. The count of runs started is in the details as runs-started. Using that might get better results.
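If anyone wants to check a task by hand, something along these lines should show both counters (a sketch; the detail query parameter and any field names other than runs-started are from memory, so adjust to whatever your pScheduler actually returns):

# Compare completed runs vs. runs started for one task (replace the UUID).
TASK="00000000-0000-0000-0000-000000000000"
curl -sk "https://localhost/pscheduler/tasks/${TASK}?detail=true" \
  | jq '.detail | {runs, "runs-started": .["runs-started"]}'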
Stopping runs underway is covered by #129. It will require some work; I'm not sure how much yet. I'll bring that into 5.2.0 and could be talked into making a 5.1.x bugfix.
Hi @arlake228
I've also run into a similar issue, particularly with latencybg tests and powstream memory usage, which leads to the OOM killer being invoked on other services such as opensearch, causing system instability on perfSONAR (v5.1.2) hosts with 16G RAM.
At least from my observations, the pscheduler-runner systemd unit invokes /usr/libexec/pscheduler/daemons/runner, which then invokes /usr/libexec/pscheduler/classes/tool/powstream/run (which in turn runs the powstream binary to perform the tests). At some point, powstream binaries are left running even as new tests are invoked.
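For reference, the chain is easy to see on a live host (assuming pstree and pgrep are installed); any powstream that has been orphaned and reparented away from the runner won't show up under it, which is itself a useful signal:

[root@somefancyhost ~]# pstree -ap "$(pgrep -of 'libexec/pscheduler/daemons/runner')"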
In my case, here's what it looks like over time on a single host within a mesh of 16 servers over 2 days:
This is within the last 24 hour period:
Note that all 16 servers show similar issues; some are worse than others.
On the specific server in question:
[root@somefancyhost ~]# netstat -nalp | grep powstream | grep tcp | wc -l
210
[root@somefancyhost ~]# netstat -nalp | grep powstream | grep udp | wc -l
137
[root@somefancyhost ~]# ps aux | grep powstream | wc -l
277
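For anyone reproducing this, the oldest powstream is easy to find by sorting on elapsed time (standard procps options):

[root@somefancyhost ~]# ps -C powstream -o pid,etimes,lstart,args --sort=-etimes | head -5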
On the oldest running powstream pid:
[root@somefancyhost ~]# lsof -np 440
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
powstream 440 pscheduler cwd DIR 253,0 279 128 /
powstream 440 pscheduler rtd DIR 253,0 279 128 /
powstream 440 pscheduler txt REG 253,0 237304 17034374 /usr/bin/powstream
powstream 440 pscheduler mem REG 253,0 636848 33978986 /usr/lib64/libpcre2-8.so.0.11.0
powstream 440 pscheduler mem REG 253,0 175760 33978989 /usr/lib64/libselinux.so.1
powstream 440 pscheduler mem REG 253,0 2592552 34645670 /usr/lib64/libc.so.6
powstream 440 pscheduler mem REG 253,0 108136 33685639 /usr/lib64/libgcc_s-11-20231218.so.1
powstream 440 pscheduler mem REG 253,0 914376 34645673 /usr/lib64/libm.so.6
powstream 440 pscheduler mem REG 253,0 149216 33839734 /usr/lib64/libnss_resolve.so.2
powstream 440 pscheduler mem REG 253,0 36304 33793754 /usr/lib64/libcap.so.2.48
powstream 440 pscheduler mem REG 253,0 858936 34645666 /usr/lib64/ld-linux-x86-64.so.2
powstream 440 pscheduler 0r FIFO 0,13 0t0 179512011 pipe
powstream 440 pscheduler 1w FIFO 0,13 0t0 179512012 pipe
powstream 440 pscheduler 2w FIFO 0,13 0t0 179512013 pipe
powstream 440 pscheduler 3uW REG 253,5 4 50331821 /var/pscheduler-server/runner/tmp/tmpqzndihb8/9d09b5f3-fad5-4bc9-a34a-a2c7b18b5f86-2024Aug01T084447303411/.powlock
powstream 440 pscheduler 4r CHR 1,9 0t0 9 /dev/urandom
powstream 440 pscheduler 5u IPv4 193866091 0t0 TCP <removed src ip>:59345-><removed dst ip>:owamp-control (ESTABLISHED)
powstream 440 pscheduler 6u REG 253,5 0 50331818 /var/pscheduler-server/runner/tmp/tmpqzndihb8/9d09b5f3-fad5-4bc9-a34a-a2c7b18b5f86-2024Aug01T084447303411/pow.ssTeYq (deleted)
powstream 440 pscheduler 7u REG 253,5 0 50331818 /var/pscheduler-server/runner/tmp/tmpqzndihb8/9d09b5f3-fad5-4bc9-a34a-a2c7b18b5f86-2024Aug01T084447303411/pow.ssTeYq (deleted)
powstream 440 pscheduler 8u IPv4 193875111 0t0 TCP <removed src ip>:41175-><removed dst ip>:owamp-control (ESTABLISHED)
powstream 440 pscheduler 9u REG 253,5 0 50331924 /var/pscheduler-server/runner/tmp/tmpqzndihb8/9d09b5f3-fad5-4bc9-a34a-a2c7b18b5f86-2024Aug01T084447303411/pow.VqBN5u (deleted)
powstream 440 pscheduler 10u REG 253,5 0 50331924 /var/pscheduler-server/runner/tmp/tmpqzndihb8/9d09b5f3-fad5-4bc9-a34a-a2c7b18b5f86-2024Aug01T084447303411/pow.VqBN5u (deleted)
powstream 440 pscheduler 12u REG 253,6 0 170 /tmp/owpskips.zNYTA8 (deleted)
powstream 440 pscheduler 13u REG 253,6 0 601 /tmp/owpskips.NE7yDv (deleted)
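The run directory each leftover powstream belongs to can also be read off its open .powlock descriptor (fd 3 in the lsof output above); a quick loop over all powstream pids gives the full picture (rough sketch of my own, needs root for /proc):

# Print each powstream PID with the runner tmp path it holds open via .powlock.
for pid in $(pgrep -x powstream); do
    lock=$(ls -l /proc/"$pid"/fd 2>/dev/null | grep -o '/var/pscheduler-server/runner/tmp/[^ ]*\.powlock.*')
    echo "$pid ${lock:-<no .powlock open>}"
done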
strace on the same pid (440) indicates that it's alive and well (I can provide a 120s sample if required). A random sampling of other running powstream pids looks similar.
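If a sample like that is useful to others, something like this captures a timestamped 120s trace and detaches cleanly (standard strace/timeout usage):

[root@somefancyhost ~]# timeout -s INT 120 strace -f -tt -p 440 -o /tmp/powstream-440.strace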
I also see the same issues within our perfSONAR (v5.1.2) dev env (which is a smaller mesh of 3 servers within the same site).
In addition, I was chatting with another perfSONAR server operator who mentioned to me that they don't use latencybg tests because they find them problematic.
FWIW, I had initially thought the issue was being caused by owstats (as mentioned in this issue, which I had created); however, I don't think that's the case (red herring?). That said, swapping from chronyd back to ntpd did slow the OOM events down.
I ran into a case in 5.1.0 where I built a mesh in which some of the far-end hosts had not come up yet. Since latencybg tests are single-participant, the tasks were still created and powstream was started. Since there was nothing on the far end, powstream would never yield any results.
pSConfig would run again and notice we had these background tasks that weren't producing any results (based on the run count in the pScheduler API), send a cancel, and then create a new task. Example of such a task below:
After canceling the task, pSConfig would create a new task. The problem is that the "cancel" operation doesn't kill the powstream. This led to a build-up of powstreams (and quite a bit of memory usage), as new ones got created every hour for tasks in this category. I know we have discussed this before but was having trouble finding the issue for it. I think pSConfig is doing the right thing with the info it has, but it'd be nice to kill the underlying background process on cancel.
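The real fix belongs in the runner's cancel path, but purely as a hypothetical operator-side stopgap (my sketch only; it assumes a powstream whose .powlock shows as deleted corresponds to a run that was canceled and cleaned up, which should be verified against /var/pscheduler-server/runner/tmp before anyone runs it):

# Hypothetical stopgap, not part of pScheduler: terminate powstream processes
# whose .powlock file shows as deleted, i.e. the run directory is gone but the
# process was never killed. Assumption: deleted .powlock == canceled run.
for pid in $(pgrep -x powstream); do
    if ls -l /proc/"$pid"/fd 2>/dev/null | grep -q 'runner/tmp/.*\.powlock (deleted)'; then
        echo "terminating orphaned powstream $pid"
        kill "$pid"
    fi
done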