tohojo / flent

The FLExible Network Tester.
https://flent.org
Other
431 stars 77 forks source link

flent on openbsd loops, consuming available cpu/memory, after test completes #221

Closed sthen closed 3 years ago

sthen commented 3 years ago

Hi. flent test runs aren't completing on OpenBSD. I've tried with 1.2.2 on python 2.7.18 and 3.8.8, and with 2.0.0 on 3.8.8 and a few test types. netperf and fping are spawned, tests run, then at the end of the run one of the following happens:

python 2.7.x:

PingRunner Ping (ms) ICMP finished
Runner aggregation finished
Traceback (most recent call last):
  File "/usr/local/bin/flent", line 11, in <module>
    load_entry_point('flent==1.2.2', 'console_scripts', 'flent')()
  File "/usr/local/lib/python2.7/site-packages/flent/__init__.py", line 59, in run_flent
    b.run()
  File "/usr/local/lib/python2.7/site-packages/flent/batch.py", line 610, in run
    return self.run_test(self.settings, self.settings.DATA_DIR, True)
  File "/usr/local/lib/python2.7/site-packages/flent/batch.py", line 509, in run_test
    res = self.agg.postprocess(self.agg.aggregate(res))
  File "/usr/local/lib/python2.7/site-packages/flent/aggregators.py", line 261, in aggregate
    for s in range(steps):
MemoryError
TimerRunner Watchdog [Ping (ms) ICMP] finished

python 3.x: flent process loops consuming cpu and continually allocating memory. If I run under gdb and point py-bt at it I get a similar trace to 2.7's after the MemoryError.

(gdb) run /usr/local/bin/flent -v -H 127.0.0.1 -l 10 tcp_ndown --te=download_streams=4
Starting program: /usr/local/bin/python3.8 /usr/local/bin/flent -v -H 127.0.0.1 -l 10 tcp_ndown --te=download_streams=4
[New process 39572]
Executing test environment file /usr/local/lib/python3.8/site-packages/flent/tests/tcp_ndown.conf
Started Flent 2.0.0 using Python 3.8.8.
Looking up hostname '127.0.0.1'.
Executing test environment file /usr/local/lib/python3.8/site-packages/flent/tests/tcp_ndown.conf
Gathering local metadata
Looking up hostname '127.0.0.1'.
Starting tcp_ndown test. Expected run time: 20 seconds.
which: /home/sthen/bin/fping is not an executable file
which: /bin/fping is not an executable file
which: /sbin/fping is not an executable file
which: /usr/bin/fping is not an executable file
which: /usr/sbin/fping is not an executable file
which: /usr/X11R6/bin/fping is not an executable file
which: /usr/local/bin/fping is not an executable file
which: Found fping executable at /usr/local/sbin/fping
which: /home/sthen/bin/ping is not an executable file
which: /bin/ping is not an executable file
which: Found ping executable at /sbin/ping
which: /home/sthen/bin/netperf is not an executable file
which: /bin/netperf is not an executable file
which: /sbin/netperf is not an executable file
which: /usr/bin/netperf is not an executable file
which: /usr/sbin/netperf is not an executable file
which: /usr/X11R6/bin/netperf is not an executable file
which: Found netperf executable at /usr/local/bin/netperf
which: /home/sthen/bin/netperf is not an executable file
which: /bin/netperf is not an executable file
which: /sbin/netperf is not an executable file
which: /usr/bin/netperf is not an executable file
which: /usr/sbin/netperf is not an executable file
which: /usr/X11R6/bin/netperf is not an executable file
which: Found netperf executable at /usr/local/bin/netperf
which: /home/sthen/bin/netperf is not an executable file
which: /bin/netperf is not an executable file
which: /sbin/netperf is not an executable file
which: /usr/bin/netperf is not an executable file
which: /usr/sbin/netperf is not an executable file
which: /usr/X11R6/bin/netperf is not an executable file
which: Found netperf executable at /usr/local/bin/netperf
which: /home/sthen/bin/netperf is not an executable file
which: /bin/netperf is not an executable file
which: /sbin/netperf is not an executable file
which: /usr/bin/netperf is not an executable file
which: /usr/sbin/netperf is not an executable file
which: /usr/X11R6/bin/netperf is not an executable file
which: Found netperf executable at /usr/local/bin/netperf
Forking to run command /usr/local/sbin/fping  -D -p 200 -c 100 -t 40000   127.0.0.1
Forked /usr/local/sbin/fping as pid 36312
Ping (ms) ICMP: Starting watchdog with timeout 21
Started TimerRunner idx 2 ('Watchdog [Ping (ms) ICMP]')
Started PingRunner idx 2 ('Ping (ms) ICMP')
Forking to run command /usr/local/bin/netperf -P 0 -v 0 -D -0.20 -4 -Y 0x0 -H 127.0.0.1 -p 12865 -t TCP_MAERTS -l 10 -F /dev/urandom -f m   --    -H 127.0.0.1 -k THROUGHPUT,LOCAL_CONG_CONTROL,REMOTE_CONG_CONTROL,TRANSPORT_MSS,LOCAL_TRANSPORT_RETRANS,REMOTE_TRANSPORT_RETRANS,LOCAL_SOCKET_TOS,REMOTE_SOCKET_TOS,DIRECTION,ELAPSED_TIME,PROTOCOL,LOCAL_SEND_SIZE,LOCAL_RECV_SIZE,REMOTE_SEND_SIZE,REMOTE_RECV_SIZE,LOCAL_BYTES_SENT,LOCAL_BYTES_RECVD,REMOTE_BYTES_SENT,REMOTE_BYTES_RECVD  
Forked /usr/local/bin/netperf as pid 69806
TCP download::1: Starting watchdog with timeout 25
Started TimerRunner idx 3 ('Watchdog [TCP download::1]')
Started NetperfDemoRunner idx 3 ('TCP download::1')
Forking to run command /usr/local/bin/netperf -P 0 -v 0 -D -0.20 -4 -Y 0x0 -H 127.0.0.1 -p 12865 -t TCP_MAERTS -l 10 -F /dev/urandom -f m   --    -H 127.0.0.1 -k THROUGHPUT,LOCAL_CONG_CONTROL,REMOTE_CONG_CONTROL,TRANSPORT_MSS,LOCAL_TRANSPORT_RETRANS,REMOTE_TRANSPORT_RETRANS,LOCAL_SOCKET_TOS,REMOTE_SOCKET_TOS,DIRECTION,ELAPSED_TIME,PROTOCOL,LOCAL_SEND_SIZE,LOCAL_RECV_SIZE,REMOTE_SEND_SIZE,REMOTE_RECV_SIZE,LOCAL_BYTES_SENT,LOCAL_BYTES_RECVD,REMOTE_BYTES_SENT,REMOTE_BYTES_RECVD  
Forked /usr/local/bin/netperf as pid 46880
TCP download::2: Starting watchdog with timeout 25
Started TimerRunner idx 4 ('Watchdog [TCP download::2]')
Started NetperfDemoRunner idx 4 ('TCP download::2')
Forking to run command /usr/local/bin/netperf -P 0 -v 0 -D -0.20 -4 -Y 0x0 -H 127.0.0.1 -p 12865 -t TCP_MAERTS -l 10 -F /dev/urandom -f m   --    -H 127.0.0.1 -k THROUGHPUT,LOCAL_CONG_CONTROL,REMOTE_CONG_CONTROL,TRANSPORT_MSS,LOCAL_TRANSPORT_RETRANS,REMOTE_TRANSPORT_RETRANS,LOCAL_SOCKET_TOS,REMOTE_SOCKET_TOS,DIRECTION,ELAPSED_TIME,PROTOCOL,LOCAL_SEND_SIZE,LOCAL_RECV_SIZE,REMOTE_SEND_SIZE,REMOTE_RECV_SIZE,LOCAL_BYTES_SENT,LOCAL_BYTES_RECVD,REMOTE_BYTES_SENT,REMOTE_BYTES_RECVD  
Forked /usr/local/bin/netperf as pid 79870
TCP download::3: Starting watchdog with timeout 25
Started TimerRunner idx 5 ('Watchdog [TCP download::3]')
Started NetperfDemoRunner idx 5 ('TCP download::3')
Forking to run command /usr/local/bin/netperf -P 0 -v 0 -D -0.20 -4 -Y 0x0 -H 127.0.0.1 -p 12865 -t TCP_MAERTS -l 10 -F /dev/urandom -f m   --    -H 127.0.0.1 -k THROUGHPUT,LOCAL_CONG_CONTROL,REMOTE_CONG_CONTROL,TRANSPORT_MSS,LOCAL_TRANSPORT_RETRANS,REMOTE_TRANSPORT_RETRANS,LOCAL_SOCKET_TOS,REMOTE_SOCKET_TOS,DIRECTION,ELAPSED_TIME,PROTOCOL,LOCAL_SEND_SIZE,LOCAL_RECV_SIZE,REMOTE_SEND_SIZE,REMOTE_RECV_SIZE,LOCAL_BYTES_SENT,LOCAL_BYTES_RECVD,REMOTE_BYTES_SENT,REMOTE_BYTES_RECVD  
Forked /usr/local/bin/netperf as pid 46779
TCP download::4: Starting watchdog with timeout 25
Started TimerRunner idx 6 ('Watchdog [TCP download::4]')
Started NetperfDemoRunner idx 6 ('TCP download::4')
NetperfDemoRunner TCP download::2 finished
NetperfDemoRunner TCP download::1 finished
NetperfDemoRunner TCP download::3 finished
NetperfDemoRunner TCP download::4 finished
[New thread 395121]
PingRunner Ping (ms) ICMP finished
Runner aggregation finished
TimerRunner Watchdog [Ping (ms) ICMP]: timer expired
TimerRunner Watchdog [Ping (ms) ICMP] finished
TimerRunner Watchdog [TCP download::2]: timer expired
TimerRunner Watchdog [TCP download::2] finished
TimerRunner Watchdog [TCP download::1]: timer expired
TimerRunner Watchdog [TCP download::4]: timer expired
TimerRunner Watchdog [TCP download::1] finished
TimerRunner Watchdog [TCP download::3]: timer expired
TimerRunner Watchdog [TCP download::4] finished
TimerRunner Watchdog [TCP download::3] finished
^C
Thread 1 received signal SIGINT, Interrupt.
0x00000396ee51daba in float_sub (v=<float at remote 0x39641988a30>, w=<float at remote 0x396fd627590>) at Objects/floatobject.c:559
559     CONVERT_TO_DOUBLE(v, a);
(gdb) py-bt
Traceback (most recent call first):
  File "/usr/local/lib/python3.8/site-packages/flent/aggregators.py", line 808, in aggregate
  File "/usr/local/lib/python3.8/site-packages/flent/batch.py", line 1027, in run_test
  File "/usr/local/lib/python3.8/site-packages/flent/batch.py", line 1897, in run
  File "/usr/local/lib/python3.8/site-packages/flent/__init__.py", line 59, in run_flent
    b.run()
  File "/usr/local/bin/flent", line 11, in <module>
    load_entry_point('flent==2.0.0', 'console_scripts', 'flent')()

Do you have any ideas as to what might be happening please? The netperf and fping processes definitely exit before the "TimerRunner Watchdog [xxx]: timer expired" messages are displayed (confirmed a few times while running pstree at various points). I'm not all that proficient in python but maybe able to poke at some things if you have any ideas of what to try.

Flent has worked on OpenBSD in the past but it's been a couple of years since I previously tried it, so couldn't say when this started. I'm running recent OpenBSD -current on amd64, netperf is v2.7.0 and fping is v5.0.

tohojo commented 3 years ago

Hmm, I've had a report of something that kinda sounds similar on OSX. Maybe they are related? Could you please try the latest git version? That adds a bit of extra debugging, which may help me correlate the two issues.

Also, something seems to be off with that line number in the last trace (there is no line 808 in aggregators.py) - any chance you can figure out the real place it's spinning?

How much memory is it allocating before it crashes / before you kill it, and how long does it spin for?

sthen commented 3 years ago

It spins for as long as I let it run or until it bumps into ulimit, RES grows at about 16M per 10 seconds.

There's not much difference in output with git head, apart from outputting Executing lines when gathering information at startup the only differences in output are that it doesn't print "[New thread 395121]" after "NetperfDemoRunner TCP download::4 finished" and timers expire in a different order, here's the diff

 NetperfDemoRunner TCP download::1 finished
 NetperfDemoRunner TCP download::3 finished
 NetperfDemoRunner TCP download::4 finished
-[New thread 395121]
 PingRunner Ping (ms) ICMP finished
 Runner aggregation finished
 TimerRunner Watchdog [Ping (ms) ICMP]: timer expired
 TimerRunner Watchdog [Ping (ms) ICMP] finished
 TimerRunner Watchdog [TCP download::2]: timer expired
 TimerRunner Watchdog [TCP download::2] finished
-TimerRunner Watchdog [TCP download::1]: timer expired
-TimerRunner Watchdog [TCP download::4]: timer expired
-TimerRunner Watchdog [TCP download::1] finished
 TimerRunner Watchdog [TCP download::3]: timer expired
-TimerRunner Watchdog [TCP download::4] finished
 TimerRunner Watchdog [TCP download::3] finished
+TimerRunner Watchdog [TCP download::4]: timer expired
+TimerRunner Watchdog [TCP download::4] finished
+TimerRunner Watchdog [TCP download::1]: timer expired
+TimerRunner Watchdog [TCP download::1] finished

I'll add some prints in aggregators.py to try to figure out where it's spinning and get back to you.

sthen commented 3 years ago

Oh, well the python 2 trace gives a pretty good idea where it's spinning actually. Printing t_0/t_max/steps before the loop, something is clearly wrong with t_0...

t_0 581513.01849 t_max 1615131003.179 steps 8072747451
sthen commented 3 years ago

"the loop" = "for s in range(steps)" in TimeseriesAggregator aggregate().

sthen commented 3 years ago

...so, fping is buggy. If SO_TIMESTAMPNS isn't available on the OS then it tries to fetch time from CLOCK_MONOTONIC and use it as realtime for -D.

$ fping -D -c1 127.0.0.1 2>/dev/null         
[594271.59597] 127.0.0.1 : [0], 64 bytes, 0.079 ms (0.079 avg, 0% loss)

Having repaired that, flent is happy again. Hopefully this is the same issue on OSX.

Reported here: https://github.com/schweikert/fping/issues/216

tohojo commented 3 years ago

Ah, nice find, thanks! Still, Flent should probably detect this and bail - going to reopen this issue to track that...