
tohojo / flent

The FLExible Network Tester.

https://flent.org


Runner scalability #257

Closed tohojo closed 2 years ago

tohojo commented 2 years ago

This pull request contains a number of fixes that significantly improve the scaling of Flent when spawning a lot of runners:

- Fix the SsRunner to only parse output once when running multiple duplicate runners (i.e., for multiple flows on the same host).
- Move parsing of runner output into separate subprocesses to let it run in parallel without the Python GIL.
- Fix a bunch of inefficiencies in the way Flent forks off runner child processes and the Python threads that monitor them.
- Fix the behaviour when a test is interrupted so runners are more reliably shut down in a timely manner.
- Improve memory usage by only reading the tool output into memory when it is absolutely needed, and keeping it there for as short a time as possible:
  - in particular, turn all parsers into stream-based parsers
  - also fix garbage collection of runners after a test is done so data isn't kept in memory
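As a rough illustration of the stream-based, out-of-process parsing idea in the list above (not Flent's actual code), the sketch below parses tool output one line at a time and does the work in worker processes so it is not serialised by the GIL. The names `parse_stream`, `parse_file` and `parse_all`, and the two-column "timestamp value" line format, are assumptions made up for this example.

```python
import io
from concurrent.futures import ProcessPoolExecutor


def parse_stream(lines):
    """Yield (time, value) pairs one line at a time.

    The whole tool output is never held in memory; only the current line is.
    The "timestamp value" format is a hypothetical stand-in for real output.
    """
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip lines that do not fit the assumed format
        try:
            yield float(parts[0]), float(parts[1])
        except ValueError:
            continue  # skip lines whose fields are not numeric


def parse_file(path):
    """Run inside a worker process: stream the file, return only parsed data."""
    with open(path) as fp:
        return list(parse_stream(fp))


def parse_all(paths, workers=4):
    """Parse several output files in parallel, one worker process per file."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_file, paths))


if __name__ == "__main__":
    sample = io.StringIO("0.0 1.5\nnot a data line\n0.2 2.5\n")
    print(list(parse_stream(sample)))
```

Because `parse_stream` accepts any iterable of lines, the same parser works on an open file, a pipe from a child process, or an in-memory buffer.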

The individual commits contain the details; along with the main changes listed above are various smaller fixes that turned out to be useful along the way.

With these changes it is quite feasible to run a tcp_nup test with 1000 flows on my laptop, at least as far as starting the netperf instances is concerned (whether the network can actually handle it is a different matter :) ).

dtaht commented 2 years ago

At one level I applaud. At another I kind of wish we were extracting more, directly from TCP_INFO. Do we really need ss?

tohojo commented 2 years ago

Dave Täht writes:

> At one level I applaud. At another I kind of wish we were extracting more, directly from TCP_INFO. Do we really need ss?

Alternatives welcome, especially if they come with patches :)

dtaht commented 2 years ago

Tell ya what. I'll go back to coding, if you get back into politics.

tohojo commented 2 years ago

@dtaht care to take this for a spin?

As for your question about TCP_INFO, it looks like it should be feasible to integrate this as an alternative to 'ss': https://github.com/m-lab/tcp-info

dtaht commented 2 years ago

tomorrow. pst

dtaht commented 2 years ago

Oh, my aching fingers and pre-existing test scripts that did --te

flent: error: ambiguous option: --te=upload_streams=1 could match --test-payload, --test-parameter
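The wording of that error matches what Python's argparse prints when an abbreviated long option is a prefix of more than one registered option. A minimal sketch of the effect follows; the parser here is a toy with only the two option names from the error message, and `RaisingParser` and `build_parser` are hypothetical helpers for this example, not part of Flent.

```python
import argparse


class RaisingParser(argparse.ArgumentParser):
    """ArgumentParser that raises instead of exiting, so the error can be inspected."""

    def error(self, message):
        raise ValueError(message)


def build_parser(with_payload):
    """Toy parser: optionally registers a second option sharing the --te prefix."""
    parser = RaisingParser(prog="flent-sketch")
    parser.add_argument("--test-parameter")
    if with_payload:
        parser.add_argument("--test-payload")
    return parser


if __name__ == "__main__":
    # With a single matching option, the abbreviation is accepted.
    args = build_parser(with_payload=False).parse_args(["--te=upload_streams=1"])
    print(args.test_parameter)

    # With two options sharing the prefix, the abbreviation becomes ambiguous.
    try:
        build_parser(with_payload=True).parse_args(["--te=upload_streams=1"])
    except ValueError as exc:
        print(exc)
```

Scripts that spell out `--test-parameter` in full are unaffected by new options being added later.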

dtaht commented 2 years ago

flent -H fremont.starlink.taht.net --socket-stats --step-size=.02 --test-parameter=download_streams=1000 -t cell-tether-1000 tcp_ndown

ERROR: Resource limit of 1024 files is too low - need at least 4012 for this test

A hint steering a naive user toward ulimit -n 5096, or Flent raising the limit directly, would be good. ulimit is not easily discoverable.

dtaht commented 2 years ago

flent -H fremont.starlink.taht.net --socket-stats --step-size=.02 --test-parameter=download_streams=1000 -t cell-tether-1000 rrul

Starting Flent 2.0.1-git-c78dac1 using Python 3.8.10.
Starting rrul test. Expected run time: 70 seconds.
Exception in thread Thread-13:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/dist-packages/flent-2.0.1_git_c78dac1-py3.8.egg/flent/runners.py", line 523, in run
    pid, sts = os.waitpid(self.pid, os.WNOHANG)
TypeError: an integer is required (got type NoneType)
Exception in thread Thread-14:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/dist-packages/flent-2.0.1_git_c78dac1-py3.8.egg/flent/runners.py", line 523, in run
    pid, sts = os.waitpid(self.pid, os.WNOHANG)
TypeError: an integer is required (got type NoneType)
Exception in thread Thread-15:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/dist-packages/flent-2.0.1_git_c78dac1-py3.8.egg/flent/runners.py", line 523, in run
    pid, sts = os.waitpid(self.pid, os.WNOHANG)
TypeError: an integer is required (got type NoneType)
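The TypeError in the traceback above is what os.waitpid raises when it is handed None instead of a process ID, which suggests the monitoring thread ran before the child's pid had been recorded. The sketch below shows one way such a poll could guard against that case; `poll_child` is a hypothetical helper for illustration and is not taken from Flent or from the eventual fix.

```python
import os
import time


def poll_child(pid):
    """Return the child's exit status, or None if it has not started or is still running.

    A negative return value means the child was killed by that signal number.
    """
    if pid is None:
        return None  # child not forked yet; os.waitpid(None, ...) would raise TypeError
    done_pid, status = os.waitpid(pid, os.WNOHANG)
    if done_pid == 0:
        return None  # child exists but has not exited yet
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)


if __name__ == "__main__":
    print(poll_child(None))  # no child yet, so nothing to reap
    child = os.fork()
    if child == 0:
        os._exit(3)  # child process: exit immediately with a known status
    result = None
    while result is None:
        result = poll_child(child)
        time.sleep(0.01)
    print(result)
```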

dtaht commented 2 years ago

The rtt_fair test works. Don't know why the rrul test doesn't.

tohojo commented 2 years ago

Well because there was a bug, obviously ;)

Should be fixed now, and also improved the rlimit handling so it tries to raise it automatically and hints at ulimit if that fails...
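For readers curious what "raise it automatically and hint at ulimit" can look like, here is a minimal sketch using Python's standard resource module (Unix only). It is an assumption-laden illustration, not the code from this pull request; `ensure_nofile_limit` is a made-up name, and 4012 is simply the figure from the error message earlier in the thread.

```python
import resource


def ensure_nofile_limit(needed):
    """Try to make the soft open-file limit at least `needed`.

    Returns True if the limit is (now) sufficient, False if it could not be raised.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY or soft >= needed:
        return True  # already high enough
    if hard != resource.RLIM_INFINITY and hard < needed:
        return False  # the hard limit caps us; raising it needs privileges
    try:
        # Raise only the soft limit; the hard limit is left unchanged.
        resource.setrlimit(resource.RLIMIT_NOFILE, (needed, hard))
    except (ValueError, OSError):
        return False
    return True


if __name__ == "__main__":
    needed = 4012  # value taken from the error message in this thread
    if not ensure_nofile_limit(needed):
        print("Could not raise the open-file limit; try 'ulimit -n %d' first." % needed)
```

An unprivileged process may raise its soft limit up to the hard limit, so the fallback hint is only needed when the hard limit itself is too low.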

dtaht commented 2 years ago

WFM. But doesn't your test suite exercise all the known tests? I know it would take a long time to complete, but...

tohojo commented 2 years ago

Nope, never did get around to having the test suite actually run the tests; it only exercises the plotters and some of the parsers (which did unearth another bug, so not completely useless).

Thanks for testing! :)

dtaht commented 2 years ago

A pleasure to fiddle with this stuff again with you.