tohojo / flent

The FLExible Network Tester.
https://flent.org

Low throughput + NetEm delay creates gaps in upload data #265

Open upnix opened 2 years ago

upnix commented 2 years ago

The problem: In Mininet, when limiting link speed to 10Mbps (via TBF or NetEm) and adding any amount of delay with NetEm, Flent using Netperf+TCP_STREAM will return large gaps in upload data - both in CSV output and resulting charts. While Netperf acts strangely in this scenario (which I'll describe below), I believe it is Flent and the use of apply_to in the DATA_SETS data structure that causes this problem.

The setup:

With a network configuration of 1 router, 2 subnets, and 2 hosts (h1, h2), I use TBF to rate limit all links to 10Mbit/s, and NetEm to add ~28ms of delay between hosts (7ms on each link, but any amount of delay will do). I run Netserver on host h2, and the Flent test on h1, with traffic crossing the router. I'll attach my configuration files.

Commands:

$ sudo python3 ~/mininet_networks/1Router_2Networks_3Hosts.py
mininet> h2 pkill netserver
mininet> h2 netserver
mininet> h1 ethtool -K h1-eth0 tso off gso off gro off
mininet> h2 ethtool -K h2-eth0 tso off gso off gro off
mininet> h3 ethtool -K h3-eth0 tso off gso off gro off
mininet> r0 ethtool -K r0-eth1 tso off gso off gro off
mininet> r0 ethtool -K r0-eth2 tso off gso off gro off
mininet> r0 tc qdisc add dev r0-eth1 root tbf rate 10mbit burst 4096kbit latency 5ms
mininet> r0 tc qdisc add dev r0-eth2 root tbf rate 10mbit burst 4096kbit latency 5ms
mininet> r0 tc qdisc add dev r0-eth1 parent 8001: netem delay 7ms
mininet> r0 tc qdisc add dev r0-eth2 parent 8002: netem delay 7ms
mininet> h1 tc qdisc add dev h1-eth0 root tbf rate 10mbit burst 4096kbit latency 5ms
mininet> h1 tc qdisc add dev h1-eth0 parent 8005: netem delay 7ms
mininet> h2 tc qdisc add dev h2-eth0 root tbf rate 10mbit burst 4096kbit latency 5ms
mininet> h2 tc qdisc add dev h2-eth0 parent 8007: netem delay 7ms
mininet> h1 flent -H 10.0.0.100 -x --socket-stats -d 0 -l 60 tcp_2up -f csv -D ~chris/ -t 'TCP 2 Up ' -o ~chris/tcp_2up.csv

The result: There are large gaps in the results reported by Flent.

Narrowing the problem down:

Above, I showed the problem with the Flent-included tcp_2up test. Because I believe the issue lies with the use of apply_to, I retooled the test to avoid it. That gives two new test configurations:

  1. tcp_nup_2.conf - This is the Flent-included tcp_nup.conf, modified by commenting out the function add_stream, the call to for_stream_config() and the DATA_SETS entry "TCP upload avg". I then hard-code in what is essentially a single "TCP upload::1" test.
  2. tcp_1up_from_nup_2.conf - This is tcp_2up.conf, but it includes tcp_nup_2.conf instead of tcp_nup.conf

Now, running the Flent test tcp_1up_from_nup_2.conf, upload data is shown as continuous, as you'd expect.

Why?

I don't know. What I do know is that the Flent test tcp_2down has no problems, and when I run the related Netperf command directly, TCP_MAERTS returns results with the expected regularity (NETPERF_INTERVAL[xx]=0.2, more or less). However, the Netperf test TCP_STREAM, which tcp_2up uses, leaves about 4 seconds between results (NETPERF_INTERVAL[xx]=4, more or less). The results returned still seem accurate to me; there are just longer pauses between reports.

But this can't be the entire story, because Flent tests that don't use apply_to when building DATA_SETS use the exact same Netperf command, gaps and all, yet don't have this problem.

So it would seem to me that somehow Flent isn't properly handling gaps in reporting when apply_to is used for DATA_SETS.

What else fixes the problem?

Note that these are probably things that just make Netperf return results every 0.2 seconds (I haven't checked though), so they're probably not directly related to Flent.

Files of interest:

Flent results when running the included tcp_2up test: tcp_2up-2022-04-22T095700.876743.TCP_2_Up.flent.gz

My Flent test that avoids gaps in upload data: tcp_1up_from_nup_2.txt tcp_nup_2.txt

The Mininet network used: 1Router_2Networks_3Hosts.txt

tohojo commented 2 years ago

Chris Cameron writes:

> The problem: In Mininet, when limiting link speed to 10Mbps (via TBF or NetEm) and adding any amount of delay with NetEm, Flent using Netperf+TCP_STREAM will return large gaps in upload data - both in CSV output and resulting charts. While Netperf acts strangely in this scenario (which I'll describe below), I believe it is Flent and the use of apply_to in the DATA_SETS data structure that causes this problem.

So you're kinda right that the problem is caused by an interaction between netperf's behaviour and the Flent series computation (for certain series). Specifically, this is what happens:

The reason for the latter is the way Flent computes the synthetic data: it will try to generate a synthetic data point at every 'step size' interval, by linearly interpolating the points on both sides. E.g., if netperf outputs data points at t=0.198 and t=0.398, it'll interpolate between those to generate a synthetic data point at t=0.2. This will happen for each series, and the sum or average computation is done on those synthetic data points that are all aligned to the step size intervals.
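To make the interpolation step concrete, here is a minimal sketch of linear interpolation between two raw samples. It is not Flent's actual code; the function name `interpolate` and the throughput values are made up for illustration, and only the timestamps come from the example above.

```python
def interpolate(t, p0, p1):
    """Linearly interpolate the value at time t between two samples.

    Each sample is a (time, value) tuple, with p0[0] <= t <= p1[0].
    Hypothetical helper for illustration, not part of Flent.
    """
    (t0, v0), (t1, v1) = p0, p1
    if t1 == t0:
        return v0
    frac = (t - t0) / (t1 - t0)  # how far t sits between the two samples
    return v0 + frac * (v1 - v0)


# Raw netperf samples at t=0.198 and t=0.398 straddle the grid point t=0.2.
# The throughput values (9.0 and 10.0 Mbit/s) are invented for this example.
value = interpolate(0.2, (0.198, 9.0), (0.398, 10.0))
print(round(value, 6))
```

Since t=0.2 sits only 1% of the way between the two samples, the synthetic point stays very close to the first raw value.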

The problem you're seeing happens because there's a maximum interpolation distance (of five times the step size), and if the data points are further apart than this, no interpolation will be done and you'll get gaps in the synthetic series.
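The effect of the distance limit can be sketched as follows. This is an assumed reading of the rule described above (no synthetic point when the surrounding raw samples are more than five step sizes apart), not Flent's implementation; `resample`, `max_dist_steps`, and the sample values are hypothetical.

```python
import bisect


def resample(samples, step, max_dist_steps=5):
    """Resample sorted (time, value) samples onto multiples of `step`.

    Returns (t, value) pairs; value is None where the raw samples on
    either side of t are more than max_dist_steps * step apart, or
    where t lies outside the sampled range.
    """
    times = [t for t, _ in samples]
    values = [v for _, v in samples]
    n_points = int(times[-1] / step + 1e-9) + 1
    out = []
    for i in range(n_points):
        t = round(i * step, 9)  # avoid float drift such as 0.6000000000000001
        hi = bisect.bisect_left(times, t)  # first raw sample at or after t
        if hi == len(times):
            out.append((t, None))          # past the last raw sample
        elif times[hi] == t:
            out.append((t, values[hi]))    # raw sample exactly on the grid
        elif hi == 0:
            out.append((t, None))          # before the first raw sample
        else:
            lo = hi - 1
            gap = times[hi] - times[lo]
            if gap > max_dist_steps * step:
                out.append((t, None))      # too far apart: leave a gap
            else:
                frac = (t - times[lo]) / gap
                out.append((t, values[lo] + frac * (values[hi] - values[lo])))
    return out


# Invented samples: regular 0.2 s reports, then a 4 s pause like the one
# reported for TCP_STREAM in this issue.
raw = [(0.0, 9.5), (0.2, 9.6), (0.4, 9.4), (4.4, 9.5), (4.6, 9.5)]
series = resample(raw, step=0.2)
missing = [t for t, v in series if v is None]
print(len(series), "grid points,", len(missing), "of them empty")
```

With a 0.2 s step the limit is 1 s, so every grid point inside the 4 s pause is left empty, which is the kind of gap shown in the charts.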

Now, as for the question about what can be done about it, I'm afraid that (in my opinion) the answer turns out to be "not much". The fundamental problem here is that we're trying to compute a value that's not really well-defined, because we're combining several time series whose data points don't line up in time.

I.e., as an example, if there are two instances of netperf running, series A outputs data points at t=1, 4, and 7 seconds, and series B outputs data points at t=3, 6 and 9 seconds, how are you really going to tell what the average throughput at t=2 seconds was?

(That's a serious question, BTW, if you have an idea for a better algorithm for interpolating data points, or just computing the synthetic series in a different way, I'm all ears).
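The two-series example can be made concrete with a short sketch that looks up which raw samples surround a given time. The helper `bracket` is hypothetical and only uses the timestamps from the example above.

```python
def bracket(times, t):
    """Return the (earlier, later) sample times around t.

    Either element is None when no sample exists on that side.
    """
    earlier = max((x for x in times if x <= t), default=None)
    later = min((x for x in times if x >= t), default=None)
    return earlier, later


series_a = [1, 4, 7]  # seconds at which series A reports
series_b = [3, 6, 9]  # seconds at which series B reports

for name, times in (("A", series_a), ("B", series_b)):
    print(name, bracket(times, 2))
```

At t=2, series A is bracketed by samples at 1 and 4 seconds, while series B has no earlier sample at all, so any average at that instant has to be estimated from data up to two seconds away or left undefined.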

As a workaround you could try increasing the step size; this should make the error in netperf's data output relatively smaller (since the error tends to stay roughly constant in absolute terms), which may help get rid of the gaps...
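As a rough guide to how large the step size would need to be, the arithmetic can be sketched like this. It assumes the rule is exactly "gap when raw samples are more than five step sizes apart" and uses the roughly 4-second reporting pauses mentioned in the original report; `min_step_size` is a hypothetical helper.

```python
def min_step_size(max_report_gap, max_dist_steps=5):
    """Smallest step size for which samples max_report_gap seconds apart
    are still within max_dist_steps step sizes of each other."""
    return max_report_gap / max_dist_steps


# ~4 s between TCP_STREAM reports, as observed in this issue.
print(min_step_size(4.0))
```

Under those assumptions a step size of about 0.8 seconds or more would keep a 4-second pause inside the interpolation limit, at the cost of coarser time resolution in the plots.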