completion delay for an iperf script

tohojo / flent

The FLExible Network Tester.

https://flent.org

completion delay for an iperf script #138

Closed teto closed 6 years ago

teto commented 6 years ago

I would like to send a fixed amount of data (I am more used to iperf), so for instance run iperf -c HOST -n 6MB 10 times and show a CDF of the completion times.

I kinda understand the .conf format and came up with my own script iperf_cdf.conf

DESCRIPTION="iperf completion delay 6MB"
DEFAULTS={'PLOT': "iperf_delay",
          'HOSTS': []}

IPERF_V6=""
if IP_VERSION == 6:
    IPERF_V6="-V"

DATA_SETS = o([
    ('TCP iperf',
        # -n => fixed number of byte
         {'command': "iperf -c %s -i %.2f -y C -n 6M %s" % (HOST, max(0.5,STEP_SIZE), IPERF_V6),
          'delay': DELAY,               # do I need this anymore ?
          'units': 'second',
          'runner': 'iperf_csv',})
    ])

PLOTS['iperf_delay'] = {'description': 'Iperf completion time',
                        'type': 'timeseries',
                        'series': [
                            {'data': 'TCP iperf',
                             'label': 'completion delay (ms)'}]}

My problem is how to retrieve the actual data, i.e. the time taken by the iperf connection. I see that the runners have a parse function and a self._raw_values attribute.

Just for info here is the output.

iperf -n 6M -c localhost --enhancedreports
------------------------------------------------------------
Client connecting to localhost, TCP port 5001 with pid 20613
Write buffer size:  128 KByte
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 127.0.0.1 port 46618 connected with 127.0.0.1 port 5001
[ ID] Interval        Transfer    Bandwidth       Write/Err  Rtry    Cwnd/RTT
[  3] 0.00-0.23 sec  6.00 MBytes   221 Mbits/sec  1/0      65483        0K/56811 us

NB: can I load this iperf_cdf.conf out of tree ?

teto commented 6 years ago

Here is my current attempt https://github.com/tohojo/flent/compare/master...teto:iperf_delay?expand=1 with a completely wrong result (image: cdf). The x unit is not a CDF; I confess I am a tad lost.

tohojo commented 6 years ago

Matthieu Coudron notifications@github.com writes:

I would like to send a fixed amount of data (I am more used to iperf), so for instance run iperf -c HOST -n 6MB 10 times and show a CDF of the completion times.

Hmm, so what is it that you are trying to measure here? This doesn't sound like timeseries data. You just want to repeat a single download and measure its time, or are you planning to run another bulk flow or latency measurement at the same time?

I kinda understand the .conf format and came up with my own script iperf_cdf.conf


DESCRIPTION="iperf completion delay 6MB"
DEFAULTS={'PLOT': "iperf_delay",
          'HOSTS': []}

IPERF_V6=""
if IP_VERSION == 6:
    IPERF_V6="-V"

DATA_SETS = o([
    ('TCP iperf',
        # -n => fixed number of byte
         {'command': "iperf -c %s -i %.2f -y C -n 6M %s" % (HOST, max(0.5,STEP_SIZE), IPERF_V6),
          'delay': DELAY,               # do I need this anymore ?
          'units': 'second',
          'runner': 'iperf_csv',})
    ])

This will get you the iperf CSV output (i.e., bandwidth over time).

Just for info here is the output.

iperf -n 6M -c localhost --enhancedreports
------------------------------------------------------------
Client connecting to localhost, TCP port 5001 with pid 20613
Write buffer size:  128 KByte
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 127.0.0.1 port 46618 connected with 127.0.0.1 port 5001
[ ID] Interval        Transfer    Bandwidth       Write/Err  Rtry    Cwnd/RTT
[  3] 0.00-0.23 sec  6.00 MBytes   221 Mbits/sec  1/0      65483        0K/56811 us

Yeah, there's no parser for this currently. But one could probably be added, either to the existing iperf parser or as another runner. It depends a bit on what data you want; see my question above.
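As a rough illustration of what such a parser would have to do, here is a minimal standalone sketch that pulls the duration out of the last interval row of the --enhancedreports output quoted above. It is not Flent code: the regular expression and the completion_time name are assumptions made for this example, and it has only been checked against the single sample shown in this thread.

```python
import re

# Matches an interval row such as
# "[  3] 0.00-0.23 sec  6.00 MBytes   221 Mbits/sec ..."
# (pattern is an assumption based on the sample output in this thread).
INTERVAL_RE = re.compile(
    r"^\[\s*(?P<id>\d+)\]\s+"
    r"(?P<start>\d+(?:\.\d+)?)-(?P<end>\d+(?:\.\d+)?)\s+sec\s+"
    r"(?P<transfer>\d+(?:\.\d+)?)\s+(?P<unit>\w+)"
)


def completion_time(iperf_output):
    """Return the duration (seconds) of the last interval row, or None."""
    duration = None
    for line in iperf_output.splitlines():
        match = INTERVAL_RE.match(line.strip())
        if match:
            # Later rows overwrite earlier ones, so the final recap row wins.
            duration = float(match.group("end")) - float(match.group("start"))
    return duration


sample = """\
[  3] local 127.0.0.1 port 46618 connected with 127.0.0.1 port 5001
[ ID] Interval        Transfer    Bandwidth       Write/Err  Rtry    Cwnd/RTT
[  3] 0.00-0.23 sec  6.00 MBytes   221 Mbits/sec  1/0      65483        0K/56811 us
"""
print(completion_time(sample))
```

The connection line and the column header do not match the pattern, so only the interval rows contribute.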

NB: can I load this iperf_cdf.conf out of tree ?

No. You can drop it in the tests folder, of course, but there is no support for loading tests from an outside directory. Having this would imply that the test format is API, which it isn't. So I'd much rather include the test upstream; should be possible to parameterise it so it is generally useful :)

teto commented 6 years ago

You just want to repeat a single download and measure its time,

Yes that's exactly it.

What I don't get is the mapping between the DATA_SET items and the python code. I get that for the PLOTS[XXX], you should specify in 'series' the name of a DATA_SETS entry but I haven't seen yet how the runner exports its parsed values to the DATA_SETS. In this example what decides the kind of data that 'TCP iperf' represents among all the values returned by iperf output ?

The last line of the iperf output is a recap of the full download, which is the reason for this addition https://github.com/tohojo/flent/compare/master...teto:iperf_delay?expand=1#diff-9257f1ed49fd478f9b31d21c2b63a4b6R1325 (i.e., subtract the first timestamp from the last one). When cleaning up the code I would move it a bit later, since here it is apparently recomputed for each new line.
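The idea of reading the duration off the recap row can be sketched outside Flent as follows. This assumes the iperf 2 CSV layout (-y C) in which the seventh comma-separated field is the interval, written as start-end; the sample row is made up for illustration and is not taken from a real run.

```python
def csv_completion_time(csv_output):
    """Duration in seconds from the interval field of the last CSV row.

    Assumes iperf 2 '-y C' rows of the form
    timestamp,src_ip,src_port,dst_ip,dst_port,id,interval,bytes,bits_per_s
    """
    rows = [line for line in csv_output.splitlines() if line.strip()]
    if not rows:
        return None
    fields = rows[-1].split(",")
    start, end = fields[6].split("-")
    return float(end) - float(start)


# Hypothetical sample row (values invented for the example).
sample_csv = "20180608154503,127.0.0.1,46618,127.0.0.1,5001,3,0.0-0.2,6291456,221000000\n"
print(csv_completion_time(sample_csv))
```

Doing this once, after all rows have been read, avoids the per-line recomputation mentioned above.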

To sum up:

1/ How can I get 'TCP iperf' to represent the duration of one download?

2/ How can I run iperf several times in order to plot the CDF of the previous result?

Thanks for the help. I am very impressed by the tool so far (I've been writing a tool to compute one-way delays and Multipath TCP statistics so I know it can be tricky).

tohojo commented 6 years ago

Matthieu Coudron notifications@github.com writes:

You just want to repeat a single download and measure its time,

Yes that's exactly it.

Right. Well, there are basically two ways you can go about this. Either you create a test that runs only a single transfer, capture the delay as a metadata item, and then run the test multiple times to collect multiple data points (each of which will go into its own data file), and plot the CDF over all of them (using a cdf_combine type plot). Or you go play with IterationAggregator.

The combining of different runners into the toplevel data set is done by aggregators; most of the tests return timeseries data, i.e. each runner is expected to return a timeseries data series. Single data points for a whole test run are stored as metadata (in the series_meta array).

However, there's also some old code that does single-data-point-per-run. Before I realised that timeseries data was the most useful for my tests, I actually started out with exactly what you are describing: Using each test as a single data point and plotting those. That code still exists in the form of the IterationAggregator; but you'd have to write a new runner that outputs a single data point instead of the timeseries data, and obviously you wouldn't get the bandwidth output over the duration of the test.

I think that the easiest approach is the first one, since the timeseries data is what's used most, and the other code might have bitrotted (not sure). If you don't mind using netperf, it's just a matter of adding ELAPSEDTIME as an output var and storing that in the metadata array. Then we'd need to add support for passing negative durations to netperf (that's how you specify a byte count), which is mostly a question of how to have a good API in Flent. With this, you could use the existing tcp* tests and just add a plot to those...

You could do something similar in the iperf runner as well, of course, it would just take a bit more parsing I think :)

teto commented 6 years ago

so I managed to pass a negative length but it fails with

NetperfDemoRunner TCP netperf::5 finished
Runner aggregation finished
Data file written to ./tcp_iperf_delay-2018-06-08T154503.539961.flent.gz.
Creating new PlotFormatter
CACHEDIR=/home/teto/.cache/matplotlib
Using fontManager instance from /home/teto/.cache/matplotlib/fontList.json
backend GTK3Cairo version unknown
Initialised matplotlib v2.2.2 on numpy v1.14.2.
Traceback (most recent call last):
  File "/run/user/1000/tmp.84zoa9EUe9/bin/flent", line 11, in <module>
    load_entry_point('flent', 'console_scripts', 'flent')()
  File "/home/teto/flent/flent/__init__.py", line 59, in run_flent
    b.run()
  File "/home/teto/flent/flent/batch.py", line 617, in run
    return self.run_test(self.settings, self.settings.DATA_DIR, True)
  File "/home/teto/flent/flent/batch.py", line 525, in run_test
    formatter.format([res])
  File "/home/teto/flent/flent/formatters.py", line 396, in format
    self.init_plots(results)
  File "/home/teto/flent/flent/formatters.py", line 381, in init_plots
    self.plotter.init()
  File "/home/teto/flent/flent/plotters.py", line 1814, in init
    s_unit = self.data_config[s['data']]['units']
KeyError: 'TCP netperf'

on https://github.com/tohojo/flent/compare/master...teto:iperf_delay?expand=1

Ideally I would like to run one TCP transfer several times, but I thought running several at once would allow me to test the cdf_combine faster.

tohojo commented 6 years ago

Matthieu Coudron notifications@github.com writes:

Ideally I would like to run one TCP transfer several times, but I thought running several at once would allow me to test the cdf_combine faster.

Ah. Using 'duplicates' changes the name (adds ::1, ::2, etc). You can use a glob in the plot definition to grab all of them.
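To see why the plain name 'TCP netperf' raises a KeyError while a glob works, here is a small standalone sketch. It uses Python's fnmatch module as a stand-in; Flent's own glob() helper is not reproduced here, and the extra 'Ping (ms) ICMP' name is just a hypothetical non-matching series.

```python
from fnmatch import fnmatchcase

# With 'duplicates': 5 the series are named "TCP netperf::1" .. "TCP netperf::5".
series_names = ["TCP netperf::%d" % i for i in range(1, 6)] + ["Ping (ms) ICMP"]

# The undecorated name is no longer a key, hence the KeyError.
print("TCP netperf" in series_names)

# A glob pattern picks up every duplicate.
matched = [name for name in series_names if fnmatchcase(name, "TCP netperf*")]
print(matched)
```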

To run the test multiple times, you'd just repeat the Flent invocation (or use the batch facility and the 'repetitions' keyword). That will produce a bunch of data files that you then plot at once in a separate plot definition.
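The workflow described here (run the test N times, then plot all the data files in one go) can be sketched as a pair of command builders. This assumes the usual flent command-line options -H (host), -i (input data file), -p (plot name) and -o (output file); the test name is inferred from the data file name in this thread, while the host and file names below are placeholders.

```python
import subprocess  # only needed if the commands are actually executed


def run_commands(test, host, repetitions):
    """One flent invocation per repetition; each writes its own data file."""
    return [["flent", test, "-H", host] for _ in range(repetitions)]


def plot_command(data_files, plot, output):
    """A single flent invocation that reads existing data files and plots them."""
    cmd = ["flent"]
    for path in data_files:
        cmd += ["-i", path]
    return cmd + ["-p", plot, "-o", output]


runs = run_commands("tcp_iperf_delay", "server.example", 10)
plot = plot_command(["run1.flent.gz", "run2.flent.gz"], "iperf_delay", "cdf.png")
print(plot)
# To execute for real: for cmd in runs + [plot]: subprocess.run(cmd, check=True)
```

Because the plot step only reads the stored data files, it can be repeated with different plot names without re-running the transfers.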

tohojo commented 6 years ago

With the latest commits you shouldn't need to patch runners.py at all; just pass the number of bytes (positive number) as the 'bytes' argument to the netperf runner.

teto commented 6 years ago

I used to have

Runner aggregation finished
ERROR: No data to aggregate. Run with -L and check log file to investigate.

with this in the log:

ELAPSED_TIME=0.00
...
-- OUTPUT START -->{'TCP netperf::1': [],
 'TCP netperf::2': [],
 'TCP netperf::3': [],
 'TCP netperf::4': [],
 'TCP netperf::5': []}<-- OUTPUT END --

as if ELAPSED_TIME being 0.00 wouldn't count as a result.

I increased the byte count and it started producing results:

2018-06-11 11:38:01,160 [flent.aggregators] DEBUG: Runner aggregation finished
-- OUTPUT START -->{'TCP netperf::1': [[1528684679.972, 41611.46]],
 'TCP netperf::2': [[1528684679.99, 45512.46]],
 'TCP netperf::3': [[1528684680.044, 45546.14]],
 'TCP netperf::4': [[1528684680.008, 46227.74]],
 'TCP netperf::5': [[1528684680.026, 46214.39]]}<-- OUTPUT END --

but these don't seem to be the ELAPSED_TIME.

I've tried using this

DATA_SETS = o([
    ('TCP netperf',
     {'test': 'TCP_STREAM',
      'host': HOST,
      'length': 10,
      'bytes': 100000000,
      'duplicates': 5,
      # 'units': 'Mbits/s',
      'runner': 'netperf_demo',})
    ])

PLOTS['iperf_delay'] = {'description': 'Netperf completion time',
                        'type': 'cdf_combine',
                        'group_by': 'groups_concat',
                        'series': [
                            # when duplicates is used the name is changed to e.g.:
                            # Started NetperfDemoRunner idx 4 ('TCP netperf::5')
                            # thus we need to use the glob
                            {'data': glob('TCP netperf*'),
                             'label': 'completion delay (ms)',
                             'combine_mode': 'meta:ELAPSED_TIME'}]}

and it displays an empty plot. When opening tcp_iperf_delay-2018-06-07T143147.599875.flent, it seems like all the elapsed times were 0.02. I wanted to modify these values in place and regenerate a plot from the modified tcp_iperf_delay-2018-06-07T143147.599875.flent, but couldn't find a way to do it.

How can I get

 'TCP netperf::2': <ElapsedTime2>,
 'TCP netperf::3': <ElapsedTime3>,
 'TCP netperf::4': <ElapsedTime4><-- OUTPUT END --

and then plot the CDF of it? Maybe that's not how it works, and I shouldn't care about the 'TCP netperf::2': [] data since ELAPSED_TIME is metadata? It really looks like magic, as it's a kind of DSL mixed with Python.

tohojo commented 6 years ago

Matthieu Coudron notifications@github.com writes:

I used to have

Runner aggregation finished
ERROR: No data to aggregate. Run with -L and check log file to investigate.

with this in the log:

ELAPSED_TIME=0.00

Yeah, this is an issue with netperf; it'll only output time in 10-ms intervals. So you'll need to have enough data for the transfer to take longer than this.
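As a back-of-the-envelope check, the smallest transfer that takes at least one 10 ms reporting step at a given rate can be computed as below. The function name is made up for this example; the 221 Mbits/sec figure is the loopback rate from the iperf output earlier in the thread, and the result is only the point where ELAPSED_TIME stops reading 0.00, not a size that gives useful precision.

```python
def min_bytes_for_resolution(rate_bits_per_s, resolution_ms=10):
    """Smallest byte count whose transfer lasts at least resolution_ms.

    Integer arithmetic: bytes = rate [bit/s] * resolution [ms] / (8 * 1000),
    rounded up.
    """
    bits_times_ms = rate_bits_per_s * resolution_ms
    return -(-bits_times_ms // 8000)  # ceiling division


print(min_bytes_for_resolution(221_000_000))    # 221 Mbit/s
print(min_bytes_for_resolution(1_000_000_000))  # 1 Gbit/s
```

For a CDF with meaningful spread, the transfer would need to last many multiples of the 10 ms step, so the byte count should be correspondingly larger.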

The 'no data to aggregate' is because there are no intermediate data points output by netperf. For very short transfers, you could fiddle the step size parameter to get output for smaller intervals.

If you need to run tests that large, I guess the 'no data to aggregate' error could be made non-fatal in the case where there are valid metadata results. But I assume your 'real' tests are going to be on lower bandwidth links so they'll take longer (and thus this will be less of an issue)?

...
-- OUTPUT START -->{'TCP netperf::1': [],
 'TCP netperf::2': [],
 'TCP netperf::3': [],
 'TCP netperf::4': [],
 'TCP netperf::5': []}<-- OUTPUT END --

as if ELAPSED_TIME being 0.00 wouldn't count as a result.

I increased the byte count and it started producing result

2018-06-11 11:38:01,160 [flent.aggregators] DEBUG: Runner aggregation finished
-- OUTPUT START -->{'TCP netperf::1': [[1528684679.972, 41611.46]],
 'TCP netperf::2': [[1528684679.99, 45512.46]],
 'TCP netperf::3': [[1528684680.044, 45546.14]],
 'TCP netperf::4': [[1528684680.008, 46227.74]],
 'TCP netperf::5': [[1528684680.026, 46214.39]]}<-- OUTPUT END --

but these don't seem to be the ELAPSED_TIME.

Yeah, so this is because we are still using the timeseries run mode, so the actual data points being produced are the bandwidth measurements that netperf produces. The elapsed time is stored as metadata, and for this purpose we basically ignore the other results.
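For inspecting the stored metadata by hand, a sketch along these lines may help. It assumes that a .flent.gz file is gzip-compressed JSON and that per-series metadata sits under metadata -> SERIES_META -> series name; that layout is an assumption and should be checked against an actual data file. The elapsed_times helper is made up for this example and is not part of Flent.

```python
import gzip
import json
from fnmatch import fnmatchcase


def elapsed_times(paths, pattern="TCP netperf*", key="ELAPSED_TIME"):
    """Collect one metadata value per matching series from each data file."""
    values = []
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8") as fp:
            data = json.load(fp)
        # Assumed layout: {"metadata": {"SERIES_META": {name: {key: value}}}}
        series_meta = data.get("metadata", {}).get("SERIES_META", {})
        for name in sorted(series_meta):
            meta = series_meta[name]
            if fnmatchcase(name, pattern) and key in meta:
                values.append(float(meta[key]))
    return values
```

Called with the list of data files from several runs, this returns the flat list of elapsed times that a CDF would be computed over.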

I've tried using this

DATA_SETS = o([
    ('TCP netperf',
     {'test': 'TCP_STREAM',
      'host': HOST,
      'length': 10,
      'bytes': 100000000,
      'duplicates': 5,
      # 'units': 'Mbits/s',
      'runner': 'netperf_demo',})
    ])

PLOTS['iperf_delay'] = {'description': 'Netperf completion time',
                        'type': 'cdf_combine',
                        'group_by': 'groups_concat',
                        'series': [
                            # when duplicates is used the name is changed to e.g.:
                            # Started NetperfDemoRunner idx 4 ('TCP netperf::5')
                            # thus we need to use the glob
                            {'data': glob('TCP netperf*'),
                             'label': 'completion delay (ms)',
                             'combine_mode': 'meta:ELAPSED_TIME'}]}

Well, this should more or less work, I think. However, you only get one data point to plot here; the 5 repetitions are not going to be five data points to plot a CDF over; they are going to be 5 different series with one datapoint each.

To get multiple datapoints, you'd need to run multiple tests (for s in $(seq 10); do flent <args>; done), then plot all data files at once.
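Once there is one elapsed time per run, the CDF itself is simple to compute. The sketch below is plain Python rather than Flent's cdf_combine plot, and the sample durations are invented for illustration.

```python
def empirical_cdf(samples):
    """Return (value, cumulative fraction) pairs for the sorted samples."""
    ordered = sorted(samples)
    count = len(ordered)
    return [(value, (index + 1) / count) for index, value in enumerate(ordered)]


# Hypothetical completion times (seconds), one per test run.
durations = [0.3, 0.1, 0.2, 0.4]
for value, fraction in empirical_cdf(durations):
    print(value, fraction)
```

With a single run there is only one pair, (value, 1.0), which is why one data file with five duplicates does not give a CDF over five points unless the series are combined.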

How can I get
 'TCP netperf::2': <ElapsedTime2>,
 'TCP netperf::3': <ElapsedTime3>,
 'TCP netperf::4': <ElapsedTime4><-- OUTPUT END --
and then plot the CDF of it? Maybe that's not how it works, and I shouldn't care about the 'TCP netperf::2': [] data since ELAPSED_TIME is metadata? It really looks like magic, as it's a kind of DSL mixed with Python.

See above; you are right that you should just ignore those data points ;)

And yeah, this is really a DSL; it's just Python because that was easier than coming up with a full DSL myself. I've been thinking about changing it, but, well, that is a different subject...

-Toke

teto commented 6 years ago

Yeah, this is an issue with netperf; it'll only output time in 10-ms

Thanks for pointing this out. Not that useful for intra-datacenter latencies then. My final tests should effectively take longer, but it's nice to iterate over shorter transfers.

To get multiple datapoints, you'd need to run multiple tests (for s in $(seq 10); do flent <args>; done), then plot all data files at once.

But each run will be recorded in a different .flent.gz? Which goes back to my initial question: how can I plot data directly from the .flent.gz without re-running the tests? Thanks for the help.

tohojo commented 6 years ago

Matthieu Coudron notifications@github.com writes:

Yeah, this is an issue with netperf; it'll only output time in 10-ms

thanks for pointing this out. Not that useful for intra datacenter latencies then. My final tests should effectively take longer but it's nice to iterate over shorter transfers.

Well it's more a function of the transfer size, but yeah, this is a bit annoying...

To get multiple datapoints, you'd need to run multiple tests (for s in $(seq 10); do flent <args>; done), then plot all data files at once.

But each run will be recorded in a different .flent.gz ? which goes back to my initial question, how can I plot data directly from the flent.gz without re-running the tests ?

Yes. If you want to collect several data points in a single .flent.gz, the timeseries aggregator is not going to work. You'll have to use the iteration aggregator (set AGGREGATOR='iteration' in the test config file), and use a different runner that outputs the ELAPSED_TIME as a single data point. I think it should be enough to subclass NetperfDemoRunner and redefine parse() to call the parent, pull out the ELAPSED_TIME metadata, and return that.
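The shape of that subclass can be sketched with a stand-in base class. Nothing below is Flent's real API: StubDemoRunner, its parse() signature, and the KEY=value input format are all hypothetical and exist only so the example runs on its own. The real NetperfDemoRunner has its own constructor and parsing logic.

```python
class StubDemoRunner:
    """Hypothetical stand-in for NetperfDemoRunner (NOT the real class)."""

    def __init__(self):
        self.metadata = {}

    def parse(self, output):
        # Pretend parser: collects timeseries points and records metadata.
        points = []
        for line in output.splitlines():
            key, _, value = line.partition("=")
            if key == "ELAPSED_TIME":
                self.metadata["ELAPSED_TIME"] = float(value)
            elif key == "POINT":
                timestamp, bandwidth = value.split(",")
                points.append([float(timestamp), float(bandwidth)])
        return points


class ElapsedTimeRunner(StubDemoRunner):
    """Returns a single data point (the elapsed time) instead of a series."""

    def parse(self, output):
        super().parse(output)  # let the parent fill in self.metadata
        return self.metadata.get("ELAPSED_TIME")


runner = ElapsedTimeRunner()
print(runner.parse("POINT=1528684679.972,41611.46\nELAPSED_TIME=0.02"))
```

The timeseries points are computed by the parent and then discarded, which matches the remark earlier in the thread that the bandwidth output is lost with this approach.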

-Toke

teto commented 6 years ago

I started subclassing and then got other errors. I need to plot this for a deadline, so I prefer to stay on the safe side and go back to writing some simple scripts. I feel bad to let you down after you helped me so much. Sorry, I had underestimated the difficulty :'(

tohojo commented 6 years ago

Matthieu Coudron notifications@github.com writes:

I started subclassing and then got other errors. I need to plot this for a deadline, so I prefer to stay on the safe side and go back to writing some simple scripts. I feel bad to let you down after you helped me so much. Sorry, I had underestimated the difficulty :'(

Right, no worries. I have a looming deadline myself, or I would have been of more help :)

BTW, if you want to use the plotting facilities of Flent with your own dataset you can do 'from flent import resultset' to build your own data files...