tbarbette / npf

Network Performance Framework: easy-to-use experiment manager with automated testing, result collection, and graphing
GNU General Public License v3.0

Go through an intermediate CSV before doing graphs #26

Open · tbarbette opened this issue 2 years ago

tbarbette commented 2 years ago

So the idea would be to keep the cache idea as hidden as possible, and always export a CSV that will be used to create the graphs. The npf commands will continue to build graphs automatically, but a new npf-graph command would allow rebuilding the very same graph from the CSV.

The remaining question, therefore, is what the appropriate CSV format would be, knowing we have multiple output variables, multiple runs per parameter combination, and also multiple series when using npf-compare.

Imagine we compare netperf and iperf, have one variable "ZEROCOPY" that can take the values 0 and 1, have two output results THROUGHPUT and LATENCY, and do 2 runs:

```
series,run_number,ZEROCOPY,THROUGHPUT,LATENCY
iperf,1,0,...
iperf,1,1,...
iperf,2,0,...
iperf,2,1,...
netperf,1,0,...
netperf,1,1,...
netperf,2,0,...
netperf,2,1,...
```

The problem remains that some "outputs" (results) can have multiple values in the same run. We could use another (somewhat non-standard) separator to pack multiple results into a single column, e.g. the "+" sign (using ";" might lead to bad interpretation of the CSV).
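As a concrete illustration, here is a minimal sketch of how such a file could be read back, expanding the "+"-joined cells into lists. The column names match the example above, but the numeric values and the `split_multi` helper are made up for illustration:

```python
import csv
from io import StringIO

# Hypothetical sample in the format sketched above; "+" joins the
# multiple values an output produced within a single run.
SAMPLE = """\
series,run_number,ZEROCOPY,THROUGHPUT,LATENCY
iperf,1,0,9.2+9.4,120
iperf,1,1,11.1,95+97
netperf,1,0,8.7,130
"""

def split_multi(cell, sep="+"):
    """Expand a '+'-joined cell into a list of floats."""
    return [float(v) for v in cell.split(sep)]

for row in csv.DictReader(StringIO(SAMPLE)):
    print(row["series"], row["run_number"], row["ZEROCOPY"],
          split_multi(row["THROUGHPUT"]), split_multi(row["LATENCY"]))
```

One caveat with "+": it also appears in scientific-notation floats such as `1.2e+07`, so values would either have to be written without an exponent or the splitter made aware of that.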

Any input on this?

MassimoGirondi commented 2 years ago

What about storing all the intermediate data in a binary format? Pickle is the first that comes to mind.

Not the most elegant solution, but it would abstract away having to save each individual combination of parameters for each particular run, and having to invent a custom CSV syntax. Then it's a matter of separating the testing and graphing parts, invoking each before or after the (de)serializer depending on whether you want to do the graphing or only export the results.

You can see it as a sort of "snapshot" of the results in that particular run.
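A minimal sketch of that snapshot idea, using a made-up result structure (these names are illustrative, not npf's actual internals): pickle does not care about the shape of what it stores, which is exactly the appeal here.

```python
import pickle
from pathlib import Path

# Hypothetical in-memory results: series -> list of runs, each run
# holding its variable values and (possibly multi-valued) outputs.
results = {
    "iperf": [
        {"ZEROCOPY": 0, "THROUGHPUT": [9.2, 9.4], "LATENCY": [120.0]},
        {"ZEROCOPY": 1, "THROUGHPUT": [11.1], "LATENCY": [95.0, 97.0]},
    ],
}

snapshot = Path("results.pkl")

# Testing phase: dump everything in one shot, whatever its shape.
with snapshot.open("wb") as f:
    pickle.dump(results, f)

# Graphing phase (possibly a separate invocation): load it back.
with snapshot.open("rb") as f:
    restored = pickle.load(f)

assert restored == results
```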

tbarbette commented 2 years ago

In the first versions I actually used pickle. But as the format evolved, I suffered from backward-incompatible loading and had to re-execute tests. The advantage of some kind of CSV is that it's human-readable. But yes, it starts to get complex with multiple results.

And I did not mention the problem of time series... How to store a dozen results over the duration of the experiment, at time intervals that differ from one experiment to the next...

tbarbette commented 2 years ago

Maybe the CSV is still the best way to handle this, with just a weird format for the weird use cases (multiple results per run). And maybe one CSV file per time series (again, time series are not needed in all experiments).
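A sketch of what one-file-per-time-series could look like, under the assumption that each experiment samples at its own instants (file names and values are hypothetical): an explicit TIME column sidesteps the problem of experiments not sharing a time grid.

```python
import csv
from pathlib import Path

# Hypothetical per-experiment time series: sampling instants differ
# between experiments, so each gets its own file with an explicit
# TIME column instead of forcing a shared time grid.
series = {
    "iperf_ZEROCOPY0_run1": [(0.0, 9.1), (1.0, 9.3), (2.5, 9.2)],
    "iperf_ZEROCOPY1_run1": [(0.0, 11.0), (0.7, 11.2)],
}

for name, points in series.items():
    with Path(f"{name}.csv").open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["TIME", "THROUGHPUT"])
        writer.writerows(points)
```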

MassimoGirondi commented 2 years ago

JSON? I'm not a huge fan of it for cases like this, but it could be a good tradeoff, keeping basic human readability while allowing nesting and object-like syntax...
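For comparison, the same hypothetical results from the earlier sketches expressed as nested JSON (again, the structure is illustrative, not a proposed npf schema): multi-valued outputs and irregular time series both fit naturally, at the cost of being less spreadsheet-friendly than a flat CSV.

```python
import json

# Multi-valued outputs become plain lists; each time series is a list
# of [time, value] pairs, so intervals can differ per experiment.
results = {
    "iperf": [
        {
            "variables": {"ZEROCOPY": 0},
            "outputs": {"THROUGHPUT": [9.2, 9.4], "LATENCY": [120.0]},
            "time_series": {
                "THROUGHPUT": [[0.0, 9.1], [1.0, 9.3], [2.5, 9.2]],
            },
        },
    ],
}

print(json.dumps(results, indent=2))
```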