lukego opened this issue 8 years ago
cc @domenkozar @eugeneia
I quickly converted this to CSV, uploaded it to Google Drive, and created a simple line chart: https://docs.google.com/spreadsheets/d/1Guo_iBC5P0i5uAkvJHWirZy_Z9nFxxyzOee9LIrraG4/edit#gid=1289069081 I don't see any significant variation between the branches. A few outliers are due to runs with zero results. In principle this is something that could be automated using the Drive API, although there may be better ways to do this. There are certainly smarter ways to visualize the results; this is just a quick hack.
@sleinen Cool! This chart shines a light on the data in a pretty interesting way.
First, there is quite a bit of variation in the "basic1" benchmark results. The highest and lowest values differ by around 20% and there is quite a spread in between. It would be interesting to account for this (CPU? OS? JIT?) and control it (eliminate it with isolcpus? reduce it with longer runs? control it by averaging multiple runs?). For the moment the basic1 benchmark seems not that practical for evaluating the performance impact of code changes if the result can indeed vary by 20% due to random chance.
Second, the packetblaster benchmark seems to have around a 3% failure rate. That is high! This may well be caused by an important bug in our software and warrants investigation.
I am surprised that there is so much low-hanging fruit in looking at the data. And that is even though the DPDK benchmark results seem to be missing from your visualization, and those are the only ones that I expected to be potentially interesting :).
How do you make charts like that in Google Drive? How might I have made it less bothersome to generate CSV?
> packetblaster benchmark seems to have around a 3% failure rate. That is high!
(Out of curiosity I just ran the packetblaster benchmark manually 500 times and didn't see any failures. I wonder what happened in the experimental run. Suggests that we should preserve logs/output!)
Yep, that's an extremely good idea. Nothing like finding a horrifically bad result you can't reproduce and aren't sure if it was an incidental artefact or something real, after all.
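In that spirit, here is a minimal sketch of a wrapper that repeats a benchmark run and keeps every run's full output around for later investigation; `run-benchmark.sh` is a hypothetical placeholder for whatever command is actually being measured:

```sh
#!/usr/bin/env bash
# Hedged sketch: repeat a benchmark N times and preserve each run's output.
# "run-benchmark.sh" is a hypothetical wrapper, not an existing script;
# replace it with the real invocation.
set -e

runs=${1:-500}
logdir="bench-logs-$(date -u +%Y%m%dT%H%M%SZ)"
mkdir -p "$logdir"

for i in $(seq 1 "$runs"); do
  # Keep stdout and stderr of every run so failures can be investigated later.
  if ! ./run-benchmark.sh > "$logdir/run-$i.log" 2>&1; then
    echo "run $i FAILED (see $logdir/run-$i.log)"
  fi
done

echo "logs preserved in $logdir"
```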
@sleinen Oh now I see the "Explore" button on the spreadsheet. That is super convenient!
I also did a Google Sheets import and this time included the DPDK results: https://docs.google.com/spreadsheets/d/165a9OKs5Q3yHxUX15nSrsGvTeLwvwK-U1AyQ7rgskDE/edit?usp=sharing
Does anybody know how to get Google to tell us what is going on there?
The output was a bit of a pain to massage into CSV. I used an Emacs macro but would prefer to use e.g. AWK. I wonder if it would make sense to capture all output, keep everything amenable to screen-scraping, but also make known-interesting values easy to pick up, e.g. with a convention like `@parameter foo 42` and `@result mybenchmark 96`, to make it easy to cherry-pick values for CSV/JSON/etc. It could also be handy to include a bunch of "extra" information, e.g. syslog entries during each test, etc.
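As a rough illustration of how cheap extraction could become with such a convention, here is a sketch; it assumes hypothetical logs that actually contain `@parameter`/`@result` lines, which nothing emits today:

```sh
# Hedged sketch: pick "@parameter <name> <value>" and "@result <name> <value>"
# lines out of a log and emit one CSV row per result, tagged with the
# parameter settings seen so far. The file name run.log is a placeholder.
# (The iteration order of the parameters is unspecified in awk, which is
# fine for a sketch.)
awk '
  $1 == "@parameter" { params[$2] = $3 }
  $1 == "@result" {
    row = $2 "," $3
    for (p in params) row = row "," p "=" params[p]
    print row
  }
' run.log > results.csv
```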
> Second, the packetblaster benchmark seems to have around a 3% failure rate. That is high!
Note that your gist also contained two runs that my stupid Emacs query-replace-regexp didn't convert because they contained the unexpected string "Terminated" between the `snabbnfv-iperf-jumbo` and `snabbnfv-loadgen-dpdk` results:
run: 12
Sat Feb 6 13:08:39 UTC 2016
Switched to and reset branch 'busywait+ndesc/32768'
basic1-100e6 32.7 -
packetblaster-64 12.65 -
packetblaster-synth-64 12.69 -
snabbnfv-iperf-1500 0 -
snabbnfv-iperf-jumbo 0 -
Terminated
snabbnfv-loadgen-dpdk 0 -

run: 8
Sat Feb 6 13:49:22 UTC 2016
Switched to and reset branch 'busywait+ndesc/1024'
basic1-100e6 31.9 -
packetblaster-64 12.69 -
packetblaster-synth-64 0 -
snabbnfv-iperf-1500 0 -
snabbnfv-iperf-jumbo 0 -
Terminated
snabbnfv-loadgen-dpdk 0 -
...as well as a third ungrokkable entry at the end. I think that one was truncated, as it is missing the final `snabbnfv-loadgen-dpdk` line that all the other entries had.
> Does anybody know how to get Google to tell us what is going on there?
- Select the appropriate region.
- Menu: Insert -> Chart -> Type "line chart", then play around with the options until it looks nice. Maybe use "Recommendations"; this automagically ignores boring columns such as the ones with all-0 results.
- Then "Move to separate sheet" the resulting chart using its menu.
I did this to a copy of your spreadsheet:
https://docs.google.com/spreadsheets/d/16pkBveTUa6vFRLSnJle2nqEdmIwVsFZYHjoYdHdoVtw/edit?usp=sharing
Note that I also added a row 1 with the benchmark titles and used them as the legend for the graphs.
I could only use column B as the index, so this completely mixes "master" and "busywait". It would probably be better to combine each pair of a master+busywait row into a single row with separate result columns for master and for busywait, so that the chart then contains separate graphs for these cases.
Probably the even smarter thing to do would be to learn R and statistics and generate publication-quality graphs with the proper statistical metrics on them. But I don't have time for that now :-)
The R/graphing stuff isn't that hard; I had to do it as an undergrad, and it'd take anyone here an hour or two, tops. Deciding what the proper statistical tools to use are is a bit hairier. :-)
So many different tools available: R, Torch, IPython, gnuplot+awk/guile/perl...
The coolest link I found today is Agile Visualization and its suite of visualization tools based on Pharo Smalltalk. That looks like potentially a convenient way to import/massage data, do numerical charting like we are doing here, and also visualize things like LuaJIT trace structures and profiler results. See the demo video.
In the past I have started writing bits of code for analyzing LuaJIT trace dumps in Emacs Lisp but that just feels wrong in this day and age :). Still: these are all exotic tools so I am sure it is important to keep our options open by generating simple formats e.g. CSV/JSON/text.
I did a really basic experiment in the spirit of #688, #741, #692 and following a discussion on the Snabb Slack about tuning engine parameters. I am really keen to do performance tuning work based on well-defined repeatable experiments that measure a broad set of applications and workloads. This is a baby step in that direction.
Specifically, this experiment is to see how changing two parameters affects performance as measured by `make benchmarks`. The first parameter is `engine.busywait` with values `true` and `false`. The second parameter is `intel10g.num_descriptors` with values `512`, `1024`, `2048`, `4096`, `16384`, and `32768`. There are 12 distinct combinations of parameters and four benchmarks (basic1, packetblaster-*, snabbnfv-loadgen-dpdk). I ran this in a loop for a few hours for a total of 179 tests.

Setup
The tests are driven by a shell script that loops through each combination of parameters. Each parameter value is represented by a Git branch that is merged into the test environment, meaning that selecting a parameter value can make an arbitrary code change.
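For concreteness, here is a minimal sketch of what such a driver loop could look like. It is not the script actually used; the branch names (`busywait/true`, `ndesc/1024`, etc.) are assumptions made for illustration:

```sh
#!/usr/bin/env bash
# Hedged sketch of the parameter sweep, not the script actually used.
# Assumes one branch per parameter value (e.g. busywait/true, ndesc/1024)
# that can be merged on top of master before running the benchmarks.
set -e

# Loop until interrupted to collect repeated samples of every combination.
while true; do
  for bw in true false; do
    for nd in 512 1024 2048 4096 16384 32768; do
      git checkout -B test master
      git merge --no-edit "busywait/$bw" "ndesc/$nd"
      echo "=== busywait=$bw ndesc=$nd $(date -u) ==="
      make benchmarks
    done
  done
done
```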
Here is a summary of how the parameter branches look:
and here is the script that I used to create the `ndesc/nnn` parameter branches:

Data
Data is here: full test log.
Here is an excerpt showing a test result with `busywait=true` and `num_descriptors=1024`:

The data could hopefully be massaged into a suitable format (e.g. CSV) using a suitable tool (e.g. AWK).
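In that vein, here is a minimal AWK sketch that turns the log format shown earlier in this thread (a `run:`/date/branch header followed by `<benchmark> <score> -` lines) into CSV; `bench.log` is just a placeholder file name:

```sh
# Hedged sketch: emit run,branch,benchmark,score CSV rows from the log
# format shown in the excerpts above. Adjust the patterns if the real
# log differs.
awk '
  /^run:/                         { run = $2; next }
  /^Switched to and reset branch/ { branch = $NF; gsub("\047", "", branch); next }
  NF == 3 && $3 == "-"            { print run "," branch "," $1 "," $2 }
' bench.log > bench.csv
```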
The iperf benchmarks were not included because I couldn't get them running on `lugano-1`. Other applications like ALX, lwAFTR, Lisper are not covered because they are not currently in the master repository or integrated with `make benchmarks` (for this kind of work it would be great if they were!).

Analysis
I have not analyzed the data. I would love help from somebody who is good at that!
Here are the questions I would like to answer:
- How does `busywait` affect performance, if at all?
- How does `ndesc` affect performance, if at all?
- Are `busywait` and `ndesc` independent?
- Can we use `busywait=false` as default without hurting performance?
- Can we reduce `ndesc` without losing performance?

Next steps
This is a very simple experiment. Here are some immediate things I would like to improve: