snabbco / snabb

Snabb: Simple and fast packet networking

An experimental performance test #744

Open lukego opened 8 years ago

lukego commented 8 years ago

I did a really basic experiment in the spirit of #688, #741, and #692, following a discussion on the Snabb Slack about tuning engine parameters. I am really keen to do performance-tuning work based on well-defined, repeatable experiments that measure a broad set of applications and workloads. This is a baby step in that direction.

Specifically, this experiment is to see how changing two parameters affects performance as measured by make benchmarks. The first parameter is engine.busywait with values true and false. The second parameter is intel10g.num_descriptors with values 512, 1024, 2048, 4096, 16384, and 32768. There are 12 distinct combinations of parameters and four benchmarks (basic1, packetblaster-64, packetblaster-synth-64, snabbnfv-loadgen-dpdk). I ran this in a loop for a few hours, for a total of 179 tests.

Setup

The tests are driven by a shell script that loops through each combination of parameters. Each parameter value is represented by a Git branch that is merged into the test environment, meaning that selecting a parameter value can make an arbitrary code change.

#!/usr/bin/env bash
set -e
n=1
for busy in master busywait; do
  for ndesc in master ndesc/1024 ndesc/2048 ndesc/4096 ndesc/16384 ndesc/32768; do
    echo "run: $n"; n=$((n+1))
    date
    git checkout -B "${busy}+${ndesc}" master
    # grab the '#!/usr/bin/env bash' change for nixos compat
    git cherry-pick 1f83b5cb143a41ea8d6bfa71e9315cd9859a59a8 >/dev/null
    git merge -q --no-edit $busy
    git merge -q --no-edit $ndesc
    make >/dev/null
    make benchmarks
  done
done

Here is a summary of how the parameter branches look:

$ git log --oneline master..busywait
0d04c04 engine: Change default to busywait=true
$ git log --oneline master..ndesc/16384
e95c998 intel10g: Set TX/RX ring size to 16384 packets

and here is the script that I used to create the ndesc/nnn parameter branches:

#!/usr/bin/env bash                                                                                                                   
set -e
for ndesc in 1024 2048 4096 8192 16384 32768; do
  git checkout -B ndesc/${ndesc} master
  sed -i -e "s/^num_descriptors = .*/num_descriptors = ${ndesc}/" apps/intel/intel10g.lua
  git add apps/intel/intel10g.lua
  git commit -m "intel10g: Set TX/RX ring size to ${ndesc} packets"
done

Data

Data is here: full test log.

Here is an excerpt showing a test result with busywait=true and num_descriptors=1024:

run: 8
Sat Feb  6 10:50:24 UTC 2016
Switched to and reset branch 'busywait+ndesc/1024'
basic1-100e6 32.9 -
packetblaster-64 12.65 -
packetblaster-synth-64 12.80 -
snabbnfv-iperf-1500 0 -
snabbnfv-iperf-jumbo 0 -
snabbnfv-loadgen-dpdk 5.115 -

The data could hopefully be massaged into a suitable format (e.g. CSV) using a suitable tool (e.g. AWK).
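For example (just a sketch, assuming the record layout shown in the excerpt above; the bench.log and bench.csv file names are made up), an AWK pass along these lines could pull the run number, branch, and per-benchmark scores into CSV:

# Sketch only: convert the raw benchmark log into CSV rows of
# run,branch,benchmark,value. Assumes each record has a "run: N" line,
# a date line, a "Switched to and reset branch '...'" line, and then one
# "<benchmark> <score> -" line per benchmark.
awk -v q="'" '
  /^run: /             { run = $2 }
  /^Switched to/       { branch = $NF; gsub(q, "", branch) }
  NF == 3 && $3 == "-" { print run "," branch "," $1 "," $2 }
' bench.log > bench.csv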

The iperf benchmarks were not included because I couldn't get them running on lugano-1. Other applications like ALX, lwAFTR, and Lisper are not covered because they are not currently in the master repository or integrated with make benchmarks (for this kind of work it would be great if they were!).

Analysis

I have not analyzed the data. I would love help from somebody who is good at that!

Here are the questions I would like to answer:

  1. How does busywait affect performance, if at all?
  2. How does ndesc affect performance, if at all?
  3. Are the effects of busywait and ndesc independent?
  4. Can we leave busywait=false as default without hurting performance?
  5. How high can we set ndesc without losing performance?
  6. How suitable is the data for answering these questions? (Enough data points? How much background variation? Outliers to explain?)
  7. Anything else interesting in the data?
  8. Anything that should be done differently in the next experiment?

Next steps

This is a very simple experiment. Here are some immediate things I would like to improve:

lukego commented 8 years ago

cc @domenkozar @eugeneia

sleinen commented 8 years ago

I quickly converted this to CSV, uploaded it to Google Drive, and created a simple line chart: https://docs.google.com/spreadsheets/d/1Guo_iBC5P0i5uAkvJHWirZy_Z9nFxxyzOee9LIrraG4/edit#gid=1289069081 I don't see any significant variation between the branches. A few outliers are due to runs with zero results. In principle this is something that could be automated using the Drive API, although there may be better ways to do this. There are certainly smarter ways to visualize the results; this is just a quick hack.

lukego commented 8 years ago

@sleinen Cool! This chart shines a light on the data in a pretty interesting way.

First, there is quite a bit of variation in the "basic1" benchmark results. The highest and lowest values differ by around 20%, and there is quite a spread in between. It would be interesting to account for this (CPU? OS? JIT?) and to control it (eliminate it with isolcpus? reduce it with longer runs? average over multiple runs?). For the moment the basic1 benchmark does not seem very practical for evaluating the performance impact of code changes if the result really can vary by 20% due to random chance.
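To put a number on that spread, something like the following could summarize each benchmark across all runs (just a sketch; it assumes the run,branch,benchmark,value CSV columns from the earlier conversion and a made-up bench.csv file name):

# Sketch only: per-benchmark min/max/mean from the CSV described above
# (assumed columns: run,branch,benchmark,value; bench.csv is hypothetical).
awk -F, '
  { n[$3]++; sum[$3] += $4
    if (!($3 in min) || $4 < min[$3]) min[$3] = $4
    if (!($3 in max) || $4 > max[$3]) max[$3] = $4 }
  END { for (b in n) printf "%s min=%s max=%s mean=%.2f\n", b, min[b], max[b], sum[b]/n[b] }
' bench.csv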

Second, the packetblaster benchmark seems to have around a 3% failure rate. That is high! This may well be caused by an important bug in our software and warrants investigation.

I am surprised that there is so much low-hanging fruit just from looking at the data. And that is even though the DPDK benchmark results seem to be missing from your visualization, and those are the only ones I expected to be potentially interesting :).

How do you make charts like that in Google Drive? How might I have made it less bothersome to generate CSV?

lukego commented 8 years ago

packetblaster benchmark seems to have around a 3% failure rate. That is high!

(Out of curiosity I just ran the packetblaster benchmark manually 500 times and didn't see any failures. I wonder what happened in the experimental run. Suggests that we should preserve logs/output!)
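A minimal way to preserve output, sketched here under the assumption that make benchmarks is the entry point (the bench-logs directory name is made up), would be to keep every run's full log:

#!/usr/bin/env bash
# Sketch only: keep the full output of every benchmark run so that
# unreproducible failures can be investigated later.
set -e
mkdir -p bench-logs
for i in $(seq 1 500); do
  make benchmarks >"bench-logs/run-$(printf '%03d' "$i").log" 2>&1 || true
done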

kbara commented 8 years ago

Yep, that's an extremely good idea. Nothing like finding a horrifically bad result you can't reproduce and aren't sure if it was an incidental artefact or something real, after all.

lukego commented 8 years ago

@sleinen Oh now I see the "Explore" button on the spreadsheet. That is super convenient!

lukego commented 8 years ago

I also did a Google Sheets import and this time included the DPDK results: https://docs.google.com/spreadsheets/d/165a9OKs5Q3yHxUX15nSrsGvTeLwvwK-U1AyQ7rgskDE/edit?usp=sharing

Does anybody know how to get Google to tell us what is going on there?

The output was a bit of a pain to massage into CSV. I used an Emacs macro but would prefer to use e.g. AWK. I wonder if it would make sense to capture all output so that everything stays amenable to screen-scraping, but also to make known-interesting values easy to pick up, e.g. with a convention like @parameter foo 42 and @result mybenchmark 96, so that values can be cherry-picked into CSV/JSON/etc. It could also be handy to include a bunch of "extra" information, e.g. syslog entries during each test, etc...
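As a sketch of how that convention could pay off (the @parameter/@result tags are just the proposal above; the parameter names and test.log are made up), a single AWK pass would then be enough to emit CSV without any screen-scraping heuristics:

# Sketch only: pick up proposed "@parameter <name> <value>" and
# "@result <benchmark> <value>" lines and emit CSV rows.
awk '
  $1 == "@parameter" { param[$2] = $3 }
  $1 == "@result"    { print param["busywait"] "," param["num_descriptors"] "," $2 "," $3 }
' test.log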

sleinen commented 8 years ago

Second, the packetblaster benchmark seems to have around a 3% failure rate. That is high!

Note that your gist also contained two runs that my stupid Emacs query-replace-regexp didn't convert because they contained the unexpected string "Terminated" between the snabbnfv-iperf-jumbo and snabbnfv-loadgen-dpdk results:

run: 12
Sat Feb  6 13:08:39 UTC 2016
Switched to and reset branch 'busywait+ndesc/32768'
basic1-100e6 32.7 -
packetblaster-64 12.65 -
packetblaster-synth-64 12.69 -
snabbnfv-iperf-1500 0 -
snabbnfv-iperf-jumbo 0 -
Terminated
snabbnfv-loadgen-dpdk 0 -
run: 8
Sat Feb  6 13:49:22 UTC 2016
Switched to and reset branch 'busywait+ndesc/1024'
basic1-100e6 31.9 -
packetblaster-64 12.69 -
packetblaster-synth-64 0 -
snabbnfv-iperf-1500 0 -
snabbnfv-iperf-jumbo 0 -
Terminated
snabbnfv-loadgen-dpdk 0 -

...as well as a third ungrokkable entry at the end. I think that one was truncated, since it is missing the final snabbnfv-loadgen-dpdk line that all the other entries had.

sleinen commented 8 years ago

Does anybody know how to get Google to tell us what is going on there?

  1. Select the appropriate region.
  2. Menu: Insert -> Chart -> choose the "line chart" type, then play around with the options until it looks nice. Maybe use "Recommendations"; this automagically ignores boring columns such as the ones with all-0 results.
  3. Then "Move to separate sheet" the resulting chart using its menu.

I did this to a copy of your spreadsheet:

https://docs.google.com/spreadsheets/d/16pkBveTUa6vFRLSnJle2nqEdmIwVsFZYHjoYdHdoVtw/edit?usp=sharing

Note that I also added a row 1 with the benchmark titles and used them as the legend for the graphs.

I could only use column B as the index, so this completely mixes "master" and "busywait". It would probably be better to combine each master/busywait pair of rows into a single row with separate result columns for master and for busywait, so that the chart contains separate graphs for the two cases.

Probably the even smarter thing to do would be to learn R and statistics and generate publication-quality graphs with the proper statistical metrics on them. But I don't have time for that now :-)

kbara commented 8 years ago

The R/graphing stuff isn't that hard; I had to do it as an undergrad, and it'd take anyone here an hour or two, tops. Deciding which statistical tools are the proper ones to use is a bit hairier. :-)

lukego commented 8 years ago

So many different tools available: R, Torch, IPython, gnuplot+awk/guile/perl...

The coolest link I found today is Agile Visualization and its suite of visualization tools based on Pharo Smalltalk. That looks like a potentially convenient way to import and massage data, do numerical charting like we are doing here, and also visualize things like LuaJIT trace structures and profiler results. See the demo video.

In the past I have started writing bits of code for analyzing LuaJIT trace dumps in Emacs Lisp, but that just feels wrong in this day and age :). Still, these are all exotic tools, so I am sure it is important to keep our options open by generating simple formats, e.g. CSV/JSON/text.