snabbco / snabb

Snabb: Simple and fast packet networking
Apache License 2.0

NFV test matrix preview #976

Open lukego opened 8 years ago

lukego commented 8 years ago

Check out this new NFV test matrix report!

This is a fully automated workflow where Hydra detects changes on branches under test, executes many end-to-end benchmarks (>10,000) with VMs in different setups, and produces a report that shows benchmark results both in broad overview and slice-and-diced by different factors. The tests run on a 10-machine cluster and take around one day to complete. Awesome work, @domenkozar!!!

We should be able to hook this up to upstream branches, including master and next, once we land #969, which allows Snabb NFV to run without a physical hardware NIC. For now it is connected to a couple of branches that do include this PR.

Observations

This is a preview in the sense that we have not been able to thoroughly sanity-check the results yet and some of them are different than we have seen in the past. So take all of this with a grain of salt.

On the Overall graph we can see that the two Snabb branches being tested, matrix and matrix-next, look practically identical.

On the iperf graph we can see results clustered around three places: 0 Gbps (failure), 6 Gbps, and 10-15 Gbps. On the iperf configuration graph we can see the explanation: the filter tests are failing (likely our packet filter is blocking iperf); the ipsec tests are delivering ~6 Gbps; and the baseline and L2TPv3 tests are delivering ~10-15 Gbps.

On the l2fwd graph we can see that most results are failing. The question is, why? I am not sure. The l2fwd success and failure stats may point to QEMU version and packet size being important factors. (These tests are much more reliable with SnabbBot, so what is the difference here? It could be related to the guest kernel.)

Onward

So! Now we can automatically see whether changes are good for Snabb NFV. We want to merge changes that make the humps higher (more consistency) and that move the area under the curves to the right (faster average). Then we also want harsher test cases that reveal problems by spreading the curves out and moving them to the left :).

If this works well in practice then we could gradually add coverage for all the other Snabb applications too.

Feedback?

Questions, suggestions, comments?

kbara commented 8 years ago

Looks awesome overall. The main thing that worries me is the lack of any indication of expected error/noise - when I see two overlapping graphs that diverge a bit, I have no idea how likely it is to be significant or random.

lukego commented 8 years ago

@kbara Yes I see what you mean about the difficulty of knowing whether a variation is significant or random noise. It will be interesting to see how the graphs look when we compare code that actually does have different performance. I have added next to the large matrix test now so we should soon be able to see the effect of fixing the counters performance regression, fixing the filter-blocks-iperf test setup bug, and fixing the errors in the dpdk benchmark (#965). I am guessing that these changes will really jump out of the graphs but we will see.

It could also be that we can improve the plots to convey more information about significance. There are some examples in Plotting distributions (ggplot2).

If anybody feels inspired to muck around with the graphs, here is a quick start:

  1. Install RStudio. Lovely program!
  2. Download the latest large-matrix CSV file from Hydra.
  3. Play with the Rmarkdown reports, either by loading them directly or just copy-pasting a few lines (see the sketch below).
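
For example (just a sketch, not the actual report code): assuming the downloaded CSV is saved locally as bench.csv, a first plot of the iperf score distribution per branch could look like this, using the benchmark, snabb, and score columns that the CSV already has:

library(ggplot2)
d <- read.csv("bench.csv")                  # hypothetical local file name
iperf <- subset(d, benchmark == "iperf")    # keep only the iperf results
ggplot(iperf, aes(score, fill = snabb)) +   # overlay one histogram per branch
  geom_histogram(position = "identity", alpha = 0.5, bins = 30)
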
lukego commented 8 years ago

Couple more observations about the report...

l2fwd by QEMU and DPDK

Wow! Looking more closely at the sliced l2fwd results there are a couple of really significant things: QEMU 2.4.1 works better than the others and DPDK 1.8 doesn't work at all.

Here is how the data looks when we plot the success rate of the l2fwd benchmark separately for each QEMU and DPDK version combination:

library(dplyr)    # group_by, summarize
library(ggplot2)  # ggplot, geom_tile, geom_text
# Success rate = fraction of runs with a non-NA score, per QEMU/DPDK combination
success = summarize(group_by(l2fwd, qemu, dpdk), success = mean(!is.na(score)))
ggplot(success, aes(qemu, dpdk, fill=success)) +
  geom_tile(aes(fill=success)) + scale_fill_gradient(low="red", high="white") +
  geom_text(aes(label = scales::percent(success)))

(Figure: l2fwd success rate per QEMU/DPDK version combination)

and this suggests a lot of interesting ideas:

This is awesome! Go Hydra!

iperf duplicate rows

The iperf success and failure counts are all multiples of 5. The CSV seems to contain duplicate rows like this:

iperf,NA,base,matrix,3.18.29,2.1.3,NA,1,17.3,Gbps
iperf,NA,base,matrix,3.18.29,2.1.3,NA,1,17.3,Gbps
iperf,NA,base,matrix,3.18.29,2.1.3,NA,1,17.3,Gbps
iperf,NA,base,matrix,3.18.29,2.1.3,NA,1,17.3,Gbps
iperf,NA,base,matrix,3.18.29,2.1.3,NA,1,17.3,Gbps

@domenkozar Could this be something in the nix expression? For example, could it be that the matrix includes a row for each different DPDK version (there are 5) and then reuses the same result, since this test does not use that software?

Stats and modelling

Just a thought: R has some really fancy features for fitting data to models, e.g. telling you which factor explains the most variation in the results. So if we write some suitable expressions then R should be able to automatically tell us "QEMU version and DPDK version each have a huge impact on l2fwd results."
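
For instance (just a sketch with base R, using the l2fwd data frame from the CSV; the exact formula here is my guess, not something the report does today):

fit <- lm(score ~ qemu + dpdk + pktsize + config + snabb, data = l2fwd)
anova(fit)   # the per-factor sum of squares shows where the variation goes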

lukego commented 8 years ago

Hypothesis for what is going on with QEMU here:

The original QEMU vhost-user feature that we contributed to QEMU 2.1 did not support resetting the Virtio-net device. This is simply something that we missed and lacked a test for. The consequence is that if the guest shuts down its Virtio-net device and reuses the descriptor memory for other things, this leads to garbage DMA requests, e.g. causing an error in the vswitch or overwriting memory in the guest.

Later we found this problem when we tested rebooting VMs. Reboots failed, presumably due to memory corruption during boot. We fixed this by extending QEMU to send a notification when the guest device shuts down so that the vswitch knows to stop processing DMA requests. This fix went upstream but was later reverted for reasons that I do not fully understand.

So my hypothesis is that QEMU 2.4.1 includes the fix, but none of the others do, and the error is triggered by the hand-over inside the VM from the kernel driver to the DPDK driver.

The reason we did not see this with SnabbBot, even with other QEMU versions, is that we set up those VMs to dedicate the Virtio-net device to DPDK, i.e. we prevented the kernel from initializing it at boot. The new VMs bootstrapped with nix first let the kernel virtio-net driver initialize the device and then hand it over to DPDK. This means that the nix images depend on the QEMU patch to succeed while the SnabbBot ones do not.

I'll get in touch with QEMU upstream and see what they reckon. I would certainly like to include device reset in our test suite and have that working reliably so I am glad that the nix images are exercising this scenario and showing the problem.

lukego commented 8 years ago

Hard to put this stuff down :-).

Next question: How come we still have 1% failures with QEMU == 2.4.1 and DPDK != 1.8.0?

First we could take a peek at some basic statistics about these rows:

# Summarize rows for failing tests with "good" qemu and dpdk
summary(subset(l2fwd,
              subset = qemu == '2.4.1' & is.na(score) & dpdk != '1.8.0',
              select = c(pktsize, config, snabb, dpdk)))

 pktsize   config           snabb       dpdk  
 64 :9   base : 0   matrix     : 3   1.8.0:0  
 256:8   noind: 3   matrix-next: 4   16.04:8  
         nomrg:14   next       :10   2.0.0:2  
                                     2.1.0:3  
                                     2.2.0:4 

Now we have really sliced-and-diced the data: only 17 rows left out of the 37,800 results in the CSV file.

Quick observations:

Question is, how confident can we be about this? I think that the R aov (Analysis of Variance) feature can answer this for us:

passfail <- subset(l2fwd, subset = qemu == '2.4.1' & dpdk != '1.8.0')
# Recode score as a pass/fail indicator: non-NA scores become 1, NAs become 0
passfail$score[!is.na(passfail$score)] <- TRUE
passfail$score[is.na(passfail$score)]  <- FALSE
summary(aov(score ~ pktsize + config + snabb + dpdk, data=passfail))

              Df Sum Sq Mean Sq F value Pr(>F)    
pktsize        1  0.023 0.02315   2.992 0.0838 .  
config         2  0.126 0.06300   8.144 0.0003 ***
snabb          2  0.040 0.01991   2.574 0.0765 .  
dpdk           3  0.038 0.01281   1.656 0.1745    
Residuals   2151 16.639 0.00774                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

I believe this means that we can be 99.9% confident (***) that the configuration is affecting the failures, while the evidence that Snabb version matters is only marginal (the '.' marker means significance at just the 10% level) and DPDK version shows no significant effect here at all.

So now we know of one bug somewhere in the test setup - failures when virtio-net options are suppressed - and we know what further testing to do for context - something like 10,000 more tests with QEMU 2.4.1 and DPDK ~= 1.8 so that we can be sure whether software versions are significant (would be handy to know for debugging purposes).

We could also browse Hydra to find the logs for these 17 failing tests and review them. First step could be dumping the 17 CSV rows in order to identify the relevant test cases.

lukego commented 8 years ago

It looks to me like the reason for the 1% failures is mostly cases that run slowly and hit a 160-second timeout that we have in the Nix expression for the benchmark. It could be an idea to increase that timeout somewhat so that we can better differentiate between slow and failed cases. However, I will hold off on that for the moment because changing the benchmark definition would force Hydra to do a lot of work (rerunning tests for every branch based on the new definitions).

I dumped the list of failed cases like this:

d <- read.csv("/home/lukego/Downloads/bench (1).csv")
subset(d, subset = benchmark=='l2fwd' & qemu=='2.4.1' & dpdk!='1.8.0' & is.na(score))

      benchmark pktsize config       snabb  kernel  qemu  dpdk id score unit
4634      l2fwd      64  nomrg      matrix 3.18.29 2.4.1 16.04 14    NA Mpps
4661      l2fwd      64  noind      matrix 3.18.29 2.4.1 16.04 11    NA Mpps
4904      l2fwd     256  nomrg matrix-next 3.18.29 2.4.1 16.04 14    NA Mpps
4997      l2fwd      64  nomrg matrix-next 3.18.29 2.4.1 16.04 17    NA Mpps
5017      l2fwd      64  noind matrix-next 3.18.29 2.4.1 16.04  7    NA Mpps
5035      l2fwd      64  noind matrix-next 3.18.29 2.4.1 16.04 25    NA Mpps
5269      l2fwd     256  nomrg        next 3.18.29 2.4.1 16.04 19    NA Mpps
5274      l2fwd     256  nomrg        next 3.18.29 2.4.1 16.04 24    NA Mpps
12811     l2fwd     256  nomrg        next 3.18.29 2.4.1 2.2.0  1    NA Mpps
12812     l2fwd     256  nomrg        next 3.18.29 2.4.1 2.2.0  2    NA Mpps
12837     l2fwd     256  nomrg        next 3.18.29 2.4.1 2.2.0 27    NA Mpps
12913     l2fwd      64  nomrg        next 3.18.29 2.4.1 2.2.0 13    NA Mpps
20374     l2fwd     256  nomrg        next 3.18.29 2.4.1 2.1.0  4    NA Mpps
20479     l2fwd      64  nomrg        next 3.18.29 2.4.1 2.1.0 19    NA Mpps
20485     l2fwd      64  nomrg        next 3.18.29 2.4.1 2.1.0 25    NA Mpps
27229     l2fwd     256  nomrg      matrix 3.18.29 2.4.1 2.0.0 19    NA Mpps
28037     l2fwd      64  nomrg        next 3.18.29 2.4.1 2.0.0 17    NA Mpps

Then I manually transcribed these into job names and searched Hydra for their build links:

4634      l2fwd      64  nomrg      matrix 3.18.29 2.4.1 16.04 14    NA Mpps
benchmarks.l2fwd_pktsize=64_conf=nomrg_snabb=matrix_dpdk=16-04_qemu=2-4-1_num=14
https://hydra.snabb.co/build/151505/

4661      l2fwd      64  noind      matrix 3.18.29 2.4.1 16.04 11    NA Mpps
benchmarks.l2fwd_pktsize=64_conf=noind_snabb=matrix_dpdk=16-04_qemu=2-4-1_num=11
https://hydra.snabb.co/build/155072

4904      l2fwd     256  nomrg matrix-next 3.18.29 2.4.1 16.04 14    NA Mpps
benchmarks.l2fwd_pktsize=256_conf=nomrg_snabb=matrix-next_dpdk=16-04_qemu=2-4-1_num=14
https://hydra.snabb.co/build/150561

5269      l2fwd     256  nomrg        next 3.18.29 2.4.1 16.04 19    NA Mpps
benchmarks.l2fwd_pktsize=256_conf=nomrg_snabb=next_dpdk=16-04_qemu=2-4-1_num=19
https://hydra.snabb.co/build/181719

and what I see in most (but not all) of the logs I checked is low speeds (~1Mpps or less).

lukego commented 8 years ago

@kbara I pushed a new version of the report that includes overlay histograms now. The Y-axis also now gives the exact number of results per bin. Better? https://hydra.snabb.co/build/203152/download/2/report.html

(I also added the red-and-white "success rate" tables. I have a fix in the pipeline that will make the l2fwd picture look much better soon.)

kbara commented 8 years ago

Possibly better, but I honestly don't understand what is going on in those graphs without digging deeper. What is 'matrix'? How does 'matrix-next' differ from plain old next, since they're both being measured? What are score and count? What are their units, if applicable?

IE, if I look under 'by benchmark', l2fwd goes up to nearly 12000, but iperf up to about 1000. I can make (uncertain) guesses as to what that means, but it seems unnecessarily opaque.

The success/failure data looks very readable.

lukego commented 8 years ago

@kbara matrix and matrix-next are the names of Snabb branches being tested. (Could alternatively be e.g. master and kbara-next.) So what we are doing here is comparing how different Snabb branches/versions perform.

Score is Gbps (benchmark=iperf) or Mpps (benchmark=l2fwd), i.e. it is a scalar value saying how fast a test went (higher is better). We kind of get away with mixing the units because they are vaguely similar in magnitude.

Count is the number of test results that fell within a range of scores. For example in the first graph, on the left, we can see that every branch had around 6,000 failed tests (histogram bucket at 0).

That help?

lukego commented 8 years ago

Cooooool!

Here is a fresh new NFV test matrix report that is much more exciting. It compares master with the nfv-test branch, which includes two small fixes (#984 and #985) for low-hanging fruit that has been causing most of the failures in this test environment.

@kbara I would be really interested to hear your take on what the report says! (Have only glanced at it so far myself but itching to take a closer look :-))

kbara commented 8 years ago

I still really wish there were units on the axes - I have no idea, on initially looking at the first graph, whether '15' is Gbps (on two cards? bidirectional?) or Mpps. (Your comment above about iperf vs l2fwd, plus the graph below it, makes me think Gbps, but it would be so, so much easier to just read that.) Some of the graphs include words like iperf - they could just say 'iperf - Gbps', though a label on the axis would match my preferences from school and reading research papers.

How does it choose how many times to run a test? IE, on the overall summary, it looks like there are probably a lot more runs of master than nfv-test (certainly a lot more zeros, though the other results look similar) - are there? If so, why? Also, doesn't that make the two smooth graphs impossible to compare without scaling? [Edit: actually, it looks like a density graph, so disregard the last question.]

Why do the nfv-test scores seem to be so much better than the master scores?

I continue to really like how clear this format makes failure data. I think it needs a bit of tweaking on conveying the rest of the data, but it's improving!

lukego commented 8 years ago

@kbara Thanks for taking the time to read all of this and give feedback. I am working this out as I go along so it is very helpful to be able to discuss it.

Zooming out for one sec, I see two distinct problems to solve with these benchmarks:

  1. Deployment planning: measuring the absolute performance of software in different scenarios (hardware, configuration, workload, etc) so that you can deploy it successfully.
  2. Software optimization: measuring the relative performance of software versions to evaluate whether a change would have a positive/negative/neutral impact on performance.

On the one hand for deployment planning we need to really care about what we are measuring and how to interpret the results in real-world terms. On the other hand for software optimization it may be reasonable to treat the benchmark results more abstractly, e.g. simply as "scores" for which higher values are linearly better. (Like e.g. SPECint: I can't remember what workload that represents but I know the GCC people would be thrilled to get a patch that increases the score by 2%.)

Just now I am mostly wearing the "software optimization" hat and that is why I am being loose with units. The benchmark scores are numbers and I want to make them higher (move ink to the right) and more consistent (move ink upwards).

Coming back to your comments: I think we are seeing the limitations of the cookie-cutter approach of generating canned graphs for what is, for now at least, more of an exploration than a quantification. Indeed we can see that nfv-test improves on master by having fewer failures, and more results distributed across the other buckets in the histogram, but we can't really see if there are other effects hidden behind this big and dramatic one e.g. whether successful tests tend to be faster or slower.

We will have to see how the situation evolves with experience, i.e. how much we can effectively summarize with predefined graphs and statistics (as few as possible) and when we need to roll up our sleeves for some data analysis.

Relatedly, here is my R bookshelf at the moment:

New case study

Just at braindump quality but maybe interesting anyway: rough draft analysis of some tests that I ran over the weekend to evaluate the impact of a QEMU patch that never went upstream.

tobyriddell commented 8 years ago

I want to de-lurk briefly... I am doing some work on automated benchmarking of systems to track performance changes/regressions and I came across a paper a couple of weeks ago. Whilst I won't claim to understand it all yet, it might be of some use.

A snippet from the abstract: 'we provide a statistically rigorous methodology for repetition and summarising results that makes efficient use of experimentation time. ... We capture experimentation cost with a novel mathematical model, which we use to identify the number of repetitions at each level of an experiment necessary and sufficient to obtain a given level of precision.'

"Rigorous Benchmarking in Reasonable Time" https://kar.kent.ac.uk/33611/7/paper.pdf

Right, now back to lurking until I have the time to make a proper contribution :-)

lukego commented 8 years ago

@tobyriddell Thanks for the link! Please feel welcome to braindump about performance testing work here :).

I have skimmed the paper. Looks fancy :) I don't immediately grasp the details but I think that I understand the problem they are working on i.e. optimizing use of test resources (e.g. hardware) by running exactly the number of tests needed to reach an appropriate confidence interval.

I am cautious about this fancy approach for two reasons:

First, are they optimizing for answering predefined questions at the expense of post-hoc data exploration? I want to use the datasets both for hypothesis testing and also hypothesis generation i.e. poking around interactively in the dataset to look for interesting patterns ("I wonder if CPU microarchitecture explains some of the variation in IPsec scores?" etc). So there is at least some value for me in generating a large and regular dataset.
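
(As a purely hypothetical example of that kind of poking, assuming the CSV one day grows a cpu column describing the test machine, which it does not have today:)

ipsec <- subset(d, config == "ipsec" & !is.na(score))  # 'ipsec' config name is an assumption
summary(aov(score ~ cpu, data = ipsec))                # does microarchitecture explain the variation?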

Second, I really like the idea of having a clean division of labor between different skill sets:

If these activities are well separated then you only need one skill set to contribute to the project. For example, if R and statistics are your thing then it is nice to be able to simply work on the CSV files without having to operate the CI infrastructure.

End braindump :).

lukego commented 8 years ago

@kbara Braindump... putting on the "characterizing system performance in real world terms" hat for a moment.

Quoth R for Data Science:

The goal of a model is to provide a simple low-dimensional summary of a dataset

I submit that describing the performance of an application is a statistical modeling problem. The goal is to define a model that accurately predicts real-world performance. The model is simply a function whose inputs are relevant information about the environment (hardware, configuration, workload, etc) and whose output is a performance estimate. The goodness of the model depends on how easy it is to understand and how accurately it estimates real-world performance.

So how would this look in practice? Let's take the simplest possible model and then make some refinements.

Model A: Magic number

This simple model predicts that the application always performs at "line rate" 10G.

function A ()
   return 10.0 -- Gbps
end

This model is very simple but there is no evidence in the formulation to suggest that it is accurate.

Model B: Symbolic constant

This model predicts that the application performance is always the same, regardless of environment, but uses a symbolic constant rather than a literal magic number.

function B (k)
   return k.Gbps -- 'Gbps' value in table of constants k
end

The constant k.Gbps could be empirically calculated e.g. as the average result of all benchmarks. This could be done as part of the release process with the result included in the release notes.
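
(In R that calculation could be as simple as something like this, assuming the bench.csv data frame d from earlier and using the unit column to keep only the Gbps results:)

# k.Gbps mirrors the Lua constant name; the data frame 'd' is an assumption
k.Gbps <- mean(subset(d, unit == "Gbps")$score, na.rm = TRUE)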

Model C: Linear with processor speed

This model takes hardware differences into account by estimating that performance will be linearly proportional to the CPU clock speed.

function C (k, e)
   return e.GHz * k.bitsPerCycle
end

Here the end-user supplies the value e.GHz based on their proposed deployment hardware and the software release notes contain the constant k.bitsPerCycle. Together they tell you the expected performance e.g. if e.GHz = 2.4 and k.bitsPerCycle = 2 then the performance estimate is 4.8 Gbps.

Here the performance curve is modeled as a straight line that increases with GHz. The constant k.bitsPerCycle is the slope of the line. The test suite could automatically calculate this constant for a given release by running benchmarks at many clock speeds and performing a linear regression to find the best-fitting gradient. (This would be a simple one-liner in R.)
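
(The R one-liner might look roughly like this, assuming a hypothetical data frame clockbench with GHz and Gbps columns from such a clock-speed sweep:)

# Slope of Gbps vs GHz with the intercept forced to zero = bits per cycle
k.bitsPerCycle <- coef(lm(Gbps ~ 0 + GHz, data = clockbench))[["GHz"]]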

Model D: Many factors

This model is more elaborate and takes a step towards practicality.

function D (k, e)
   local per_core = e.GHz * bitsPerCycle(k, e)
   return per_core * math.pow(e.cores, k.multicoreScaling)
end

-- Return the number of bits processed per cycle.
--
-- Performance depends on whether IPsec is enabled and if so then also
-- on the CPU microarchitecture (because Haswell AES-NI is twice as
-- fast as Sandy Bridge).
function bitsPerCycle (k, e)
   if     e.ipsec and e.haswell then return k.fastIpsecBitsPerCycle
   elseif e.ipsec               then return k.slowIpsecBitsPerCycle
   else                              return k.baseBitsPerCycle end
end

Here the user supplies this information:

And the test suite calculates these constants:

This would be cool, huh? You could really understand a lot about the performance of a given release just by looking at those constants e.g. how close is the scalability to linear, how expensive are the features, how important is the choice of CPU, etc.

I am not really sure if the R side of this would be easy, hard, or impossible. I suspect it would be straightforward by porting that function to R and feeding the benchmark CSV file into some suitable off-the-shelf "nonlinear regression" algorithm, but maybe I am being hopelessly naive.
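
(For the record, a rough sketch of how that could look with base R's nls(), assuming a hypothetical data frame bench with GHz, cores, ipsec and haswell columns plus the measured score; the start values are just guesses:)

fit <- nls(score ~ GHz *
                   ifelse(ipsec & haswell, fastIpsecBPC,
                          ifelse(ipsec, slowIpsecBPC, baseBPC)) *
                   cores^multicoreScaling,
           data  = bench,
           start = list(fastIpsecBPC = 2, slowIpsecBPC = 1,
                        baseBPC = 4, multicoreScaling = 0.9))
coef(fit)   # the fitted constants play the role of the table k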

If we were really working with R at this level of sophistication then we could start to refine the model iteratively. We would compare actual results with predicted results and check for patterns in the "residuals" that indicate important details that our model has missed. This may lead us to refine the model to make better predictions e.g. allowing for e.GHz to have a nonlinear effect (e.g. on a memory-bound workload like lwAFTR with a huge binding table) or including e.bitsPerPacket to account for small packet workloads being more expensive than large packet ones.

Wrapup

Just now I like the dream of using statistical modeling to understand application performance. The model would then represent our actual understanding of how an application behaves in different situations. This could then be communicated to end-users in whatever is the most appropriate way e.g. a table summarizing the most important workloads, an Excel macro that calculates the performance estimate for a given software release, or ... etc.

End brain dump!

kbara commented 8 years ago

Interesting, I've been seeing it in an almost entirely different set of ways. :-)

(The following are musings on performance; I see correctness and being able to eliminate test failures as even more important, but that's another post entirely - and the current tooling is a big step in that direction already.)

To me, 'line rate' is a first class node in my mental model: "on reasonable hardware X and card Y, does app network Z achieve line rate with packet size/distribution alpha"? This leads me to think of comparing two branches or configurations in a handful of ways (I was hashing a bit of this out with someone with a deep stats background a couple weeks ago):

a) Raw averages: which one is better? The graphs so far are good at showing this.

b) The autocorrelation of the test results. IE, if we have 70% of tests failing, and we commit a fix that makes that 0% but halves the speed, 'a' (with failures normalized to zero) would show this as an improvement - which it is - but I'd still be extremely interested in knowing about the speed drop to see if there was any reasonable way to get the best of both worlds. Similarly, bimodal performance seems to come up quite often; I want to be able to understand and characterize that, reduce variance, increase the performance of the worst cases, etc. The graphs partly show this; we could also directly apply run lengths (googling for 'run length' and 'autocorrelated' seems to bring up a lot of paywalls).
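
(For the run-length idea, a quick sketch using base R's rle(), assuming the rows are ordered by the id column within a configuration; the particular config/qemu filter is just an example:)

nomrg <- subset(l2fwd, config == "nomrg" & qemu == "2.4.1")
nomrg <- nomrg[order(nomrg$id), ]
rle(!is.na(nomrg$score))   # lengths of consecutive pass/fail streaks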

Questions I'd concretely like to answer include:

A bit more pie-in-the-sky, I could imagine using tooling like this to automatically tune things like the number of packets per breath for particular hardware, if that seemed like the kind of parameter that performance was actually sensitive to across different hardware. I see this kind of tooling as being able to give us really rich information about what matters, and whether it always matters in the same way. It's easy to overlook how much tuning constants can matter - and we can take a lot of the guesswork out of it, and empirically see how it works on different hardware.