snabbco / snabb

Snabb: Simple and fast packet networking

CI performance tests: a scientific approach #688

Open lukego opened 8 years ago

lukego commented 8 years ago

Here is an idea for how we could upgrade our CI performance tests by taking a scientific approach.

(Note: I am not a scientist. This is stealing good ideas that I learned from working a lot with my friend Christophe Rhodes who is a programmer with a background in physics. I am sure he would write this all better than I do but hey...)

The idea is to investigate the behavior of Snabb Switch with much the same scientific process that physicists use to investigate the behavior of the universe:

  1. Formulate hypotheses about the software.
  2. Define testable predictions.
  3. Run experiments to generate data.
  4. Analyze the data and use it to verify/falsify/revise the hypotheses.

This would split testing into distinct activities that could be done by cooperating groups, much as theoretical physicists work with experimental physicists, who in turn work with lab technicians, and so on.

How it might work

Sound vague? Here is a concrete idea of how it might look:

Formulate hypotheses:

  1. Snabb Switch performance is equal or better in each release.
  2. Snabb NFV delivers 10 Gbps of bandwidth per port.
  3. Snabb NFV is compatible with all versions of QEMU, Linux, FreeBSD, and DPDK.
  4. Snabb NFV performance is independent of negotiated Virtio-net options.
  5. Packetblaster can saturate any number of ports with 64-byte packets.

Make test cases:

  1. Create benchmarking programs to report on relevant metrics.
  2. Create test environments that can run any interesting software version or configuration.

Run experiments:

  1. CI continuously runs experiments and records the results.
  2. CI samples the space of possibilities, e.g. randomly choosing software versions and configurations. There could be a billion possible combinations of software and configuration and we may only be able to run a hundred tests per day: our job is to generate interesting data.
  3. CI outputs a CSV file containing the results of every test ever done, including all necessary metadata (dates, hardware platforms, software versions, etc.); see the sketch just below this list.
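
Here is a rough sketch of how steps 2 and 3 could fit together, written in Lua since that is what we use everywhere else. The parameter names, values, and CSV layout are purely illustrative placeholders, not the set we would actually test:

-- Sketch: pick one random point in the configuration space and append
-- the result, with metadata, to an ever-growing CSV file.
-- All parameters and values below are illustrative placeholders.
local space = {
   snabb   = {"v2015.08", "v2015.09", "master"},
   qemu    = {"2.1", "2.3", "2.4"},
   dpdk    = {"1.7", "1.8", "2.0"},
   pktsize = {64, 256, 1500, 9000},
}

math.randomseed(os.time())

local function sample (space)
   local config = {}
   for name, values in pairs(space) do
      config[name] = values[math.random(#values)]
   end
   return config
end

local function record (csvpath, config, score)
   local f = assert(io.open(csvpath, "a"))
   f:write(("%s,%s,%s,%s,%s,%s\n"):format(
      os.date("!%Y-%m-%dT%H:%M:%SZ"), config.snabb, config.qemu,
      config.dpdk, config.pktsize, score))
   f:close()
end

local config = sample(space)
-- run_benchmark() is a stand-in for building the environment and running
-- the actual test; it does not exist yet.
-- record("results.csv", config, run_benchmark(config))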

Analyze results:

  1. Feed the CSV file into tools like R, Torch, Gnuplot, Excel, etc.
  2. See which hypotheses hold and which ones don't.
  3. Explain the results:
    • "Performance dropped in v2015.09 with DPDK >= 1.8 and virtio-net indirect descriptors."
    • "Packetblaster cannot sustain line rate with NICs on multiple NUMA nodes."
    • "Jumbo frame tests failed with DPDK 1.7."
    • "Snabb Switch performance became much more consistent between releases since June 2015."

What tools we might use

Hypotheses: work these out with pen and paper.

Test cases: Nix expressions to define the software, configurations, and test scripts to run.

Experiments: Hydra to automatically create a queue of tests to run, work through them, and archive results.

Analysis: Complex stuff by hand + simple stuff detected by SnabbBot and reported to Github as it happens.
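
For the "simple stuff detected by SnabbBot" part, here is a minimal sketch of the kind of check I have in mind, again in Lua and with made-up numbers (the threshold and scores are only for illustration):

-- Sketch: flag a benchmark whose latest score falls well below the
-- historical mean (more than two standard deviations). Numbers are made up.
local function mean_sd (scores)
   local sum = 0
   for _, s in ipairs(scores) do sum = sum + s end
   local mean = sum / #scores
   local var = 0
   for _, s in ipairs(scores) do var = var + (s - mean)^2 end
   return mean, math.sqrt(var / #scores)
end

local function regression (history, latest)
   local mean, sd = mean_sd(history)
   return latest < mean - 2*sd, mean, sd
end

-- Example: historical scores for one benchmark, plus the newest run.
local latest = 11.9
local bad, mean, sd = regression({14.2, 14.6, 14.5, 14.4}, latest)
if bad then
   print(("possible regression: %.2f vs mean %.2f (SD %.2f)"):format(latest, mean, sd))
end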

End

End braindump! What do we think?

This may sound like overkill but actually I suspect it is the only way that really works. Manual testing tends to miss important cases, to see patterns that aren't really there when results vary, and to take an immense amount of time to produce a small amount of data.

Have to talk with Rob Vermaas about how closely this relates to the Hydra-based performance CI system that they have deployed for their LogicBlox database.

cc @eugeneia @domenkozar

lukego commented 8 years ago

I suppose that I didn't really state the problem...

We do already have a solid CI system including performance regression tests. This checks every PR before merge and defends against regressions.

This idea is about exploring the universe of possible tests to see what we can discover. It may actually run separately from our existing CI-for-PRs, as a background process that generates actionable data.

Just to quantify a bit, suppose we were interested in testing with all of these aspects:

Then if we enumerated all of the tests we would have 10 * 5 * 10 * 16 * 2 * 2 * 2 * 2 * 3 = 384,000 test scenarios.

If each test took one minute then it would take more than six months (384,000 minutes is roughly 267 days) to run every scenario, and the total would multiply further with each new variable we added to the tests (IOMMU setting, NIC vendor, ...).

Alternatively if we could sample the test universe in some suitable way then we could be adding over a thousand results to our database each day and use these to answer interesting questions:

Clever people could answer such questions definitively just by looking at the CSV files that we provide -- or at least tell us what additional data we need to collect to provide an answer.

lukego commented 8 years ago

Here is an example of the kind of analysis that can be done using R to process the CSV data: Exploratory Multivariate Analysis. This kind of analysis is a simple everyday thing for many people and it does not necessarily even require much information about what the data means or how it is generated. Just a matter of finding relationships between variables e.g. what combinations predict high performance, low performance, failed tests, etc.

petebristow commented 8 years ago

Sounds like a good plan. Have you done much work on making it a reality? Are all the existing benchmarks based around the scripts in the bench/ directory? I think adopting a standardized benchmark-run data format would be good; JSON? The existing benchmarks just give a pps value as output. Have you done any work on this? I've got some preliminary work on a 'benchmark' app that works in a similar style to #690, allowing embedded micro-benchmarks for each app. I've used it to start looking into questions such as

All my use cases involve high pps with small packets and none use VMs, so it would be great if whatever goes forward isn't completely NFV-centric.
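
On the data-format question, here is a rough sketch of what one standardized benchmark-run record might contain, as a Lua table encoded to JSON. The field names and values are only a suggestion, and the lua-cjson library is assumed to be available:

-- Sketch of one benchmark-run record; all fields and values are a
-- suggestion only, not an agreed format.
local cjson = require("cjson")  -- assumption: lua-cjson is installed

local record = {
   benchmark  = "basic1-100e6",
   score      = 14.54,          -- primary metric for the run
   unit       = "Mpps",         -- illustrative; whatever the benchmark reports
   packetsize = 64,
   snabb      = "v2015.10",     -- software versions under test
   qemu       = "none",
   timestamp  = os.date("!%Y-%m-%dT%H:%M:%SZ"),
   host       = "lab-server-1", -- hardware / platform metadata
}

print(cjson.encode(record))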

lukego commented 8 years ago

@petebristow Cool!

Great that you are working on factoring out test and benchmark code to make it easier to use systematically. This is really valuable stuff and AFAIK nobody else is working on that now. I really relate to the questions you have formulated and the parameters you want to explore, e.g. the impact of engine settings. I would love to be able to simplify the engine, e.g. by removing configuration knobs where performance tests can demonstrate that there is one setting that always works well (e.g. with engine.Hz).

Related activities:

First, we have Continuous Integration for Snabb Switch, i.e. SnabbBot, which has been diligently benchmarking every PR submitted to Github. SnabbBot has an archive of more than a thousand test results stored in Gists. You can see the performance test results at the start of these logs:

Checking for performance regressions:
BENCH basic1-100e6 -> 1.0055 of 14.54 (SD: 0.185472 )
BENCH packetblaster-64 -> 1.00019 of 10.584 (SD: 0.0162481 )
BENCH snabbnfv-iperf-1500 -> 0.906008 of 5.426 (SD: 0.471703 )
BENCH snabbnfv-iperf-jumbo -> 0.974238 of 6.366 (SD: 0.239633 )
BENCH snabbnfv-loadgen-dpdk -> 1.00709 of 2.652 (SD: 0.0147919 )

This system is available today, it's easy to add more benchmarks to, and each benchmark will automatically be checked for regressions and its results stored on Github. This is the base that we are building on.

The main part that is not covered today is to define a large number of parameters and to have the CI explore them in a systematic way. This is complicated for applications like NFV that have complex dependencies that can be large/slow to build. The test framework is not specific to NFV/VMs but it does have to support that application.

Now, moving forward with testing more combinations on more servers, I am keen to take some time to explore references like Setting up a Hydra build cluster for continuous integration and testing to see whether there are kindred spirits who have already developed the kind of tooling that we need. The Nix community in particular seem to have excellent taste and to have invested a lot of creativity in the problem of "build and test". I am talking with @domenkozar about bringing up a Hydra instance for Snabb Switch that we can experiment with.

domenkozar commented 8 years ago

Here are my 2 cents on the topic.

As @lukego stated in the first comment, testing all possible scenarios (different versions of the software we integrate with) means an explosion of possibilities.

Regression tests

The alternative, hopefully easier to pull off, is to pin the collection of software at a point in time and run regression tests (with a CI) against that whole collection.

Think of the inputs to the test suite as a snapshot of all the software. Then, whenever we update snabb or the collection of software used for integration, we run all the regression tests.

If we find a regression in performance, we know that either snabb changed from commit X to Y or the collection of software changed from Z to W. If we don't bump both at the same time, it's easier to pinpoint which commit caused the regression. Once we have a single range of commits, we just git bisect them using the test that exposes the regression.

If we wanted, we could still have different sets of software against which to run the tests. For example, two versions of OpenStack. Or, more generally, the exact versions that are (or will be) used in production.

Example

The tests for Nix itself are a good example. Looking at http://hydra.nixos.org/jobset/nix/master you'll see that there are just two moving targets under the "Input changes" column: the nix and nixpkgs (software collection) repositories.

If the tests executed provided a formatted output of performance results, some other script/entity could collect those and figure out if there was a regression.

Hydra has this functionality partially implemented. For example, we graph the closure size (the whole runtime dependency tree) for EC2 instances to see if something caused the size to grow more than we'd like.

In this case, if someone bumped Qemu in Nixpkgs, the regression would be detected by our tests once we pinned a newer commit. We should strive to do these updates at the smallest possible intervals so it's easier to pinpoint problems (ideally for each commit, but that's overkill).

Pros

Cons

lukego commented 8 years ago

@domenkozar Good description. I'd say that you are more-or-less describing the CI that we already have. This is effective for catching regressions on a small set of tests with one reference platform.

Now I want to take the next step beyond this and start searching for problems in a larger space of configurations and dependency versions ("scientific testing" until someone points out a better name). The goal is to be able to show, based on data, how well a given Snabb Switch release works with different CPUs, network cards, Linux distributions, virtual machines, and so on.

The same mechanism may work for both: CI that tests a 2-tuple of snabb+environment. The difference would be that for basic regression tests there would be one environment, whereas for scientific tests there would be (say) a thousand environments chosen pseudo-randomly from a space of a billion possibilities. Could be that we set up a Hydra that has two Git repos as input and that another process (SnabbBot) creates random permutations of environments and pushes them into Git for testing, for example?

Zooming out a bit...

By analogy, suppose that we were developing a web browser like Firefox and we were taking these two approaches to testing:

For regression tests we would test very specific things: for example, load the same content from the same Apache version, simulate the same UI events, and assert that the browser produces the expected image (screenshot). Or feed a hundred programs to our JavaScript engine and check that they all produce the expected results. And so on. These tests would tell us that our changes are not breaking obvious things.

The scientific tests would be to tell us when the tests are breaking subtle things in the real world. This requires that the test environment has a comparable amount of variation to the deployment environment. For example, in searching for problems we would want to test with many different independent variables like:

  1. Website to load: chosen from one of the million most popular ones.
  2. Operating system: Linux, OSX, Windows, iOS, Android plus various sub-variables (corporate/OEM additions, localization, virus scanners, etc).
  3. IPv4/IPv6/dual-stack.
  4. Geographic region.

This kind of test coverage would be needed to ensure that we can ship a major new release to millions of people with confidence that it will have fewer problems than the previous one. People who really do build browsers can perhaps gather this data in semi-automated ways, e.g. by having a thousand people download a nightly build that is instrumented to report back to upstream, but for network equipment we don't really have that luxury and I think we need to generate the variation ourselves (or else we will be forever on the back foot reacting to bug reports without context).

Coming back to topic...

Today our CI tells us that the software is working and performing as expected with simple configurations and running on a reference platform modelled on the latest Ubuntu LTS. This means that we can ship new releases and people can deploy them on that platform. However, we need to do much better than this. There are a tremendous number of configurations, platforms, VMs, etc. that people can use. I want the CI to keep track of how well we are working with all of these. I want to avoid the situation where most users have a bad experience because they deploy with software different from what our CI covers.

plajjan commented 8 years ago

Then if we enumerated all of the tests we would have 10 * 5 * 10 * 16 * 2 * 2 * 2 * 2 * 3 = 384,000 test scenarios.

The last "3" representing three different CPUs probably means that you have three machines, thus these machines can run the tests in parallel and you end up with 128k tests, not 384k. With a minute per test it's 88 days. Do we run tests in parallel on the same box? If you could run 10 tests in parallel you have reduced running time to ~9 days, which is actually manageable. This could of course have negative impact since concurrent processes compete for same cache and so forth. I imagine the scientific CI to run in the background and whenever it identifies "problematic combinations", ie a certain combination of inputs that diverges from a "baseline" (some average of all runs!?), you could a) have a human look at it and/or b) add it to SnabbBot so that this combination is then run for all subsequent commits.

Maybe you don't need to run this type of testing for every commit/PR but instead focus on every release. If that's the case, you actually have a few months to run a complete suite. Or you run it as fast as it can go, i.e. if a run takes 9 days then you start the next run with whatever code has landed during those 9 days.

If you add more parameters or if you are unable to run tests in parallel then it certainly becomes more attractive to just do random sampling of the whole combination space.

For the more exhaustive option, maybe some parameters can be excluded, i.e. if results are virtually identical over all versions of QEMU then most of those versions can simply be removed from future testing.

lukego commented 8 years ago

Just another thought to throw onto the pile, following @plajjan's train of thought...

It would be neat if we could run these tests in "SETI@HOME" style. That is: if we have 10-20 servers in our lab we could have them detect when they are idle and automatically start running tests e.g. in a container that can be instantly aborted if a developer starts working on that machine.

I am also wondering whether we can split this whole problem into several independent parts:

  1. Define a test environment with many variables (software versions, configurations, etc) and run it.
  2. Archive test results in a convenient place and with enough metadata for analysis.
  3. Execute bulk tests in a continuous way with automatically chosen variable settings.