quantumlib / Cirq

A Python framework for creating, editing, and invoking Noisy Intermediate Scale Quantum (NISQ) circuits.
Apache License 2.0

Add performance testing #1124

Closed. Strilanc closed this issue 3 years ago.

Strilanc commented 5 years ago

Given that we've now had our first proper performance bug (kak_decomposition being very slow), I think it's time that we added some kind of performance benchmarking process that can be run by the continuous integration scripts.

On another project, the way I did this was by decorating tests with a "performance goal" like this:

@perf_goal(avg_micros=50, args=[cirq.random_unitary_matrix()])
def test_kak_decomposition(matrix):
    return cirq.kak_decomposition(matrix)

And the test would fail if the code ran slower than the goal (and warn if it ran significantly faster than the goal). One particularly nice aspect of this approach is that the decorator can use the expected time to decide how many repetitions to perform, in order to balance statistical noise against total running time.
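For illustration, a minimal sketch of how such a decorator might pick the repetition count from the stated goal; the helper name and the one-second time budget are assumptions for the example, not part of any existing API:

def choose_repetitions(avg_micros, time_budget_micros=1_000_000, min_reps=3):
    # Spend roughly a fixed time budget per test: slower goals get fewer
    # repetitions, faster goals get more to average out statistical noise.
    return max(min_reps, int(time_budget_micros // max(avg_micros, 1)))

# A 50 microsecond goal with a 1 second budget yields 20000 repetitions,
# while a 500 millisecond goal falls back to the minimum of 3.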

Overall, this issue involves:

  1. Implementing the @perf_goal decorator (or some other mechanism for specifying performance tests)
  2. Coming up with a convention for where we put performance tests (e.g. file_name_perf.py)
  3. Hacking pytest's test discovery to ignore/include performance tests as appropriate (see the sketch after this list), etc.
  4. Checking that the performance of code running on Travis-CI is actually stable enough for this to be meaningful (that being said, we don't really need high precision measurements; this is just to put pressure against gradual decay).
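For item 3, one possible approach is pytest's standard marker mechanism; the --perf flag and the perf marker name below are assumptions, not an existing convention:

# conftest.py: skip performance tests unless pytest is invoked with --perf.
import pytest

def pytest_addoption(parser):
    parser.addoption("--perf", action="store_true", default=False,
                     help="also run performance tests")

def pytest_configure(config):
    config.addinivalue_line("markers", "perf: marks a performance test")

def pytest_collection_modifyitems(config, items):
    if config.getoption("--perf"):
        return
    skip_perf = pytest.mark.skip(reason="needs --perf to run")
    for item in items:
        if "perf" in item.keywords:
            item.add_marker(skip_perf)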

A particularly tricky issue here is that performance bugs introduced by PR A may start causing flaky failures in PRs B-Z. It's not clear that we should block merging due to performance bugs, but then it's also not clear how we incentivize actually fixing the issues.

vtomole commented 5 years ago

Can we have CirqBot create a new issue for every performance test that fails? Or maybe one issue per pull request that breaks performance tests, since the first approach could create too many issues.

Strilanc commented 5 years ago

That's a decent idea. Run the performance tests on PRs, but only really enforce them on master commits. When one of the perf tests starts failing, open an issue (or update the existing issue).

vtomole commented 5 years ago

Assuming that the assert statement is in the decorator, if

@perf_goal(avg_micros=50, args=[cirq.random_unitary_matrix()])
def test_kak_decomposition(matrix):
    return cirq.kak_decomposition(matrix)

fails, pytest will show that it failed in the decorator instead of in test_kak_decomposition(). How do you get around that?

Strilanc commented 5 years ago

The decorator should produce a new test method that performs the actual testing. The new test method can also output stats such as what time was actually measured (since you usually want to hover somewhere around half of the goal due to statistical noise).

Strilanc commented 5 years ago

Something like this:

import functools
import time

def perf_goal(avg_micros=0, args=(), repeats=100):
    def decorate(func):
        @functools.wraps(func)  # Keep the original test's name in pytest output.
        def wrapped():
            start = time.perf_counter()  # Average repetitions to reduce noise.
            for _ in range(repeats):
                func(*args)
            avg_us = (time.perf_counter() - start) / repeats * 1e6
            assert avg_us < avg_micros
        return wrapped
    return decorate

vtomole commented 5 years ago

After some thought, I believe that using pytest-benchmark (https://pytest-benchmark.readthedocs.io/en/latest/) is better than rolling our own solution. The recent pull request calls benchmark() with our tests, which could mean slower build times. This raises the question: do we want to run the benchmarks on every build? We could separate the benchmark tests and run them occasionally (once a week?). This could be done by passing --benchmark-skip (which skips any tests that contain benchmarks) to pytest-benchmark on regular builds.

As for comparisons, when provided with --benchmark-autosave, pytest-benchmark generates a .benchmarks/ directory containing performance data for the benchmarks as .json files. This data can be compared with future builds using --benchmark-compare. Travis doesn't persist this data, so we will need to post it to cloud storage and wget it when needed.

vtomole commented 5 years ago

@Strilanc Having a file_perf_test.py will (almost) duplicate code. For example:

decompositions_test.py

def test_kak_decomposition(target):
    kak = cirq.kak_decomposition(target)
    np.testing.assert_allclose(cirq.unitary(kak), target, atol=1e-8)

decompositions_perf_test.py

def test_kak_decomposition(target, benchmark):    
    kak = benchmark(cirq.kak_decomposition, target)
    np.testing.assert_allclose(cirq.unitary(kak), target, atol=1e-8)

Is that fine?

Strilanc commented 5 years ago

Yes, that's fine.

dabacon commented 5 years ago

We now have performance tests, right? Should this be closed?

vtomole commented 5 years ago

Not yet. The current implementation is not reliable because we don't persist performance data between builds. We need to upload the generated .json files to a remote location so we can reference them, and implement a script that creates an issue if incremental benchmark results regress beyond a threshold.
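A rough sketch of what such a comparison script could look like, assuming pytest-benchmark's saved JSON layout (a top-level "benchmarks" list whose entries carry a "stats" dict with a "mean") and an illustrative 20% threshold; the file names and the issue-filing step are placeholders:

import json

THRESHOLD = 1.20  # Hypothetical: flag tests more than 20% slower than baseline.

def find_regressions(baseline_path, current_path):
    # --benchmark-autosave writes JSON with a "benchmarks" list; each entry
    # has a "name" and a "stats" dict containing the mean runtime in seconds.
    def means(path):
        with open(path) as f:
            data = json.load(f)
        return {b["name"]: b["stats"]["mean"] for b in data["benchmarks"]}

    old, new = means(baseline_path), means(current_path)
    return {
        name: (old[name], mean)
        for name, mean in new.items()
        if name in old and mean > old[name] * THRESHOLD
    }

if __name__ == "__main__":
    slower = find_regressions("baseline.json", "current.json")
    for name, (old_mean, new_mean) in slower.items():
        # A CI step could open or update a GitHub issue here instead of printing.
        print(f"{name}: {old_mean:.6f}s -> {new_mean:.6f}s")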

dabacon commented 5 years ago

OK, so now it looks like we have this working and checking for regressions. Close?

balopat commented 4 years ago

@vtomole probably in a week or two we'll have a GCP project where we will be able to store these files, and I will be able to add you as a collaborator - are you interested in finishing this work then?

tanujkhattar commented 3 years ago

Currently, we have a bunch of performance tests using pytest-benchmark, and the goal is to have infrastructure where performance can be visualised across commits so that it is easy to monitor trends over time (something similar to https://pv.github.io/numpy-bench/).

To achieve this, we can consider one of the following two options:

  1. Airspeed Velocity (https://github.com/airspeed-velocity/asv)
  2. Continuous Benchmark GitHub Action (https://github.com/marketplace/actions/continuous-benchmark)
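For reference, asv discovers benchmarks by naming convention (methods prefixed with time_). A minimal sketch of what one Cirq benchmark might look like under asv; the file name and the use of cirq.testing.random_unitary are assumptions for the example:

# benchmarks/bench_decompositions.py
import cirq

class KakDecompositionSuite:
    def setup(self):
        # asv calls setup() before timing, so input generation stays out of
        # the timed region.
        self.matrix = cirq.testing.random_unitary(4)

    def time_kak_decomposition(self):
        cirq.kak_decomposition(self.matrix)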

vtomole commented 3 years ago

I was looking into Airspeed Velocity for this before other things stole my time, so I'm a bit biased towards it. I still vote for it because the charts look so much better than the GitHub Actions charts. Those pretty charts are worth the bigger investment! Please bring this up at the Sync and be sure to show the Airspeed charts :grinning:.

balopat commented 3 years ago

One thing I love about ASV is that it makes it easy to see a long history of the repo with the thumbnail charts (e.g. the numpy one has its first data point from 2015, though that is not the oldest commit; they started in 2005 or something?). Did you explore how you could do that with the github action benchmark? One aspect is knowing the current performance, but it would be important to see the trends over time. Is it just a matter of running a job that generates the json data historically? I guess we'll have to do that for ASV as well.

tanujkhattar commented 3 years ago

Generating historical data and plotting it is something that we'll have to figure out for both approaches. Once set up, both approaches would automatically generate performance data for every commit on master and show the numbers on the plots over time.

For github actions, they have a max-items-in-chart parameter that can be used to control how much history we want to preserve for each chart. Also, for github actions, charts are rendered using Chart.js and it should be possible to modify the index.html to render better charts using the same data, if needed.

The bigger question is probably which one of pytest-benchmark or asv we want to use as the benchmarking framework.