Closed: eb8680 closed this issue 6 years ago
@neerajprad we should probably revisit this now that we're adding more algorithms. Should we keep this issue open as a tracker or close it and open more specific ones? E.g. in #622 @ngoodman requested that we port some MCMC tests from webPPL.
> Should we keep this issue open as a tracker or close it and open more specific ones?
Most of the action items from V0 are done. We can close this and track the remaining task of refactoring integration tests / putting in new tests for MCMC algorithms in #634.
(This is a meta-issue for discussing the overall architecture and goals of the new test infrastructure, making a roadmap, and tracking progress on its constituent issues. I'll continue filling it in and adding other issues, but I'm posting it now so we can start discussing sooner. I'm also collecting all of the related issues in this GitHub Project.)
Architecture
After some reading and several discussions, especially with @ngoodman and @neerajprad, we settled on something like this (note that many of these pieces already exist in some form in the current tests but will need refactoring):
Deterministic tests
Pyro instrumentation
The Pyro-specific tests should sit on top of and reuse tools for monitoring, evaluating, criticizing, and visualizing Pyro models. We can discuss those tools and their interfaces in separate issues:
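For example, one basic building block could be trace-based inspection of model runs. Here is a minimal sketch using today's `pyro.poutine.trace` API as a stand-in for whatever interface we settle on; the toy model is purely illustrative:

```python
import torch
import pyro
import pyro.distributions as dist
from pyro import poutine


def model():
    # toy model: a latent location with one observation
    loc = pyro.sample("loc", dist.Normal(0., 1.))
    pyro.sample("obs", dist.Normal(loc, 1.), obs=torch.tensor(0.5))


# record one execution of the model and inspect its sample sites
trace = poutine.trace(model).get_trace()
print("joint log prob:", trace.log_prob_sum().item())
for name, site in trace.nodes.items():
    if site["type"] == "sample":
        print(name, "observed" if site["is_observed"] else "latent", site["value"])
```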
Stochastic test platform/library
See #101 and also this blog post. Basically, although we're currently most interested in testing our inference algorithms, down the line there will be many other stochastic programs we'll want to test hypotheses about. It seems like we should be able to build a lightweight platform that handles this more general use case.
This platform should be able to:
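To make that concrete, here is a minimal sketch of the kind of reusable helper such a platform could provide; `assert_mean_within` and its parameters are hypothetical names, not an existing Pyro API:

```python
import math
import random


def assert_mean_within(sampler, expected_mean, n=10_000, z=4.0):
    """Draw n samples from a zero-argument sampler and require the empirical
    mean to lie within z standard errors of expected_mean.  With z=4 the
    false-failure rate is roughly 6e-5 per assertion."""
    xs = [float(sampler()) for _ in range(n)]
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    stderr = math.sqrt(var / n)
    assert abs(mean - expected_mean) <= z * stderr, (
        f"empirical mean {mean:.4f} is not within {z} standard errors "
        f"({stderr:.4f}) of {expected_mean:.4f}")


# e.g. a fair coin flip has expected value 0.5
assert_mean_within(lambda: random.random() < 0.5, 0.5)
```

A fuller version would presumably add other statistical assertions (variance, distributional distance) and manage the overall false-failure rate across the whole suite.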
Stochastic unit tests
Pyro has many small stochastic components that should be tested individually:
- `map_data`: #93

Each of these requires a different set of bespoke tests implemented on top of the test platform.
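As one illustration of what a bespoke test could look like, here is a minimal sketch of a minibatch-scaling check, written against the current `pyro.plate` API (the successor to `map_data`) rather than `map_data` itself; the model and data are illustrative:

```python
import torch
import pyro
import pyro.distributions as dist
from pyro import poutine

data = torch.randn(100)


def model(idx):
    # observe a minibatch of the data under a fixed unit normal
    with pyro.plate("data", len(data), subsample=idx) as ind:
        pyro.sample("obs", dist.Normal(0., 1.), obs=data[ind])


idx = torch.tensor([0, 1, 2, 3, 4])
scaled = poutine.trace(model).get_trace(idx).log_prob_sum()
unscaled = dist.Normal(0., 1.).log_prob(data[idx]).sum()
# subsampling should rescale the minibatch log-likelihood by N / batch_size
assert torch.allclose(scaled, unscaled * (len(data) / len(idx)))
```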
Pyro stochastic integration tests
Most of the current Pyro tests are actually integration tests: they run an inference algorithm with a particular model and guide and compare empirical posterior statistics against ground truth to decide whether the test passes or fails. We want to make this more systematic and make the results less noisy and more useful. We also want to compare the runtime of some examples against previous versions to catch performance regressions.
To that end, integration tests should be generated automatically from the following components:
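For concreteness, a hand-written version of such an integration test (before any auto-generation machinery) might look like the sketch below, which fits a mean-field guide to a conjugate normal-normal model and compares the fitted posterior to the analytic one; the model, step count, and tolerances are illustrative, and a real test would replace the hard-coded tolerances with a statistical criterion:

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.optim import Adam
from torch.distributions import constraints

data = torch.tensor([0.2, 0.5, 0.8, 1.1])
# conjugate normal-normal: the posterior over loc is available in closed form
post_var = 1.0 / (1.0 + len(data))
post_mean = data.sum().item() * post_var


def model(data):
    loc = pyro.sample("loc", dist.Normal(0., 1.))
    with pyro.plate("data", len(data)):
        pyro.sample("obs", dist.Normal(loc, 1.), obs=data)


def guide(data):
    loc_q = pyro.param("loc_q", torch.tensor(0.))
    scale_q = pyro.param("scale_q", torch.tensor(1.), constraint=constraints.positive)
    pyro.sample("loc", dist.Normal(loc_q, scale_q))


pyro.clear_param_store()
svi = SVI(model, guide, Adam({"lr": 0.05}), loss=Trace_ELBO())
for _ in range(2000):
    svi.step(data)

# pass/fail decision: compare fitted posterior statistics to the analytic ones
assert abs(pyro.param("loc_q").item() - post_mean) < 0.1
assert abs(pyro.param("scale_q").item() - post_var ** 0.5) < 0.1
```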
V0
There's a ton of work to do here, but fortunately we're not that far away from a minimal working prototype of the whole thing that should solve a lot of our immediate problems.
Basically, I think we need to:
- Write stochastic unit tests for `map_data` and gradient estimators that directly test for correct behavior of those features
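As a sketch of what a direct gradient-estimator check could look like (written here against plain PyTorch rather than Pyro's internals), the snippet below verifies that a score-function surrogate recovers a known analytic gradient:

```python
import torch

# Score-function (REINFORCE) estimate of d/dmu E_{x ~ N(mu, 1)}[x]; the true gradient is 1.
mu = torch.tensor(0.3, requires_grad=True)
n = 200_000
x = torch.distributions.Normal(mu, 1.0).sample((n,))
log_prob = torch.distributions.Normal(mu, 1.0).log_prob(x)
# surrogate objective whose gradient is the score-function estimator
surrogate = (x.detach() * log_prob).mean()
grad, = torch.autograd.grad(surrogate, mu)
# the Monte Carlo estimate should land within a few standard errors of 1
assert abs(grad.item() - 1.0) < 0.02
```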