Closed: eb8680 closed this issue 6 years ago
@neerajprad we should probably revisit this now that we're adding more algorithms. Should we keep this issue open as a tracker or close it and open more specific ones? E.g. in #622 @ngoodman requested that we port some MCMC tests from webPPL.
> Should we keep this issue open as a tracker or close it and open more specific ones?
Most of the action items from V0 are done. We can close this and track the remaining task of refactoring integration tests / putting in new tests for MCMC algorithms in #634.
(This is a meta-issue for discussing the overall architecture and goals of the new test infrastructure, making a roadmap, and tracking progress on its constituent issues. I'll continue filling it in and adding other issues, but I'm posting it now so we can start discussing sooner. I'm also collecting all of the related issues in this GitHub Project.)
Architecture
After some reading and several discussions, especially with @ngoodman and @neerajprad, we settled on something like this (note that many of these pieces already exist in some form in the current tests but will need refactoring):
Deterministic tests
Pyro instrumentation
The Pyro-specific tests should sit on top of and reuse tools for monitoring, evaluating, criticizing, and visualizing Pyro models. We can discuss those tools and their interfaces in separate issues:
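For example, one basic building block could be trace-based inspection of model runs. Here is a minimal sketch using today's `pyro.poutine.trace` API as a stand-in for whatever interface we settle on; the toy model is purely illustrative:

```python
import torch
import pyro
import pyro.distributions as dist
from pyro import poutine


def model():
    # toy model: a latent location with one observation
    loc = pyro.sample("loc", dist.Normal(0., 1.))
    pyro.sample("obs", dist.Normal(loc, 1.), obs=torch.tensor(0.5))


# record one execution of the model and inspect its sample sites
trace = poutine.trace(model).get_trace()
print("joint log prob:", trace.log_prob_sum().item())
for name, site in trace.nodes.items():
    if site["type"] == "sample":
        print(name, "observed" if site["is_observed"] else "latent", site["value"])
```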
Stochastic test platform/library
See #101 and also this blog post. Basically, although we're currently most interested in testing our inference algorithms, down the line there will be many other stochastic programs we'll want to test hypotheses about. It seems like we should be able to build a lightweight platform that handles this more general use case.
This platform should be able to:
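To make that concrete, here is a minimal sketch of the kind of reusable helper such a platform could provide; `assert_mean_within` and its parameters are hypothetical names, not an existing Pyro API:

```python
import math
import random


def assert_mean_within(sampler, expected_mean, n=10_000, z=4.0):
    """Draw n samples from a zero-argument sampler and require the empirical
    mean to lie within z standard errors of expected_mean.  With z=4 the
    false-failure rate is roughly 6e-5 per assertion."""
    xs = [float(sampler()) for _ in range(n)]
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    stderr = math.sqrt(var / n)
    assert abs(mean - expected_mean) <= z * stderr, (
        f"empirical mean {mean:.4f} is not within {z} standard errors "
        f"({stderr:.4f}) of {expected_mean:.4f}")


# e.g. a fair coin flip has expected value 0.5
assert_mean_within(lambda: random.random() < 0.5, 0.5)
```

A fuller version would presumably add other statistical assertions (variance, distributional distance) and manage the overall false-failure rate across the whole suite.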
Stochastic unit tests
Pyro has many small stochastic components that should be tested individually:
- `map_data`: #93

Each of these requires a different set of bespoke tests implemented on top of the test platform.
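As one illustration of what a bespoke test could look like, here is a minimal sketch of a minibatch-scaling check, written against the current `pyro.plate` API (the successor to `map_data`) rather than `map_data` itself; the model and data are illustrative:

```python
import torch
import pyro
import pyro.distributions as dist
from pyro import poutine

data = torch.randn(100)


def model(idx):
    # observe a minibatch of the data under a fixed unit normal
    with pyro.plate("data", len(data), subsample=idx) as ind:
        pyro.sample("obs", dist.Normal(0., 1.), obs=data[ind])


idx = torch.tensor([0, 1, 2, 3, 4])
scaled = poutine.trace(model).get_trace(idx).log_prob_sum()
unscaled = dist.Normal(0., 1.).log_prob(data[idx]).sum()
# subsampling should rescale the minibatch log-likelihood by N / batch_size
assert torch.allclose(scaled, unscaled * (len(data) / len(idx)))
```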
Pyro stochastic integration tests
Most of the current Pyro tests are actually integration tests: they run an inference algorithm with a particular model and guide and compare empirical posterior statistics against ground truth to decide whether the test passes or fails. We want to make this more systematic and make the results less noisy and more useful. We also want to compare the runtime of some examples against previous versions to catch performance regressions.
To that end, integration tests should be generated automatically from the following components:
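For concreteness, a hand-written version of such an integration test (before any auto-generation machinery) might look like the sketch below, which fits a mean-field guide to a conjugate normal-normal model and compares the fitted posterior to the analytic one; the model, step count, and tolerances are illustrative, and a real test would replace the hard-coded tolerances with a statistical criterion:

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.optim import Adam
from torch.distributions import constraints

data = torch.tensor([0.2, 0.5, 0.8, 1.1])
# conjugate normal-normal: the posterior over loc is available in closed form
post_var = 1.0 / (1.0 + len(data))
post_mean = data.sum().item() * post_var


def model(data):
    loc = pyro.sample("loc", dist.Normal(0., 1.))
    with pyro.plate("data", len(data)):
        pyro.sample("obs", dist.Normal(loc, 1.), obs=data)


def guide(data):
    loc_q = pyro.param("loc_q", torch.tensor(0.))
    scale_q = pyro.param("scale_q", torch.tensor(1.), constraint=constraints.positive)
    pyro.sample("loc", dist.Normal(loc_q, scale_q))


pyro.clear_param_store()
svi = SVI(model, guide, Adam({"lr": 0.05}), loss=Trace_ELBO())
for _ in range(2000):
    svi.step(data)

# pass/fail decision: compare fitted posterior statistics to the analytic ones
assert abs(pyro.param("loc_q").item() - post_mean) < 0.1
assert abs(pyro.param("scale_q").item() - post_var ** 0.5) < 0.1
```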
V0
There's a ton of work to do here, but fortunately we're not that far away from a minimal working prototype of the whole thing that should solve a lot of our immediate problems.
Basically, I think we need to:
- Write stochastic unit tests for `map_data` and gradient estimators that directly test for correct behavior of those features
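As a sketch of what a direct gradient-estimator check could look like (written here against plain PyTorch rather than Pyro's internals), the snippet below verifies that a score-function surrogate recovers a known analytic gradient:

```python
import torch

# Score-function (REINFORCE) estimate of d/dmu E_{x ~ N(mu, 1)}[x]; the true gradient is 1.
mu = torch.tensor(0.3, requires_grad=True)
n = 200_000
x = torch.distributions.Normal(mu, 1.0).sample((n,))
log_prob = torch.distributions.Normal(mu, 1.0).log_prob(x)
# surrogate objective whose gradient is the score-function estimator
surrogate = (x.detach() * log_prob).mean()
grad, = torch.autograd.grad(surrogate, mu)
# the Monte Carlo estimate should land within a few standard errors of 1
assert abs(grad.item() - 1.0) < 0.02
```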