stan-dev / stan

Stan development repository. The master branch contains the current release. The develop branch contains the latest stable development. See the Developer Process Wiki for details.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License

sampler testing framework #318

Open bob-carpenter opened 10 years ago

bob-carpenter commented 10 years ago

We can't get new patches into samplers because there aren't any reliable tests.

We need tests for the samplers for

• accuracy on means
• accuracy on variances
• speed regression tests

We also want to test things that Michael has suggested for HMC like

• step size * 2 ^ tree_depth is in a range --- how often and what range?

We have to make all these sensitive to the fact that we have MCMC.

betanalpha commented 10 years ago

We can't get new patches into samplers because there aren't any reliable tests.

We need tests for the samplers for

• accuracy on means
• accuracy on variances
• speed regression tests

We also want to test things that Michael has suggested for HMC like

• step size * 2 ^ tree_depth is in a range --- how often and what range?

We have to make all these sensitive to the fact that we have MCMC.
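For concreteness, the step size * 2 ^ tree_depth check might look like the following sketch. The range, required fraction, and function name are placeholders, not values settled in this thread:

```python
def integration_time_in_range(stepsizes, treedepths, lo, hi, min_frac=0.9):
    """Check that stepsize * 2**treedepth (roughly the total integration
    time per NUTS iteration) stays inside [lo, hi] for most iterations.

    How often and what range are the open questions above; the defaults
    here are illustrative only.
    """
    products = [eps * 2 ** d for eps, d in zip(stepsizes, treedepths)]
    in_range = sum(lo <= p <= hi for p in products)
    return in_range / len(products) >= min_frac

# Toy diagnostics: step size ~0.1, tree depths 3-5, so products in [0.8, 3.2].
stable = integration_time_in_range(
    [0.1] * 100, [3, 4, 5] * 33 + [4], lo=0.5, hi=5.0)
print(stable)
```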

We have to be careful because, by construction, MCMC is stochastic and not exactly amenable to unit tests as they are usually defined.

Mean/variance estimation:

Assuming a Monte Carlo CLT we'll still have to worry about the expected randomness. Running an ensemble of tests and only requiring the expected number pass would help, but also make the tests much more demanding.
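That ensemble idea can be sketched as follows, assuming each stochastic test reports a p-value; the alpha level and the slack in standard deviations are placeholders:

```python
import math

def ensemble_passes(p_values, alpha=0.01, max_fail_sd=3.0):
    """Control the overall false positive rate across many stochastic tests.

    Rather than requiring every test to pass, allow roughly alpha * K
    failures (the expected number under the null), plus a slack of a few
    binomial standard deviations.
    """
    k = len(p_values)
    failures = sum(p < alpha for p in p_values)
    expected = alpha * k
    sd = math.sqrt(k * alpha * (1 - alpha))
    return failures <= expected + max_fail_sd * sd

# 100 toy p-values with a single sub-alpha result: within expectation.
passes = ensemble_passes([0.5, 0.03, 0.2, 0.9, 0.004] + [0.3] * 95)
print(passes)
```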

That said, iid gaussian and a correlated gaussian are natural first tests.
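A first cut at the mean/variance checks on an iid gaussian could rely on the CLT with a deliberately liberal threshold. The function name and thresholds are illustrative; for real MCMC output, n would have to be the effective sample size rather than the raw draw count:

```python
import math
import random
import statistics

def check_mean_and_variance(draws, true_mean, true_var, z_crit=5.0):
    """Flag a sampler only if estimates fall far outside CLT error bars.

    Assumes effectively independent draws; for correlated MCMC output,
    replace n with the effective sample size before computing the
    standard errors.
    """
    n = len(draws)
    est_mean = statistics.fmean(draws)
    est_var = statistics.variance(draws)
    # Monte Carlo standard error of the mean under the CLT.
    se_mean = math.sqrt(true_var / n)
    # Approximate standard error of the sample variance for a gaussian.
    se_var = true_var * math.sqrt(2.0 / (n - 1))
    mean_ok = abs(est_mean - true_mean) < z_crit * se_mean
    var_ok = abs(est_var - true_var) < z_crit * se_var
    return mean_ok and var_ok

# Stand-in for sampler output: exact iid draws from the target.
random.seed(318)
draws = [random.gauss(0.0, 1.0) for _ in range(10_000)]
accurate = check_mean_and_variance(draws, true_mean=0.0, true_var=1.0)
print(accurate)
```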

Adaptation:

Some distributions undercut the usual optimization criteria that we use for adaptation. Hierarchical models like the funnel are a big example that we might want to test.

The interaction between the distributions and adaptation would require sampler-specific tests rather than generic ones. There are some exceptions -- the gaussians mentioned above are "linear" and about as easy to adapt to as possible.

Speed regression tests:

Depends on the machine running the tests, so we can't just define fixed thresholds. Is it possible to build up the testing framework to run examples using two different tags for comparison?
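One way the two-tag comparison could sidestep machine-dependent thresholds is to compare a timing ratio rather than absolute times. A sketch, where the callables stand in for binaries built from the two tags (all names here are hypothetical):

```python
import time

def relative_slowdown(run_old, run_new, reps=5):
    """Time the same workload built from two git tags and return the ratio.

    run_old / run_new are hypothetical callables wrapping the two builds.
    A ratio on the same machine avoids absolute thresholds, and taking
    the minimum over reps damps scheduler noise.
    """
    def best(f):
        times = []
        for _ in range(reps):
            t0 = time.perf_counter()
            f()
            times.append(time.perf_counter() - t0)
        return min(times)
    return best(run_new) / best(run_old)

# Toy stand-in workload for the two tagged builds.
def work():
    return sum(i * i for i in range(200_000))

ratio = relative_slowdown(work, work)
print(ratio < 1.5)  # flag only clear regressions
```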

bob-carpenter commented 10 years ago

On 10/23/13 4:45 PM, Michael Betancourt wrote:

We can't get new patches into samplers because there aren't any reliable tests.

We need tests for the samplers for

• accuracy on means
• accuracy on variances
• speed regression tests

We also want to test things that Michael has suggested for HMC like

• step size * 2 ^ tree_depth is in a range --- how often and what range?

We have to make all these sensitive to the fact that we have MCMC.

We have to be careful because, by construction, MCMC is stochastic and not exactly amenable to unit tests as they are usually defined.

Right. That's why, for example, the RNG tests that Peter wrote do a very large number of samples and then use a very liberal threshold for a chi-square test. We have a classical multiple testing problem where we want to control the false positive rate.

This is similar to what Andrew calls the "Cook-Gelman-Rubin" approach.
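The Cook-Gelman-Rubin approach can be sketched on a toy conjugate model: draw the true parameter from the prior, simulate data, and check that the posterior quantile of the truth is uniform across replications. The model and all names below are illustrative, not from the thread; exact posterior draws stand in for sampler output:

```python
import math
import random

def cgr_quantiles(n_reps=200, n_draws=100, seed=1):
    """Cook-Gelman-Rubin style check on a conjugate normal model.

    theta ~ N(0, 1), y | theta ~ N(theta, 1), so the exact posterior is
    N(y/2, 1/2).  If the 'sampler' (here, exact posterior draws) is
    correct, the posterior quantile of the true theta is uniform across
    replications.
    """
    rng = random.Random(seed)
    quantiles = []
    for _ in range(n_reps):
        theta = rng.gauss(0.0, 1.0)
        y = rng.gauss(theta, 1.0)
        post = [rng.gauss(y / 2.0, math.sqrt(0.5)) for _ in range(n_draws)]
        quantiles.append(sum(d < theta for d in post) / n_draws)
    return quantiles

qs = cgr_quantiles()
# A crude uniformity check: the quantiles' mean should be near 0.5.
mean_q = sum(qs) / len(qs)
print(abs(mean_q - 0.5) < 0.1)
```

A real test would replace the exact posterior draws with the sampler under test and apply a proper uniformity test (e.g. chi-square on binned quantiles) instead of the crude mean check.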

Mean/variance estimation:

Assuming a Monte Carlo CLT we'll still have to worry about the expected randomness. Running an ensemble of tests and only requiring the expected number pass would help, but also make the tests much more demanding.

Right. That's what we're doing for the RNGs, but those are much simpler to run multiple times.

That said, iid gaussian and a correlated gaussian are natural first tests.

We mostly want to have tests in place to make sure we didn't mess anything up badly. Finer-grained performance testing can't be part of our "unit testing" framework. (Though I do believe Jenkins currently reports total time for all the tests in a browsable way, not that I've ever browsed it.)

Adaptation:

Some distributions undercut the usual optimization criteria that we use for adaptation. Hierarchical models like the funnel are a big example that we might want to test.

The interaction between the distributions and adaptation would require sampler-specific tests rather than generic ones. There are some exceptions -- the gaussians mentioned above are "linear" and about as easy to adapt to as possible.

We already have tests that vary configuration (e.g., the number of iterations) for different models.

Speed regression tests:

Depends on the machine running the tests, so we can't just define fixed thresholds. Is it possible to build up the testing framework to run examples using two different tags for comparison?

I don't see why not. Daniel's a wizard with Jenkins.

For the foreseeable future, the machine running the tests will be the Jenkins Windows box. Our latest grant proposal applied for some more hardware for ongoing testing.

And we can test on our own machines.

betanalpha commented 10 years ago

It's not a matter of varying the parameters but figuring out how they need to be varied. Just warning that because of these interactions it will be hard to have generic "sampler" tests instead of individually-tuned tests for each sampler.

On Oct 23, 2013, at 10:22 PM, Bob Carpenter notifications@github.com wrote:

We already have tests that vary configuration (e.g., the number of iterations) for different models.

syclik commented 10 years ago

I'm ok with individually-tuned tests for each sampler.

betanalpha commented 9 years ago

Testing framework proposed in https://github.com/stan-dev/stan/tree/feature/stat_valid_test -- currently needs to be updated so that the tests can be run without depending on CmdStan.

syclik commented 7 years ago

@bob-carpenter, this is what we were talking about doing. This will depend on #1751, so I'll branch from there as I start working.