tweag / monad-bayes

A library for probabilistic programming in Haskell.

Test some benchmark configurations against fixtures #256

Closed · turion closed this 1 year ago

turion commented 1 year ago

@ohad suggested testing inference algorithms against the samples they produce, in https://github.com/tweag/monad-bayes/issues/245#issuecomment-1422385283. I think this is a good idea independently of the issue discussed there, so I implemented fixture tests for the three algorithms and the three models implemented in the benchmarks.
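For concreteness, such a fixture test might take roughly the following shape. This is a minimal sketch, not the PR's actual code: the toy model, the fixture path, and the use of hspec are assumptions, and names vary across monad-bayes versions (e.g. Control.Monad.Bayes.Sampler.Strict vs. the older Control.Monad.Bayes.Sampler, MonadDistribution vs. the older MonadSample).

```haskell
import Control.Monad.Bayes.Class (MonadDistribution, normal)
import Control.Monad.Bayes.Sampler.Strict (sampleIOfixed)
import Control.Monad.Bayes.Traced (mh)
import System.Directory (doesFileExist)
import Test.Hspec

-- A toy model standing in for the benchmark models.
model :: MonadDistribution m => m Double
model = do
  x <- normal 0 1
  y <- normal x 1
  pure (x + y)

-- Golden-style fixture test: generate the fixture if it is missing,
-- otherwise require exact agreement with the stored samples.
fixtureSpec :: Spec
fixtureSpec = it "MH on the toy model matches its fixture" $ do
  samples <- sampleIOfixed (mh 100 model)
  let path = "test/fixtures/mh-toy-model.txt" -- hypothetical path
      rendered = show samples
  exists <- doesFileExist path
  if exists
    then do
      expected <- readFile path
      rendered `shouldBe` expected
    else writeFile path rendered
```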

turion commented 1 year ago

Are all inference algorithms covered by such a test now, and with sufficiently complex models?

reubenharry commented 1 year ago

To make sure I understand (I'm probably confused): is the idea to ensure that the code always produces the same samples given the same random seed? If so, is this a property we always want? For example, what if we change the rng? Presumably the correctness of monad-bayes does not hinge on using the Mersenne twister rng (and indeed, Dominic and I made monad-bayes parametric on the choice of rng) but would this cause the tests to fail?

On this front, note that to be really safe, we might also want to check that the version of monad-bayes from before Spring 2022 (which we can be reasonably confident was in-sync with the proven correct implementation of the paper) agrees on these tests with the current version. I think I avoided making semantically impactful changes, but it's always subtle.

turion commented 1 year ago

> To make sure I understand (I'm probably confused): is the idea to ensure that the code always produces the same samples given the same random seed? If so, is this a property we always want?

Yes, for changes that are not supposed to change the semantics.

> For example, what if we change the rng?

Then we should isolate this change in a single commit and update the fixtures there.

> Presumably the correctness of monad-bayes does not hinge on using the Mersenne twister rng (and indeed, Dominic and I made monad-bayes parametric on the choice of rng) but would this cause the tests to fail?

Yes, it might, as would changing the random seed hardcoded in sampleIOFixed. But that's OK: running the tests updates the fixtures.
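A small sketch of the determinism being relied on here, assuming sampleIO and sampleIOfixed from monad-bayes (the module has been called Control.Monad.Bayes.Sampler and Control.Monad.Bayes.Sampler.Strict in different versions):

```haskell
import Control.Monad (replicateM)
import Control.Monad.Bayes.Class (normal)
import Control.Monad.Bayes.Sampler.Strict (sampleIO, sampleIOfixed)

main :: IO ()
main = do
  -- sampleIOfixed uses a hardcoded seed, so repeated runs agree exactly;
  -- this is what makes fixture tests possible at all.
  a <- sampleIOfixed (replicateM 5 (normal 0 1))
  b <- sampleIOfixed (replicateM 5 (normal 0 1))
  print (a == b) -- True: same seed, same samples
  -- sampleIO seeds from the system, so runs differ.
  c <- sampleIO (replicateM 5 (normal 0 1))
  d <- sampleIO (replicateM 5 (normal 0 1))
  print (c == d) -- almost surely False
```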

> On this front, note that to be really safe, we might also want to check that the version of monad-bayes from before Spring 2022 (which we can be reasonably confident was in-sync with the proven correct implementation of the paper) agrees on these tests with the current version. I think I avoided making semantically impactful changes, but it's always subtle.

OK, I can try that, if it's possible to backport the tests to that version.

turion commented 1 year ago

The tests are failing on darwin, apparently because the precision of the floating-point numbers used there is ever so slightly different. That's really annoying. Some options:
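For illustration, one option in this direction (a sketch only; the thread below settles on skipping Darwin instead, and approxEq and fixturesAgree are hypothetical helpers, not from the PR) would be to compare fixtures up to a floating-point tolerance rather than exactly:

```haskell
-- Hypothetical helpers for tolerance-based fixture comparison.
approxEq :: Double -> Double -> Double -> Bool
approxEq eps x y = abs (x - y) <= eps * max 1 (max (abs x) (abs y))

fixturesAgree :: Double -> [Double] -> [Double] -> Bool
fixturesAgree eps expected actual =
  length expected == length actual
    && and (zipWith (approxEq eps) expected actual)
```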

reubenharry commented 1 year ago

My feeling would be that for now it's worth just ignoring Darwin and merging, especially since #191 risks changing the fixtures.

turion commented 1 year ago

OK, I've tried disabling the tests on macOS; let's see whether that works.
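For illustration, one way to express such a platform gate in the test suite itself (a sketch; the PR may have done this in CI configuration instead; System.Info.os from base reports "darwin" on macOS):

```haskell
import Control.Monad (unless)
import System.Info (os)
import Test.Hspec

main :: IO ()
main = hspec $
  -- Skip the floating-point-sensitive fixture tests on macOS.
  unless (os == "darwin") $
    describe "fixtures" fixtureSpec -- the fixture tests from the earlier sketch
```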

turion commented 1 year ago

> My feeling would be that for now it's worth just ignoring Darwin and merging,

I'll interpret this as approval ;)

turion commented 1 year ago

@reubenharry can you review?

turion commented 1 year ago

@reubenharry @idontgetoutmuch can you review this PR?