Repeated multiprocess simulate produces the same answers every time

axch commented 7 years ago

Test program:

import struct
import bayeslite

seed = struct.pack('<QQQQ', 0, 0, 0, 2)
bdb = bayeslite.bayesdb_open("foo.bdb", seed=seed)

bdb.sql_execute("CREATE TABLE IF NOT EXISTS test_t (x)")
bdb.sql_execute("INSERT INTO test_t VALUES (1)")
bdb.execute("CREATE POPULATION test_p FOR test_t WITH SCHEMA (MODEL x AS NUMERICAL)")
bdb.execute("CREATE METAMODEL test_m FOR test_p WITH BASELINE crosscat")
bdb.execute("INITIALIZE 1 MODELS FOR test_m")

for _ in range(10):
    print bdb.execute("SIMULATE x FROM test_p GIVEN rowid = 1 LIMIT 3").fetchall()

Doing this produces output like this:

[(3.0227014789924476,), (-0.5771918878776363,), (-0.059509389533395185,)]
[(3.0227014789924476,), (-0.059509389533395185,), (-0.5771918878776363,)]
[(3.0227014789924476,), (-0.5771918878776363,), (-0.059509389533395185,)]
[(-0.059509389533395185,), (-0.5771918878776363,), (3.0227014789924476,)]
[(-0.059509389533395185,), (3.0227014789924476,), (-0.5771918878776363,)]
[(-0.059509389533395185,), (-0.5771918878776363,), (3.0227014789924476,)]
[(-0.059509389533395185,), (3.0227014789924476,), (-0.5771918878776363,)]
[(3.0227014789924476,), (-0.059509389533395185,), (-0.5771918878776363,)]
[(-0.5771918878776363,), (3.0227014789924476,), (-0.059509389533395185,)]
[(-0.059509389533395185,), (3.0227014789924476,), (-0.5771918878776363,)]

Note that, while the order of the outputs varies, the actual numbers present are always the same. I would expect repeated simulate queries to produce different answers.

Useful further information:

Changing the initial seed changes what the numbers are, but does not change the pattern that they are always the same
If one gets rid of the default metamodels and instead installs a non-multiprocess version of CGPM, the simulation results become different from one another, so that constitutes a workaround
By adding debug prints, I narrowed the problem down to cgpm.crosscat.engine.Engine.simulate: the _set_seeds() method appears to produce different seeds on each invocation, as would be expected, but then the returned samples are still all the same (and, at this point, in a consistent order, so their reordering must occur elsewhere).

fsaad commented 7 years ago

I'll bet if you turn off multiprocessing with bdb.metamodels['cgpm'].set_multiprocess(False) the samples will no longer be identical.

Cause of the isseu: _set_seeds sets the seed of each crosscat.State in the Engine, but the Dim and View objects inside the state do not have their rngs updated recursively (a bug). When calling engine.simulate with multiprocess, the rngs of the crosscat.State objects in the child process will have their rngs modified, but this rng is not pushed back back up to the master process, only the samples themselves are returned (which was the original intention of doing the expensive set_seeds operation, so that each crosscat.State has a fresh rng on calling simulate).

Solution 1: Rather than update the state.rng attribute here call a function state.set_rng which will take care to propogate the rng through all the nested cgpms that comprise the state.

Solution 2: Make engine._evaluate return a tuple of the answer plus the new rng, to avoid creating rngs in the Engine over and over.

axch commented 7 years ago

Stupid question: Why do the Dim and View objects store their own rngs at all?

axch commented 7 years ago

Separately from that, solution 3 would be to refactor State, Dim, and View to each store not an rng, but a reference to a mutable box containing an rng, such that they are all shared and can be efficiently mutated in O(1) by _set_seeds. Down side: shared mutable state.

fsaad commented 7 years ago

Stupid question: Why do the Dim and View objects store their own rngs at all?

It's not a stupid question. Basically Dim and View objects may (and often do) exist in isolation, without the assumption that Dim is embedded in View, and View is embedded in State (which holds the rng). I had two choices when deciding how to supply entropy to Dim/View (and even State); one option was to give each method which needs randomness an explicit rng or seed parameter, and the second option was to supply an rng object in __init__. I selected latter option, since it made the interface easier and less expectations on the client to give an rng on a per-call basis.

axch commented 7 years ago

Interesting. I regularly find myself making the opposite choice for the same reason -- including the rng in the interface makes it clear which methods might be random, and eliminates a piece of mutable state. Which mutable state is the cause of the present difficulty.

fsaad commented 7 years ago

I agree, and I probably would have done the same as your choice implementing rng as a parameter had I been making this choice a second time round.

fsaad commented 7 years ago

A third option is to provide each CGPM with a reference to an rng e.g. in a box.

riastradh-probcomp commented 7 years ago

I highly recommend passing independent seeds, rather than RNG boxes, into the interfaces in question. No need to coordinate state (which may be tricky if subprocesses are involved!); subtrees can be recomputed independently; simpler interface itself, at the cost of a modicum of extra work at either side of the interface to draw a seed and seed an RNG.

fsaad commented 7 years ago

Alright, how should the function use the seed? Certainly it should not create a new rng object using that seed? I guess it would set the seed of an embedded rng using

https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.RandomState.seed.html#numpy.random.RandomState.seed

Is setting the seed an expensive operation?

riastradh-probcomp commented 7 years ago

In a good RNG like https://mumble.net/~campbell/python/weakprng.py (which we use in some places) setting the seed is practically free.

In numpy.random it's not free, but unless it's observed to be a bottleneck we just do that anyway. (Mostly it has not been a bottleneck, but there were a couple cases where we just deferred creating the RNG until we proved we needed it, and that made the bottleneck go away.) Can't imagine it costs anywhere near what fork does anyway, if you're doing it across a process boundary.

fsaad commented 7 years ago

The fix for this issue #217 is therefore to change https://github.com/probcomp/cgpm/blob/20170124-fsaad-engine-seed/src/crosscat/engine.py#L243 to state.rng.seed(n), which will propagate down to all sub-cgpms recursively without any additional work since the rng is a shared object and essentially a box itself.

A separate ticket should be filed for refactoring the interface to accept seed arguments

probcomp / cgpm

Repeated multiprocess simulate produces the same answers every time #217