quaquel / EMAworkbench

workbench for performing exploratory modeling and analysis
BSD 3-Clause "New" or "Revised" License
126 stars 91 forks source link

Add control for random state #62

Open jpn-- opened 5 years ago

jpn-- commented 5 years ago

Are there any plans to include "random state" interface for any of the sampling functions (i.e. to guarantee stable output for testing)? Or: was this considered and explicitly not done for some reason?

If this isn't contra-indicated, would a pull request that implements this only partially be of interest? There's a lot of places where random state might be useful, and I don't have time to search through the whole codebase and find/implement them all, but I'd be willing to do so for a handful of "low hanging fruit" places that would be useful for the project I am working on. (I am going to do this anyhow, just want to know if I should isolate this as an independent PR)

quaquel commented 5 years ago

can't this be achieved by simply setting the numpy random seed globally?

I agree it would be a useful features

quaquel commented 5 years ago

So, I quickly checked the code. Sampling relies entirely on scipy.stats or numpy.random. Scipy.stats uses the global numpy random state, so setting the global numpy random state should be sufficient for reproducibility. Will try to test this Thursday.

Or do you want to control the RandomState of individual uncertain variables?

jpn-- commented 5 years ago

I was thinking for thread-safe operations, and to potentially regenerate individual random variables at a later time or on a different machine. This is pretty common for some of the other large scale simulators I work on, as there's a desire to have reproducible results not just in unit tests but in practical applications as well.

quaquel commented 5 years ago

I completely agree with the use case.

Thread safety is no concern for sampling: it only happens in the main thread. Farming out to other processes / threads happens after generating the experiments.

So the real question is whether having a global solution (probably via setting the numpy RandomState) is sufficient or whether you want to control individual distributions. The latter would require a modification to the Abstract Parameter class (init, and from_dist).

quaquel commented 5 years ago

So I ran a quick test using the inter temporal version of the lake problem and setting the global numpy random state. This works: it produces the exact same results. See the attached pdf for a quick proof of principle: seed_test.pdf

Is this sufficient for your use case or not?

jpn-- commented 5 years ago

I've identified my real problem after pulling through the depths of the code -- and it turns out my real problems are with Platypus and not workbench. I don't have any desire to put in the effort to change that code as well, so I'm going to call this solution is good enough.

For posterity should any other users need to address this issue in the future:

At least part of the issue is that platypus uses the Python standard library random number generator, while the workbench uses the numpy random number generator. If you want any hope of generating more or less reproducible results when the workbench reaches into Platypus, you need to initialize both global random number generators:

import numpy.random
import random
numpy.random.seed(42)
random.seed(42)
quaquel commented 5 years ago

ah, had I know you were using platypus I would have been able to tell you this right away.

We might still consider having a set_random_state function on the workbench which sets both the numpy random state, the random state, and ensures that this is properly propagated to any subprocesses that are started.

Would that be a useful feature?

jpn-- commented 5 years ago

Modestly useful, not a high priority though.

quaquel commented 2 years ago

Started looking into this. This discussion on StackOverflow is quite useful. It seems the preferred approach is

from numpy.random import Generator, PCG64

# Case 3 (IMP) : Scipy uses an existing Random Generator which can being passed to Scipy based 
# random generator object
numpy_randomGen = Generator(PCG64(seed))
scipy_randomGen.random_state=numpy_randomGen
print(scipy_randomGen.rvs(n, p, size))
print(numpy_randomGen.binomial(n, p, size))
# prints
# [4 4 6 6 5 4 5 4 6 7]
# [4 8 6 3 5 7 6 4 6 4]
# This should be the case which we mostly want (DESIRABLE). If we are using both Numpy based and 
#Scipy based random number generators/function, then not only do we have no repetition of 
#random number sequences but also have reproducibility of results in this case.

scipy.stats is being used in samplers.py, and plotting_util.py numpy.random is being used in prim.py, and samplers.py random is being used in optimization.py, evaluators.py, and points.py