ejhumphrey opened this issue 6 years ago
That's an interesting problem. I don't think the streamer or mux objects need to get involved though.
I think it can be handled effectively by passing a random_state object to the streamer (by reference), rather than a seed? That way, the state object persists across instantiations of the streamer, but the overall sequence is still reproducible. Something like:
streams = [pescador.Streamer(my_sampler, np.random.RandomState(my_seed), other_params) for other_params in ...]
You just need to write my_sampler so that it can accept RandomState objects directly. A quick spot check verifies that RandomStates can be serialized, so I don't see any problems down the line.
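For concreteness, a minimal sketch of what such a sampler could look like (the body and the `X` key are illustrative assumptions, not the actual sampler from this thread):

```python
import numpy as np

# Hypothetical sampler: accepts a RandomState rather than a seed, so
# consecutive draws advance the caller's shared generator state.
def my_sampler(rng, scale=1.0):
    while True:
        # draw from the caller-supplied generator
        yield dict(X=scale * rng.randn())
```

The point of the pattern is that the caller owns the generator, so two samplers built from the same master seed draw a single reproducible sequence between them.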
so I tried this (or tried to try it), and I believe that it doesn't do the trick. I believe that I observed the same RandomState object being passed to each Sampler when it gets activated; in other words, resetting the generator also resets the RandomState object, as I don't think it is stateful across activations.
I'll give it a looksee a little later tho to be sure, about to drop into a meeting.
believe that I observed the same RandomState object being passed to each Sampler when it gets activated; in other words, resetting the generator also resets the RandomState object, as I don't think it is stateful across activations.
The object should be the same (hash match), but its internal state should persist and advance across activations.
should, yes, but I don't think that's the behavior I observed. There was a bunch of code around that check, though, so I want to isolate that exactly and know for sure.
Also I was flying back from Europe in some kind of delirious fugue state.
Confirmed:
In [29]: def my_gen(state):
...: for i in range(5):
...: yield state.randn()
...:
In [30]: s = np.random.RandomState(1)
In [31]: streamer = pescador.Streamer(my_gen, s)
In [32]: mux = pescador.ChainMux([streamer], mode='cycle')
In [33]: list(mux.iterate(max_iter=15))
Out[33]:
[1.6243453636632417,
-0.6117564136500754,
-0.5281717522634557,
-1.0729686221561705,
0.8654076293246785,
1.6243453636632417,
-0.6117564136500754,
-0.5281717522634557,
-1.0729686221561705,
0.8654076293246785,
1.6243453636632417,
-0.6117564136500754,
-0.5281717522634557,
-1.0729686221561705,
0.8654076293246785]
Note that the random sequence loops with period 5. Repeatedly calling the iterator with the same state object does not do this:
In [34]: list(my_gen(s))
Out[34]:
[1.6243453636632417,
-0.6117564136500754,
-0.5281717522634557,
-1.0729686221561705,
0.8654076293246785]
In [35]: list(my_gen(s))
Out[35]:
[-2.3015386968802827,
1.74481176421648,
-0.7612069008951028,
0.31903909605709857,
-0.2493703754774101]
In [36]: list(my_gen(s))
Out[36]:
[1.462107937044974,
-2.060140709497654,
-0.3224172040135075,
-0.38405435466841564,
1.1337694423354374]
In [37]: list(my_gen(s))
Out[37]:
[-1.0998912673140309,
-0.17242820755043575,
-0.8778584179213718,
0.04221374671559283,
0.5828152137158222]
In hindsight, this is obviously an artifact of stream activation invoking a deep copy.
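The deep-copy effect can be demonstrated in isolation, outside of pescador entirely:

```python
import copy
import numpy as np

s = np.random.RandomState(1)
s2 = copy.deepcopy(s)

# The copy carries its own snapshot of the internal state, so the
# original and the copy independently replay the same sequence.
a = s.randn()
b = s2.randn()
assert a == b  # both produce the same first draw
```

This is exactly what activation does to a streamer's RandomState argument: every activation starts from the same copied snapshot.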
@cjacoby any ideas for workarounds? One brutal hack would be to make a custom RandomState object that overrides __deepcopy__ to revert to shallow behavior. Is there something more elegant we could do?
Update: clobbering deepcopy does the trick:
In [2]: class CRandomState(np.random.RandomState):
...: def __deepcopy__(self, memodict={}):
...: return self
...:
In [3]: def my_gen(state):
...: for i in range(5):
...: yield state.randn()
...:
In [4]: s = CRandomState(1)
In [7]: streamer = pescador.Streamer(my_gen, s)
In [8]: mux = pescador.ChainMux([streamer], mode='cycle')
In [9]: list(mux.iterate(max_iter=25))
Out[9]:
[1.6243453636632417,
-0.6117564136500754,
-0.5281717522634557,
-1.0729686221561705,
0.8654076293246785,
-2.3015386968802827,
1.74481176421648,
-0.7612069008951028,
0.31903909605709857,
-0.2493703754774101,
1.462107937044974,
-2.060140709497654,
-0.3224172040135075,
-0.38405435466841564,
1.1337694423354374,
-1.0998912673140309,
-0.17242820755043575,
-0.8778584179213718,
0.04221374671559283,
0.5828152137158222,
-1.1006191772129212,
1.1447237098396141,
0.9015907205927955,
0.5024943389018682,
0.9008559492644118]
ah, great catch! so it seems like there are two options: (1) change how pescador handles copying internally, or (2) ship a deepcopy-safe RandomState subclass like the one above.
(2) is definitely easier, and could be provided as the Best Practice way of seeding generators wrapped as streamers, though I worry about user error – not realizing what effects doing the "wrong" thing will have: using NumPy's built-in RandomState will functionally work but could statistically wreak havoc.
(1) sounds ... bad.
(2) sounds like a good solution to me.
I think the best we can do is provide ample warnings in the documentation and a gallery example of how to do it the right way.
agreed! while we're at it, does the clobbered RandomState need to do anything else beyond breaking deepcopy?
... nothing immediately comes to mind for me.
I don't believe so. Lemme ping the scipy folks to see if they have any suggestions / see any problems with this approach.
Trying to think this through; what are the use cases here? Do we always want random states to be global to the parent/factory streamer? Or only in the case where you passed it a RandomState object / random seed?
I already had to do a little hacky thing to handle the "global" random state in mux.py::Mux / mux.py::BaseMux. My hunch (without checking) is that because of the way I handled that, it will actually do the right thing in Mux for the "global RandomState" case, but it will always reset to the "Factory" copy of the RandomState if you passed in the random state.
Do we always want "child copy" streamers to behave this way? Or is there a situation where you would want it to restart?
If the answer is "random seed should always just pass by reference to copies", then I might recommend just migrating the code block below (this is BaseMux::__deepcopy__) into pescador.Streamer, such that Streamers (and therefore Mux) just always know that RandomStates should be passed through, not copied:
def __deepcopy__(self, memo):
    """This override is required to handle copying the random_state:
    when using `random_state=None`, the global state is used;
    modules are not 'deepcopy-able', so we have to make a special case
    for it.
    """
    cls = self.__class__
    copy_result = cls.__new__(cls)
    memo[id(self)] = copy_result
    for k, v in six.iteritems(self.__dict__):
        # You can't deepcopy a module! If rng is np.random, just pass
        # it over without trying.
        if k == 'rng' and v == np.random:
            setattr(copy_result, k, v)
        # In all other cases, assume a deepcopy is the right choice.
        else:
            setattr(copy_result, k, copy.deepcopy(v, memo))
    return copy_result
but change the loop to be:
for k, v in six.iteritems(self.__dict__):
    # If rng is np.random, or an np.random.RandomState, do a shallow
    # copy (by reference) so that the RandomState is maintained.
    if v == np.random or isinstance(v, np.random.RandomState):
        setattr(copy_result, k, v)
    # In all other cases, assume a deepcopy is the right choice.
    else:
        setattr(copy_result, k, copy.deepcopy(v, memo))
(I'm happy to make this change if y'all ok that as a good plan)
I think it would be a lot clearer if we just subclassed the randomstate to do the above behavior. I could imagine use-cases where you want a random (but repeatable) sequence, and your proposed hack to not deepcopy any randomstate objects would make that impossible.
OTOH, providing an explicitly persistent randomstate object makes the intent clear, but still allows for the current behavior if that's preferable.
Ok :+1:
circling back to a conversation IRL, consensus is that we should introduce a PescadorSeed (or whatever) object that holds onto a np.random.RandomState object, rather than subclassing it directly.
I'll play with this a bit and propose something concrete when I've got it working.
ah! I've found my root problem (related to #132) – I'm mux'ing muxes, and that's really screwing some stuff up under the hood. When this is implemented, we'll need to add an additional type-check to this logic block.
I might just be getting a little fried at the end of a work day, but I don't immediately see a good way to maintain a polymorphic RandomState interface if we're not going to subclass it. My preliminary hackjob (and test) is the following...
class StreamSeeder(object):

    def __init__(self, seed):
        self.rng = np.random.RandomState(seed)

    @property
    def state(self):
        return self.rng.get_state()

    @state.setter
    def state(self, rng_state):
        self.rng.set_state(rng_state)

    @classmethod
    def from_state(cls, rng_state):
        obj = cls(None)
        obj.state = rng_state
        return obj

    def __deepcopy__(self, memodict={}):
        return StreamSeeder.from_state(self.state)
Then it at least basically behaves the way you'd want.
def test_StreamSeeder():
    seeder = StreamSeeder(123)
    values_a = [seeder.rng.randint(0, 2**16) for _ in range(200)]

    seeder = StreamSeeder(123)
    values_b = []
    for n in range(5):
        values_b += [seeder.rng.randint(0, 2**16) for _ in range(40)]
        seeder = copy.deepcopy(seeder)

    assert values_a == values_b
I suppose at least inside a Mux, we might be able to get away with:

    ...
    elif isinstance(random_state, StreamSeeder):
        self.rng = random_state.rng

which ... I think does the right thing. Calls to self.rng would still modify the root random_state.rng object, since it's all just identifier passing.
any red flags?
which ... I think does the right thing. Calls to self.rng would still modify the root random_state.rng object, since it's all just identifier passing.
any red flags?
This will still fail when you activate the mux, because self.rng will deepcopy on activation.
I think we should either subclass the randomstate object (probably a bad idea overall), or rewrite our randomized code to always use the protected randomstate objects where appropriate.
If we want to get fancy, we can have the protected randomstate object pass getattr calls through to the underlying randomstate. That might be a bad idea, but it would simplify syntax throughout.
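A getattr pass-through along those lines might look like this (the class name and API here are hypothetical, not pescador's actual internals):

```python
import copy
import numpy as np

class ProtectedRandomState(object):
    """Hypothetical wrapper: survives deepcopy by reference and
    forwards attribute access to the underlying RandomState."""

    def __init__(self, seed=None):
        self._rng = np.random.RandomState(seed)

    def __getattr__(self, name):
        # Forward randn, randint, shuffle, etc. to the wrapped object.
        # __getattr__ only fires when normal lookup fails, so _rng
        # itself (set in __init__) resolves normally.
        return getattr(self._rng, name)

    def __deepcopy__(self, memodict={}):
        # Copying returns the same object, so its state keeps
        # advancing across streamer activations.
        return self
```

With this, `copy.deepcopy(prs) is prs` holds, and `prs.randn()` works directly without `.rng` indirection.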
I'm on board with this; for Mux, all Random State is handled by basically that one logic block @ejhumphrey mentioned, so modifying that to enforce random states to use our custom/subclass/deepcopy safe random state should be enough.
(though I don't like StreamSeeder
as a name. Would prefer CopySafeRandomState
or StreamRandomState
or something that makes it clear that it is a thin wrapper on numpy.random.RandomState)
as I stare down a nice long vacation, I have to wonder ... did anything ever happen here? is anyone using pescador to produce deterministically mux'ed data streams? an alternative I'd throw in the mix is to run pescador sampler once and write out the samples / batches to hdf5 as data generation, which is wasteful (disk-wise) but immediately reusable, if not reproducible.
I'm working on a tutorial leveraging pescador and would love the peace of mind that everyone gets the same experience.
did anything ever happen here?
Nope. This is really a thorny problem, and I still don't have a great sense of how things should work.
At this point, I'll reverse myself on prior suggestions, and say that subclassing randomstate would be a very bad idea, as subclassing foreign types is almost always a bad idea.
is anyone using pescador to produce deterministically mux'ed data streams?
No, but I think they should, especially when it comes to re-running training loops. I'd strongly support fixing this problem.
After some careful thought, I think I have a reasonable solution here: if you want a fully reproducible random sequence, then use the (global) numpy random state with a seed fixed at the beginning.
Thinking through the logic of Streamers, one should expect a deterministic sequence to repeat exactly on each activation. Fixing the random seed as a parameter to the generator effectively makes the sequence deterministic, so the behavior is correct as expected.
The snag here is: what if you don't seed the random state inside a mux? In that case, it must be pulling randomness from somewhere, most likely np.random
. When a streamer is activated, it deepcopies everything, including self.rng
, which will be set to the global random state if it's not provided. I think in this case, the expected behavior is not to be deterministic, and multiple activations should produce different numbers. However, the deepcopy will break that.
Solution: keep deepcopy as is, unless the object is the global random state, in which case it should be shallow-copied.
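That rule can be sketched as a __deepcopy__ override; the class and attribute names below are illustrative stand-ins, not pescador's actual internals:

```python
import copy
import numpy as np

class RandomizedThing(object):
    """Illustrative stand-in for a Streamer-like object."""

    def __init__(self, rng=None):
        # rng=None means "use the global numpy random state"
        self.rng = np.random if rng is None else rng

    def __deepcopy__(self, memo):
        cls = self.__class__
        result = cls.__new__(cls)
        memo[id(self)] = result
        for k, v in self.__dict__.items():
            if v is np.random:
                # Global state: pass by reference, never copy, so
                # repeated activations stay non-deterministic.
                setattr(result, k, v)
            else:
                # Seeded RandomState: deepcopy, so each activation
                # replays the same deterministic sequence.
                setattr(result, k, copy.deepcopy(v, memo))
        return result
```

Under this scheme, unseeded objects share the global generator across copies, while explicitly seeded objects retain the current replay-on-activation behavior.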
@ejhumphrey @cjacoby would this resolve things adequately in your opinions? Are there any snags I'm not seeing?
Based on our IRL conversation the other day, and this comment, I agree that this should work, although I'm not familiar with the internals of random state to know for sure.
Updating on this many years later: it's possible that https://pypi.org/project/seedbank/ could be of use here.
It is a common use case for me with pescador to write a randomized "sampler" over a source of observations, e.g. a track, and then mux those together into a data stream. There are two sources of randomness in this that I want to control:
Unfortunately, if you set the seed for a sampler as an argument in the Streamer, it recreates the sample generator with the same seed every time, which is definitely not what you want (especially if rate << total number of samples).

I think one way around this would be to specify a random_state argument in Streamers that gets caught, and can itself be an iterable / generator. Then, each time a Streamer is (re)activated, it could (deterministically) pop the next value off the stack, and then it's seeded all the way through.

thoughts?
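That idea could be sketched like this (all names hypothetical): derive a deterministic stream of seeds from a master seed, and have the sampler pull a fresh seed on each activation.

```python
import numpy as np

def seed_stream(master_seed):
    """Deterministic, unbounded stream of derived seeds."""
    rng = np.random.RandomState(master_seed)
    while True:
        yield rng.randint(0, 2**31)

def my_sampler(seeds, n=5):
    # Each (re)activation pops the next seed, so the sequence differs
    # across activations but the whole run is reproducible end-to-end
    # from the master seed.
    rng = np.random.RandomState(next(seeds))
    for _ in range(n):
        yield rng.randn()
```

Re-running the pipeline with the same master seed then reproduces every activation's output in order.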