pescadores / pescador

Stochastic multi-stream sampling for iterative learning
https://pescador.readthedocs.io
ISC License

on seeding Streamer-wrapped generators #131

Open ejhumphrey opened 6 years ago

ejhumphrey commented 6 years ago

It is a common use case for me with pescador to write a randomized "sampler" over a source of observations, e.g. a track, and then mux those together into a data stream. There are two sources of randomness in this that I want to control:

Unfortunately, if you set the seed for a sampler as an argument in the Streamer, it recreates the sample generator with the same seed every time, which is definitely not what you want (especially if rate << total number of samples).
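
A minimal sketch of that failure mode, for concreteness (the sampler here is a hypothetical stand-in; the repetition comes from the generator being re-created with the same seed on each activation):

import numpy as np
import pescador

def sampler(seed):
    # The generator function is re-invoked on every activation, so the
    # RandomState is re-seeded identically each time.
    rng = np.random.RandomState(seed)
    while True:
        yield rng.randn()

s = pescador.Streamer(sampler, 42)
first = list(s.iterate(max_iter=3))
second = list(s.iterate(max_iter=3))  # re-activation replays the same draws
assert first == second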

I think one way around this would be to specify a random_state argument in Streamers that gets caught, and can itself be an iterable / generator. Then, each time a Streamer is (re)activated, it could deterministically pull the next seed from that iterable, so everything is seeded all the way through.
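
A rough sketch of the idea (hypothetical; Streamer does not currently catch a random_state argument like this): a deterministic, endless supply of seeds from which each (re)activation would pull the next value.

import numpy as np

def seed_stream(master_seed):
    # Deterministic generator of per-activation seeds.
    rng = np.random.RandomState(master_seed)
    while True:
        yield rng.randint(0, 2**31 - 1)

seeds = seed_stream(1234)
# Each (re)activation would consume the next seed:
rng_first_activation = np.random.RandomState(next(seeds))
rng_second_activation = np.random.RandomState(next(seeds))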

thoughts?

bmcfee commented 6 years ago

That's an interesting problem. I don't think the streamer or mux objects need to get involved though.

I think it can be handled effectively by passing a random_state object to the streamer (by reference), rather than a seed? That way, the state object persists across instantiations of the streamer, but the overall sequence is still reproducible. Something like:

streams = [pescador.Streamer(my_sampler, np.random.RandomState(my_seed), other_params) for other_params in ...]

You just need to write my_sampler so that it can accept randomstate objects directly. A quick spot check verifies that randomstates can be serialized, so I don't see any problems down the line.
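
A sketch of what such a sampler could look like (my_sampler and its data argument are hypothetical; the point is that it draws from the RandomState it is handed rather than constructing one from a seed):

import numpy as np
import pescador

def my_sampler(rng, data):
    # All randomness comes from the supplied RandomState; no seeding inside.
    while True:
        idx = rng.randint(len(data))
        yield dict(X=data[idx])

data = np.arange(10)
streams = [pescador.Streamer(my_sampler, np.random.RandomState(seed), data)
           for seed in range(3)]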

ejhumphrey commented 6 years ago

so I tried this (or tried to try it), and it doesn't seem to do the trick. I believe I observed the same RandomState object being passed to each sampler when it gets activated; in other words, resetting the generator also resets the RandomState object, so I don't think it is stateful across activations.

I'll give it a closer look a little later to be sure, though; about to drop into a meeting.

bmcfee commented 6 years ago

I believe I observed the same RandomState object being passed to each sampler when it gets activated; in other words, resetting the generator also resets the RandomState object, so I don't think it is stateful across activations.

The object should be the same (hash match), but its internal state should persist and advance across activations.

ejhumphrey commented 6 years ago

should, yes, but I don't think that's the behavior I observed. There was a bunch of code around that check, though, so I want to isolate that exactly and know for sure.

Also I was flying back from Europe in some kind of delirious fugue state.

bmcfee commented 6 years ago

Confirmed:

In [29]: def my_gen(state):
    ...:     for i in range(5):
    ...:         yield state.randn()
    ...:         

In [30]: s = np.random.RandomState(1)

In [31]: streamer = pescador.Streamer(my_gen, s)

In [32]: mux = pescador.ChainMux([streamer], mode='cycle')

In [33]: list(mux.iterate(max_iter=15))
Out[33]: 
[1.6243453636632417,
 -0.6117564136500754,
 -0.5281717522634557,
 -1.0729686221561705,
 0.8654076293246785,
 1.6243453636632417,
 -0.6117564136500754,
 -0.5281717522634557,
 -1.0729686221561705,
 0.8654076293246785,
 1.6243453636632417,
 -0.6117564136500754,
 -0.5281717522634557,
 -1.0729686221561705,
 0.8654076293246785]

Note that the random sequence loops with period 5. Repeatedly calling the iterator with the same state object does not do this:

In [34]: list(my_gen(s))
Out[34]: 
[1.6243453636632417,
 -0.6117564136500754,
 -0.5281717522634557,
 -1.0729686221561705,
 0.8654076293246785]

In [35]: list(my_gen(s))
Out[35]: 
[-2.3015386968802827,
 1.74481176421648,
 -0.7612069008951028,
 0.31903909605709857,
 -0.2493703754774101]

In [36]: list(my_gen(s))
Out[36]: 
[1.462107937044974,
 -2.060140709497654,
 -0.3224172040135075,
 -0.38405435466841564,
 1.1337694423354374]

In [37]: list(my_gen(s))
Out[37]: 
[-1.0998912673140309,
 -0.17242820755043575,
 -0.8778584179213718,
 0.04221374671559283,
 0.5828152137158222]

In hindsight, this is obviously an artifact of stream activation invoking a deep copy.

@cjacoby any ideas for workarounds? One brutal hack would be to make a custom randomstate object that overrides __deepcopy__ to revert to shallow behavior. Is there something more elegant we could do?

bmcfee commented 6 years ago

Update: clobbering deepcopy does the trick:

In [2]: class CRandomState(np.random.RandomState):
   ...:     def __deepcopy__(self, memodict={}):
   ...:         return self
   ...:     

In [3]: def my_gen(state):
   ...:     for i in range(5):
   ...:         yield state.randn()
   ...:         

In [4]: s = CRandomState(1)

In [7]: streamer = pescador.Streamer(my_gen, s)

In [8]: mux = pescador.ChainMux([streamer], mode='cycle')

In [9]: list(mux.iterate(max_iter=25))
Out[9]: 
[1.6243453636632417,
 -0.6117564136500754,
 -0.5281717522634557,
 -1.0729686221561705,
 0.8654076293246785,
 -2.3015386968802827,
 1.74481176421648,
 -0.7612069008951028,
 0.31903909605709857,
 -0.2493703754774101,
 1.462107937044974,
 -2.060140709497654,
 -0.3224172040135075,
 -0.38405435466841564,
 1.1337694423354374,
 -1.0998912673140309,
 -0.17242820755043575,
 -0.8778584179213718,
 0.04221374671559283,
 0.5828152137158222,
 -1.1006191772129212,
 1.1447237098396141,
 0.9015907205927955,
 0.5024943389018682,
 0.9008559492644118]

ejhumphrey commented 6 years ago

ah, great catch! so it seems like there are two options:

  1. Fix copy logic around activation (yuck)
  2. Introduce a deepcopy-safe RandomState object (perhaps in pescador)

(2) is definitely easier, and could be provided as the best-practice way of seeding generators wrapped as streamers, though I worry about users not realizing what effects doing the "wrong" thing will have: using NumPy's built-in RandomState will functionally work but could statistically wreak havoc.

(1) sounds ... bad.

bmcfee commented 6 years ago

(2) sounds like a good solution to me.

I think the best we can do is provide ample warnings in the documentation and a gallery example of how to do it the right way.

ejhumphrey commented 6 years ago

agreed! while we're at it, does the clobbered RandomState need to do anything else beyond breaking deepcopy?

... nothing immediately comes to mind for me.

bmcfee commented 6 years ago

I don't believe so. Lemme ping the scipy folks to see if they have any suggestions / see any problems with this approach.

cjacoby commented 6 years ago

Trying to think this through; what are the use cases here? Do we always want random states to be global to the parent/factory streamer? Or only in the case where you passed it a RandomState object / random seed?

I already had to do a little hacky thing to handle the "global" random state in mux.py::Mux/mux.py::BaseMux. My hunch (without checking) is that because of the way I handled that, it will actually do the right thing in Mux for the "global RandomState" case, but it will always reset to the "Factory" copy of the RandomState if you passed in the random state.

Do we always want "child copy" streamers to behave this way? Or is there a situation where you would want it to restart?

If the answer is "the random state should always just be passed by reference on copy", then I might recommend migrating the code block below (this is BaseMux::__deepcopy__) into pescador.Streamer, so that Streamers (and therefore Mux) always know that RandomStates should be passed through, not copied:

def __deepcopy__(self, memo):
    """This override is required to handle copying the random_state:
    when using `random_state=None`, the global state is used;
    modules are not 'deepcopy-able', so we have to make a special case for
    it.
    """
    cls = self.__class__
    copy_result = cls.__new__(cls)
    memo[id(self)] = copy_result
    for k, v in six.iteritems(self.__dict__):
        # You can't deepcopy a module! If rng is np.random, just pass
        # it over without trying.
        if k == 'rng' and v == np.random:
            setattr(copy_result, k, v)
        # In all other cases, assume a deepcopy is the right choice.
        else:
            setattr(copy_result, k, copy.deepcopy(v, memo))

    return copy_result

but change the loop to be:

    for k, v in six.iteritems(self.__dict__):
        # If v is np.random or an np.random.RandomState, copy by reference
        # so that the random state is maintained across copies.
        if v == np.random or isinstance(v, np.random.RandomState):
            setattr(copy_result, k, v)
        # In all other cases, assume a deepcopy is the right choice.
        else:
            setattr(copy_result, k, copy.deepcopy(v, memo))

cjacoby commented 6 years ago

(I'm happy to make this change if y'all agree that's a good plan)

bmcfee commented 6 years ago

I think it would be a lot clearer if we just subclassed the RandomState to implement the above behavior. I could imagine use cases where you want a random (but repeatable) sequence, and your proposed hack of never deepcopying any RandomState objects would make that impossible.

OTOH, providing an explicitly persistent randomstate object makes the intent clear, but still allows for the current behavior if that's preferable.

cjacoby commented 6 years ago

Ok :+1:

ejhumphrey commented 6 years ago

circling back to a conversation IRL, consensus is that we should introduce a PescadorSeed (or whatever) object that holds onto an np.random.RandomState object, rather than subclassing it directly.

I'll play with this a bit and propose something concrete when I've got it working.

ejhumphrey commented 6 years ago

ah! I've found my root problem (related to #132): I'm mux'ing muxes, and that's really screwing some stuff up under the hood. When this is implemented, we'll need to add an additional type check to this logic block.

I might just be getting a little fried at the end of a work day, but I don't immediately see a good way to maintain a polymorphic RandomState interface if we're not going to subclass it. My preliminary hackjob (and test) is the following...

import copy
import numpy as np


class StreamSeeder(object):
    """Thin wrapper around np.random.RandomState whose deepcopies resume
    from the current state rather than restarting from the original seed."""

    def __init__(self, seed):
        self.rng = np.random.RandomState(seed)

    @property
    def state(self):
        return self.rng.get_state()

    @state.setter
    def state(self, rng_state):
        self.rng.set_state(rng_state)

    @classmethod
    def from_state(cls, rng_state):
        obj = cls(None)
        obj.state = rng_state
        return obj

    def __deepcopy__(self, memodict={}):
        # Copies pick up where the original left off.
        return StreamSeeder.from_state(self.state)

Then it at least basically behaves the way you'd want.

def test_StreamSeeder():
    seeder = StreamSeeder(123)

    values_a = [seeder.rng.randint(0, 2**16) for _ in range(200)]
    seeder = StreamSeeder(123)
    values_b = []
    for n in range(5):
        values_b += [seeder.rng.randint(0, 2**16) for _ in range(40)]
        seeder = copy.deepcopy(seeder)

    assert values_a == values_b

I suppose at least inside a Mux, we might be able to get away with:

...
elif isinstance(random_state, StreamSeeder):
    self.rng = random_state.rng

which ... I think does the right thing. Calls to self.rng would still modify the root random_state.rng object, since it's all just identifier passing.
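
A quick standalone check of that reference-passing behavior (nothing pescador-specific here):

import numpy as np

root = np.random.RandomState(0)
alias = root             # same object under another name
alias.randn()            # advances the shared internal state
# root now continues from the advanced state rather than a fresh seed-0 draw:
assert root.randn() != np.random.RandomState(0).randn()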

any red flags?

bmcfee commented 6 years ago

which ... I think does the right thing. Calls to self.rng would still modify the root random_state.rng object, since it's all just identifier passing.

any red flags?

This will still fail when you activate the mux, because self.rng will deepcopy on activation.

I think we should either subclass the randomstate object (probably a bad idea overall), or rewrite our randomized code to always use the protected randomstate objects where appropriate.

If we want to get fancy, we can have the protected randomstate object pass getattr calls through to the underlying randomstate. That might be a bad idea, but it would simplify syntax throughout.
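
A sketch of what that pass-through could look like (hypothetical wrapper, not pescador API): it survives deepcopy by returning itself, and forwards everything else to the wrapped RandomState.

import copy
import numpy as np

class PersistentRandomState(object):
    """Deepcopy-safe wrapper that forwards attribute access to a RandomState."""

    def __init__(self, seed=None):
        self._rng = np.random.RandomState(seed)

    def __deepcopy__(self, memo):
        # Share state across copies instead of duplicating it.
        return self

    def __getattr__(self, name):
        # Forward randn, randint, permutation, etc. to the wrapped object.
        return getattr(self._rng, name)

rng = PersistentRandomState(1)
rng.randn()                       # works via the pass-through
assert copy.deepcopy(rng) is rng  # copies share the same state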

cjacoby commented 6 years ago

I'm on board with this; for Mux, all random-state handling goes through basically that one logic block @ejhumphrey mentioned, so modifying that to require our custom / subclassed / deepcopy-safe random state should be enough.

(though I don't like StreamSeeder as a name. Would prefer CopySafeRandomState or StreamRandomState or something that makes it clear that it is a thin wrapper on numpy.random.RandomState)

ejhumphrey commented 5 years ago

as I stare down a nice long vacation, I have to wonder ... did anything ever happen here? Is anyone using pescador to produce deterministically mux'ed data streams? An alternative I'd throw in the mix is to run a pescador sampler once and write the samples / batches out to HDF5 as a data-generation step, which is wasteful (disk-wise) but immediately reusable, if not reproducible.

I'm working on a tutorial leveraging pescador and would love the peace of mind that everyone gets the same experience.

bmcfee commented 5 years ago

did anything ever happen here?

Nope. This is really a thorny problem, and I still don't have a great sense of how things should work.

At this point, I'll reverse myself on prior suggestions, and say that subclassing randomstate would be a very bad idea, as subclassing foreign types is almost always a bad idea.

is anyone using pescador to produce deterministically mux'ed data streams?

No, but I think they should, especially when it comes to re-running training loops. I'd strongly support fixing this problem.

bmcfee commented 5 years ago

After some careful thought, I think I have a reasonable solution here: if you want a fully reproducible random sequence, then use the (global) numpy random state with a seed fixed at the beginning.

Thinking through the logic of Streamers, one should expect a deterministic sequence to repeat exactly on each activation. Fixing the random seed as a parameter to the generator effectively makes the sequence deterministic, so the behavior is correct as expected.

The snag here is: what if you don't seed the random state inside a mux? In that case, it must be pulling randomness from somewhere, most likely np.random. When a streamer is activated, it deepcopies everything, including self.rng, which will be set to the global random state if it's not provided. I think in this case, the expected behavior is not to be deterministic, and multiple activations should produce different numbers. However, the deepcopy will break that.

Solution: keep deepcopy as is, unless the object is the global random state, in which case it should be shallow-copied.
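
A minimal sketch of that rule as a copy helper (hypothetical placement; np.random.mtrand._rand is the private RandomState instance that backs the module-level np.random functions):

import copy
import numpy as np

def _copy_rng_aware(v, memo):
    # Pass the module or its global RandomState through untouched;
    # deepcopy everything else as before.
    if v is np.random or v is np.random.mtrand._rand:
        return v
    return copy.deepcopy(v, memo)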

@ejhumphrey @cjacoby would this resolve things adequately in your opinions? Are there any snags I'm not seeing?

cjacoby commented 5 years ago

Based on our IRL conversation the other day, and this comment, I agree that this should work, although I'm not familiar enough with the internals of random state to know for sure.

bmcfee commented 8 months ago

Updating on this many years later: it's possible that https://pypi.org/project/seedbank/ could be of use here.