pescadores / pescador

Stochastic multi-stream sampling for iterative learning
https://pescador.readthedocs.io
ISC License

Interface to modify StochasticMux weights #94

Open ejhumphrey opened 7 years ago

ejhumphrey commented 7 years ago

For reasons that I like and make @bmcfee nervous, I'd like to implement an interface that would allow a user to modify the distribution of the Mux on the fly.

ejhumphrey commented 7 years ago

I'll volunteer to take a look at this and determine whether this is in/out for 1.1.

bmcfee commented 7 years ago

I think it should probably stay out for 1.1, since 2.0 will include a pretty substantial refactor of the mux anyway.

ejhumphrey commented 7 years ago

Punt on this for 1.1; deal with it on a case-by-case basis, and maybe factor common interfaces out later should they (actually) materialize.

bmcfee commented 6 years ago

Circling back since the mux refactor is finished (#107).

@ejhumphrey do you have concrete ideas for what exactly you want to do to mux states? I'm still wary about allowing external control, since that would really mess up things like chainmux or roundrobin.

bmcfee commented 6 years ago

Re-upping for the new year: @ejhumphrey?

ejhumphrey commented 6 years ago

Thanks for the poke ... if I'm not mistaken, based on my peripheral awareness of what's been happening here, I'll need to set aside some time to look into the current state of master to respond intelligently. I'll throw this on the backlog, but it's looking like I can get to this at the end of the week (a couple of Wednesday deadlines).

cjacoby commented 6 years ago

I'm in NYC this week if you need some guidance; I've got all this solidly in my brain right now.


bmcfee commented 6 years ago

I'm wondering if it makes sense to punt this up to 2.1, and scope 2.0 to just the mux refactor (and requisite other updates)?

cjacoby commented 6 years ago

I don't have a good sense of what this one is trying to achieve. Unlike #116 and #110, I think this one has a slightly higher probability of affecting the API, so it might be good to at least get an "on-paper" proposal in this thread for exactly what this might mean before deciding that?

bmcfee commented 6 years ago

Fair enough. Pinging @ejhumphrey again.

ejhumphrey commented 6 years ago

one specific use-case I'm thinking of is re-weighting streams / samples as a function of model performance, e.g. spend less time on the data points already covered by the model. Not all observations have equal information gain, and it's intuitively appealing to spend less time on data that doesn't add much to the error signal. Haven't done any digging, but I'd assume there's some theory out there on this.

maybe this is most easily achieved by creating a new stream every so often (sketched below) ... but there might be gains if this were plumbed directly back into a stream at each iteration (we wouldn't have to open / close views of data on disk).
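
For concreteness, a minimal sketch of the recreate-the-mux-every-so-often workaround, assuming the current StochasticMux API; the per-stream loss measurement is stand-in application code, not anything pescador provides:

import numpy as np
import pescador

def noise_stream(stream_id):
    # Infinite stream of toy samples tagged by origin
    while True:
        yield {'x': np.random.randn(4), 'stream_id': stream_id}

streamers = [pescador.Streamer(noise_stream, i) for i in range(3)]
weights = np.ones(len(streamers)) / len(streamers)

for epoch in range(5):
    # Rebuild the mux with the latest weights at each epoch boundary
    mux = pescador.StochasticMux(streamers, n_active=2, rate=8,
                                 weights=weights)
    for sample in mux.iterate(max_iter=100):
        pass  # train on `sample` here

    # Stand-in for a real per-stream loss measurement
    losses = np.random.rand(len(streamers))
    weights = losses / losses.sum()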

This seems pretty advanced all things considered, so I'm content to continue dragging this along for posterity.

bmcfee commented 6 years ago

one specific use-case I'm thinking of is re-weighting streams / samples as a function of model performance, e.g. spend less time on the data points already covered by the model.

The machine-learner in me gets real nervous around dynamically changing the training distribution in-flight, since it makes convergence a real thorny issue.

I think the easiest way to make this work, if you're really set on it, is to add an interface for updating the weights vector. I don't see this breaking any of the rest of the API changes from the 2.0 refactor, so I think we can safely punt to 2.1.
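
For the sake of discussion, the shape of such an interface might look something like the following; ReweightableMux and set_weights are hypothetical, not part of the API, and this assumes the mux keeps self.streamers and self.weights as discussed in this thread:

import numpy as np
import pescador

class ReweightableMux(pescador.StochasticMux):
    # Hypothetical: expose a setter for the weights vector

    def set_weights(self, new_weights):
        new_weights = np.atleast_1d(new_weights).astype(float)
        if len(new_weights) != len(self.streamers):
            raise ValueError('expected {} weights'.format(len(self.streamers)))
        # Normalize and store; note that an *active* mux also keeps a
        # separate distribution_ that would need the same update
        self.weights = new_weights / new_weights.sum()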

Haven't done any digging, but I'd assume there's some theory out there on this.

This sounds like boosting / hard-core sets, for which there's tons of literature. (Not sure any of it applies to the usual pescador applications trained with SGD, though.)

maybe this is most easily achieved by creating a new stream every so often

I don't think we should enable this kind of behavior. It would vastly complicate the implementation, because all of the parameter vectors of length n_streams would have to be reconstructed at length n_streams + 1. Better IMO to keep the streamer set fixed.

bmcfee commented 6 years ago

Reflecting on this a bit more: the one big missing piece of this is a way to track which iterates came from which streamers, so that you have a way to measure the loss per streamer and derive weight updates accordingly. I see two options here:

  1. Customize your streamers to include metadata as part of their output stream
  2. Adapt the mux to generate the metadata for you

Option 2 makes me nervous because it's fundamentally incompatible with the core Streamer API -- why would a single streamer ever need to do that?

Option 1 seems preferable, but then it's on the user to manage the metadata / streamer index. I think this is probably a good thing anyway, as your proposed use case would (I think) require a lot of custom evaluation code anyway, so a little metadata overhead seems like NBD.

Other arguments for option 1:

  • All metadata becomes application logic, so we don't have to change pescador at all
  • It lets you get creative with the index. Maybe streamers don't identify themselves specifically, but by group ids (example: genre). Then you can have shared group ids between train and validation -- when you validate, if one group id is doing significantly worse, then you can go and re-up the weights for training streamers belonging to that group in the train mux. This sort of thing would be impossible in a top-down control scenario (option 2).

cjacoby commented 6 years ago

My 2¢: I think that 1 is much cleaner, and allows us to punt these sorts of decisions to the "application developer". This is key:

All metadata becomes application logic, so we don't have to change pescador at all

As long as each/every streamer includes sufficient metadata in each "sample" produced, it should be entirely possible to track data sources based on that.
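
A minimal sketch of option 1 against the current API; the stream_id key is illustrative application-level metadata, not anything pescador prescribes:

import numpy as np
import pescador

def tagged_stream(stream_id):
    # Each sample carries the id (or group id) of its source streamer
    while True:
        yield {'x': np.random.randn(4), 'stream_id': stream_id}

streamers = [pescador.Streamer(tagged_stream, i) for i in range(3)]
mux = pescador.StochasticMux(streamers, n_active=2, rate=4)

for sample in mux.iterate(max_iter=10):
    # Application code can group losses by sample['stream_id']
    print(sample['stream_id'])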


bmcfee commented 6 years ago

I think I'm convinced that option 1 above is the way to go, and that this issue really simplifies to the updated title (modifying StochasticMux weights in-flight).

So the question now becomes: can mux weights be updated in-flight without breaking the 2.0 API? I can imagine two reasons this could be difficult, neither of which has to do with the API per se:

  1. weights is an object variable, but it gets copied over to distribution_ when the mux is active. Any manipulation would have to update both weights and distribution_.
  2. The context-manager copy of the mux object could make this impossible to access, since active_mux is local to the iterate method. I don't know of a good solution to this problem, and it does not seem to be a function of the user-facing API. Rather, it's a function of requiring composability of muxen, which led to the copy-on-context model adopted in #113 so that multiple activation does not break.

To summarize:

  1. StochasticMux activates itself with a copy. If a mux object is multiply activated (e.g., by a higher-level mux with `mode='with_replacement'`), then there will be multiple active copies of the same mux object.
  2. This means that we can't directly access the active instance of a stochasticmux object, because there could be more than one.
  3. This in turn means that updating the weights of a mux would not affect its active copy. The update could take effect when the copy is deactivated and reactivated, but the mux itself has no direct control over that if it's being controlled by a higher mux.
  4. If we tried to hack this by overriding the __deepcopy__ method so that weights are shallow-copied (by reference), that might work, unless it crosses a process boundary (e.g., by upstream encapsulation in ZMQ). A sketch of the idea follows below.
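
To make point 4 concrete, a self-contained sketch of the shared-reference __deepcopy__ hack (pure Python, no pescador internals); within one process it behaves as described, but it falls apart once a copy is serialized across a process boundary:

import copy
import numpy as np

class SharedWeights:
    def __init__(self, weights):
        self.weights = np.asarray(weights, dtype=float)

    def __deepcopy__(self, memo):
        clone = SharedWeights.__new__(SharedWeights)
        # Shallow-copy by reference: both objects share one array
        clone.weights = self.weights
        return clone

parent = SharedWeights([1.0, 1.0])
active = copy.deepcopy(parent)   # stands in for copy-on-activation
parent.weights[:] = [0.9, 0.1]   # in-place update on the parent...
print(active.weights)            # ...is visible in the "active" copy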

I think I'm convinced that this entire issue is a non-starter, but I'm open to creative arguments to the contrary.

cjacoby commented 6 years ago

I actually had a think about this yesterday; one way to update the active instances would be via "event"-based callback functions, given to the "original" Mux, somewhat Keras-like in style.

If we decide to go this route, we may want to consider putting it into 2.0. It should be only additive to the API, though, so I could see delaying on it. I think this sort of solution could be clean, and allow what Eric is looking for without baking exactly that behavior into the core functionality.

bmcfee commented 6 years ago

I actually had a think about this yesterday; one way to update the active instances would be via "event"-based callback functions, given to the "original" Mux, somewhat Keras-like in style.

Can you say more about how this would work? AFAICT it would require the parent mux to keep a handle to all clones. Is there another way?

cjacoby commented 6 years ago

I'll prototype this in a gist in the next day or two and try to see if what I'm thinking actually makes any sense / works the way I think it does.

bmcfee commented 6 years ago

@cjacoby Thanks -- keep me posted. This might be the last blocker on 2.0, so it'd be good to sort out its status soon.

cjacoby commented 6 years ago

Agreed 👌.


cjacoby commented 6 years ago

Okay! I did some homework. I didn't do it for the mux yet, just for the base Streamer class, but as a proof of concept, I think it does the thing. Please observe the changes on this branch; it's basically just hacked together, but it shows the thing.

Basically, I added a CallbackList and PescadorCallback, very much the way they work in Keras. (Copied, really.) With this code as a base, I ran the following test code:

import logging
import pescador

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("testLogger")

def gen_str(the_str):
    for char in the_str:
        yield char

class TestCallback(pescador.PescadorCallback):
    def on_activated(self, logs=None):
        logger.info("Streamer on_activated (streamer={})".format(
            self.streamer))

    def on_completed(self, logs=None):
        logger.info("Streamer on_completed (streamer={})".format(
            self.streamer))

    def on_exit(self, logs=None):
        logger.info("Streamer on_exit (streamer={})".format(
            self.streamer))

    def on_cycle(self, logs=None):
        logger.info("Streamer on_cycle (streamer={})".format(
            self.streamer))

a = pescador.Streamer(gen_str, "aaaaaaa", callbacks=[TestCallback()])

print("First Test")
print(list(a.iterate(5)))

print("Second Test")
print(list(a.iterate()))

print("Third Test (Cycle)")
print(list(a.cycle(max_iter=20)))

Test Output

First Test
INFO:testLogger:Streamer on_activated (streamer=<pescador.core.Streamer object at 0x10538cd30>)
INFO:testLogger:Streamer on_completed (streamer=<pescador.core.Streamer object at 0x10538cd30>)
INFO:testLogger:Streamer on_exit (streamer=<pescador.core.Streamer object at 0x103b64e80>)
['a', 'a', 'a', 'a', 'a']
Second Test
INFO:testLogger:Streamer on_activated (streamer=<pescador.core.Streamer object at 0x1056097f0>)
INFO:testLogger:Streamer on_completed (streamer=<pescador.core.Streamer object at 0x1056097f0>)
INFO:testLogger:Streamer on_exit (streamer=<pescador.core.Streamer object at 0x103b64e80>)
['a', 'a', 'a', 'a', 'a', 'a', 'a']
Third Test (Cycle)
INFO:testLogger:Streamer on_activated (streamer=<pescador.core.Streamer object at 0x105666588>)
INFO:testLogger:Streamer on_completed (streamer=<pescador.core.Streamer object at 0x105666588>)
INFO:testLogger:Streamer on_exit (streamer=<pescador.core.Streamer object at 0x103b64e80>)
INFO:testLogger:Streamer on_cycle (streamer=<pescador.core.Streamer object at 0x103b64e80>)
INFO:testLogger:Streamer on_activated (streamer=<pescador.core.Streamer object at 0x1056665f8>)
INFO:testLogger:Streamer on_completed (streamer=<pescador.core.Streamer object at 0x1056665f8>)
INFO:testLogger:Streamer on_exit (streamer=<pescador.core.Streamer object at 0x103b64e80>)
INFO:testLogger:Streamer on_cycle (streamer=<pescador.core.Streamer object at 0x103b64e80>)
INFO:testLogger:Streamer on_activated (streamer=<pescador.core.Streamer object at 0x105666710>)
INFO:testLogger:Streamer on_exit (streamer=<pescador.core.Streamer object at 0x103b64e80>)
['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a']

Conclusions

I'm available and happy to take this on this weekend, if you approve / think it's a good idea, although I would like your input as to what appropriate callbacks would be (and maybe on documentation).

ejhumphrey commented 6 years ago

okay, a few thoughts.

  1. I REALLY like this design.
  2. I still think this is some advanced-as-hell functionality, and is out of scope for 2.0. I'd be keen to start developing ideas around this, but it might be prudent to spend some more time kicking this around and prototyping. It feels like we'll benefit from more first-hand experience with this, and this shouldn't block 2.0.
  3. I realize now that you can kind of achieve the functionality I had in mind for this issue in the first place by re-creating streamers with different weights. I'm fine with this for now.

cjacoby commented 6 years ago

Roger that; I am pro keeping 2.0 reasonable so we can get it out the door.

bmcfee commented 6 years ago

Awesome, thanks for digging into this! I've reassigned to 2.1.

bmcfee commented 5 years ago

I'm still not 100% sure how this is going to work. Do callback objects get propagated to the activated copy of a mux? What does the callback API look like across streamers and muxen? How does a callback function access the mux's internal state?
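
To make the question concrete: one purely hypothetical shape (nothing like this exists today) would be to point the parent's callbacks at the activated copy on context entry, relying on the copy-on-context model discussed above:

import pescador

class CallbackStreamer(pescador.Streamer):
    # Hypothetical subclass: callbacks ride along to the active copy
    def __init__(self, streamer, *args, callbacks=None, **kwargs):
        super().__init__(streamer, *args, **kwargs)
        self.callbacks = list(callbacks or [])

    def __enter__(self):
        # pescador activates a deepcopy on context entry; point the
        # parent's callbacks at that copy so they can see its state
        active = super().__enter__()
        for cb in self.callbacks:
            cb.streamer = active
        return active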