pescadores / pescador

Stochastic multi-stream sampling for iterative learning
https://pescador.readthedocs.io
ISC License

Alternate distributions for StochasticMux #148

Closed bmcfee closed 8 months ago

bmcfee commented 5 years ago

I've been doing a bit of analysis of pescador's sampling behavior, and the Poisson distribution is a bit unwieldy. However, for a specific choice of parameters, drawing n_samples_to_stream from a binomial distribution leads to a clean solution in terms of total replacement rate of streamers.

More generally, I think it would be useful to extend StochasticMux with a dist= parameter for selecting the sampling scheme, e.g. dist='poisson', dist='binomial', or dist='constant'.

bmcfee commented 5 years ago

Quick summary of the analysis: say we have uniform weights over n_active active streamers. Each streamer has a rate limit that's sampled from a binomial distribution

n_samples ~ Binomial(n, p), with n = rate * n_active / (n_active - 1) and p = (n_active - 1) / n_active.

These expectations match exactly the case where the number of samples is deterministic: n_samples = rate. The benefit of randomizing n_samples is that you don't get a mass extinction event every rate + K samples; the replacements are less concentrated.
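As a quick standalone sanity check (illustrative numbers of my own choosing, not pescador code; the parameters are picked so that the number of trials comes out to an integer), the binomial draws match the deterministic rate in expectation, with variance rate / n_active:

```python
import numpy as np

# Illustrative check that the binomial rate model matches the deterministic
# case in expectation. Parameters are hypothetical, chosen so that the
# number of trials n works out to an integer.
n_active, rate = 8, 14
n = rate * n_active // (n_active - 1)   # 16 trials
p = (n_active - 1) / n_active           # success probability 7/8

rng = np.random.default_rng(0)
draws = rng.binomial(n, p, size=200_000)

print(draws.mean())  # ≈ rate = 14
print(draws.var())   # ≈ rate / n_active = 1.75
```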

In exhaustive mode, we want to know how long it takes to hit all the streamers. We can think of this as each streamer slot being replaced (n_streams / n_active) - 1 times, and the expected time comes out to rate * (n_streams - n_active) / n_active (exactly what you'd hope for).
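A small event-driven Monte Carlo (a sketch under assumed parameters and uniform weights, not pescador's implementation) agrees with this approximation:

```python
import heapq
import numpy as np

# Monte Carlo sanity check of the exhaustion-time approximation
#   rate * (n_streams - n_active) / n_active
# under uniform weights. All numbers here are illustrative.
n_streams, n_active, rate = 32, 8, 14
n = rate * n_active // (n_active - 1)   # 16 binomial trials
p = (n_active - 1) / n_active           # 7/8

rng = np.random.default_rng(0)
times = []
for _ in range(2000):
    # Each slot holds a streamer that dies after a Binomial(n, p) lifetime;
    # in exhaustive mode every replacement draws a not-yet-seen streamer.
    deaths = [rng.binomial(n, p) for _ in range(n_active)]
    heapq.heapify(deaths)
    remaining = n_streams - n_active    # replacements left until exhaustion
    t = 0
    while remaining > 0:
        t = heapq.heappop(deaths)       # earliest replacement event
        remaining -= 1
        heapq.heappush(deaths, t + rng.binomial(n, p))
    times.append(t)

print(np.mean(times))  # close to rate * (n_streams - n_active) / n_active = 42
```

The simulated mean lands slightly above the formula, since exhaustion waits on the slowest slot; that slack is what the concentration inequality argument is bounding.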

Finally, we'd want to know what the chance of a streamer not being replaced is. This can be calculated numerically using the negative binomial survival function (which is ugly AF), but we could also be lazy and use some concentration inequalities. If we think of rate * n_active as the average time of an 'episode', then we could ask what the chance of a streamer lasting some b>=1 episodes longer than average. This probability is upper-bounded by 1/(b**2 * rate); so with high probability, every streamer will be replaced quickly and the distribution should not become unbalanced. (Corollary: the approximation for time to exhaustion above should be pretty good.)

lostanlen commented 5 years ago

Cool!

So is pescador going to be renamed to pascalador then?*

*(https://en.wikipedia.org/wiki/Binomial_distribution#History)

bmcfee commented 5 years ago

So is pescador going to be renamed to pascalador then?*

Negative.

cjacoby commented 5 years ago

:+1: :100:

bmcfee commented 5 years ago

Quick update: I implemented this and compared poisson/binomial/constant using the notebook linked from #104. This basically computes the rate at which the muxed distribution approaches the full distribution. The idealized case with n_active = n_total is given in the right-most columns for context.

TL;DR: there's not much difference across the different schemes, but the scalloping effects (#132) are obscured by the error regions in these plots. They'll be more pronounced for binomial and constant, and less so for poisson. The randomized offset idea mentioned at the end of #132 could be a nice solution to that, though it is not tested here.

Results:

[three result plots comparing the sampling schemes; images not included]

bmcfee commented 4 years ago

Would anyone object to me implementing this and pushing out a 2.2 release, rather than waiting for 3.x?

cjacoby commented 4 years ago

No objection; available for code review.

-C


bmcfee commented 4 years ago

One irritating snag here: the binomial distribution explodes on us if n_active=1, because the rate parameter gets scaled up by n_active / (n_active - 1). This is a weird edge case that we never had to worry about before, since the poisson model didn't mix in the active set size.

For n_active > 1, the binomial model has the invariant that the expected number of samples per streamer is always rate, and the variance scales like rate / n_active. So large active sets mean low variance in the number of samples per streamer; small active sets should have high variance in the number of samples per streamer. When n_active=1, maintaining this would require a random variable with mean = variance = rate, eg a poisson variable.

Does it make sense to fall back on poisson for this case? Or should we simply require n_active >= 2 for dist='binomial'? The latter side-steps the issue, but makes the API somewhat inelegant.

cjacoby commented 4 years ago

I'm fine with falling back to poisson for n_active == 1, as long as it is clearly documented in the docstring.

My gut is that restricting n_active >= 2 is a worse user experience than falling back to poisson, but I'm not confident in that feeling right now. I suppose we could also raise a warning to make it clear to the end user that the fallback has happened.
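For concreteness, the fallback-plus-warning behavior could look something like this (a sketch with made-up names and signature, not pescador's actual API):

```python
import warnings
import numpy as np

def sample_n(rate, n_active, dist="binomial", rng=None):
    """Hypothetical helper (not pescador's API): draw the per-activation
    sample budget, falling back to poisson when binomial is undefined."""
    rng = np.random.default_rng() if rng is None else rng
    if dist == "binomial" and n_active < 2:
        warnings.warn("dist='binomial' is undefined for n_active=1; "
                      "falling back to dist='poisson'")
        dist = "poisson"
    if dist == "binomial":
        n = int(round(rate * n_active / (n_active - 1)))
        p = (n_active - 1) / n_active
        return rng.binomial(n, p)       # mean rate, variance rate / n_active
    if dist == "poisson":
        return rng.poisson(rate)        # mean = variance = rate
    raise ValueError(f"unknown dist={dist!r}")
```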


bmcfee commented 4 years ago

My gut is that restricting n_active >= 2 is a worse use experience than falling back to poisson, but I'm not confident in that feeling right now.

I agree with that. I went and redid the analysis allowing for non-uniform streamer weights, and it came out a bit more nuanced. The end story is the same though, convergence to poisson makes sense when we only have one active streamer and still want a randomized refresh.

In implementing this, I ran into some snags with the unit tests. In the poisson rate model, we hacked a bias of +1 into the sampler to avoid having streamers that start and stop without generating data. This is mostly a desirable property in practice, but it messes with the analysis pretty badly, and I'd like to avoid it in the binomial model. However, this means that the binomial model does need to handle the n=0 case gracefully. Since our current implementation never had to handle that, it was never explicitly tested... and unsurprisingly, it doesn't work. It will probably take a good amount of hacking to sort out.

bmcfee commented 4 years ago

Quick update: the problem had to do with pruning empty streams.

bmcfee commented 4 years ago

To clarify my earlier comment: the issue is that if we set n_samples_per_stream = 0 (randomly drawn from binomial or poisson), then a stream can be activated, never produce samples, and then be deactivated and pruned from the pool, even though it could still produce new data.

What do you (all) think about removing the stream pruning behavior? Or at least, changing the default behavior to not prune. I think it's ultimately inconsistent with our design principles that streams should be restartable (that being a distinguishing point compared to iterables).

cjacoby commented 4 years ago

I am pro changing the default behavior in this way.

I think it would be nice to get some more input from other users on how they typically use pescador. I, at least, hardly used the pruning behavior, so this is fine by me, but maybe there are people who rely on it?


bmcfee commented 9 months ago


Since I'm finally getting around to doing some cleanup here for a 3.0 release, I'll quickly summarize how I think this should all go:

  1. Modify the poisson mode so that we do 1 + poisson(λ-1) instead of 1 + poisson(λ). This way, the expected value is λ but we never terminate a streamer without trying to pull from it at least once.
  2. Add a constant mode that always returns (up to) λ samples. Streamers can self-terminate of course, but there would be no randomness from the mux side in this mode.
  3. Add a binomial mode where the number of samples is 1 + Bin((λ-1)/(1-q), 1-q) with q as the (normalized) weight attached to that streamer. This avoids the problem identified earlier (pruning a stream prematurely) while preserving the expected value of λ.
  4. With the above, it won't matter so much if we retain the default (True) for pruning empty streams.
  5. We still need to handle the q=1 case, ie where there is only one active streamer, when in binomial mode. Under the definition above, the binomial would be undefined, but we can appeal to the Poisson limit theorem to replace it by a draw from Poisson(λ-1) as the limiting distribution as q→0 (and n→∞). (Hurray, binomial is still a little fishy, so the name isn't misleading!) I'm happy with this solution, as it behaves consistently and still provides the right kind of randomness in the edge cases.