pescadores / pescador

Stochastic multi-stream sampling for iterative learning
https://pescador.readthedocs.io
ISC License

StochasticMux: Understanding the combination of `rate` and `weights` #161

Closed faroit closed 3 years ago

faroit commented 3 years ago

Given a simple toy scenario of three generators...

import collections

import matplotlib.pyplot as plt
import pescador

def max_n(x, n):
    for i in range(n):
        yield x

# Create three streamers: streamer n yields the string str(n) exactly n*10 times
streamers = [
    pescador.Streamer(
        max_n,
        str(n),
        n * 10
    )
    for n in range(1, 4)
]

Getting all samples until exhaustion:

mux = pescador.StochasticMux(
    streamers,
    len(streamers),
    rate=None,
    mode="exhaustive"
)

coll = collections.OrderedDict(
    collections.Counter("".join(mux(max_iter=60))).most_common()
)
plt.bar(coll.keys(), coll.values())
plt.show()

[figure: bar chart of sample counts per streamer, unweighted]

Now, when I want to add weighting to the sampling process, e.g.

mux = pescador.StochasticMux(
    streamers,
    len(streamers),
    rate=None,
    weights=[0.1, 0.1, 0.8],
    mode="single_active"
)

results are as expected...

[figure: bar chart of sample counts with weights [0.1, 0.1, 0.8]]

Now, if I wanted to limit the number of samples drawn from a streamer, I assumed that's what the rate parameter would do (on average):

mux = pescador.StochasticMux(
    streamers,
    len(streamers),
    rate=20,
    weights=[0.1, 0.1, 0.8],
    mode="single_active"
)

However, I get:

[figure: bar chart of sample counts with rate=20, deviating from the weights]

I'm not sure I correctly understand the rate parameter's role in the sampling process... I would be very happy for a quick pointer @bmcfee @cjacoby 🙏

bmcfee commented 3 years ago

I think you're seeing small-sample effects here: remember that rate gives you a random (Poisson) number of samples for each streamer. Since it's Poisson-distributed, the higher the rate, the higher the variance, and the numbers you report look not atypical.

If you run it longer (say max_iter=6000) does it converge?

faroit commented 3 years ago

If you run it longer (say max_iter=6000) does it converge?

[figure: bar chart with max_iter=6000, converging toward the weights]

I guess what I was looking for is something different: I want the weights, plus, once one of the streamers hits a maximum number of samples, it should be exhausted. What would be the best way to accomplish that using pescador?

bmcfee commented 3 years ago

What would be the best way to accomplish that using pescador?

If you want a hard limit on the number of samples per streamer, you could build that into the streamer itself rather than relying on the mux to do it for you.
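One way to build the hard limit into the data generator itself, as suggested above. This is a minimal sketch, not pescador's API: `capped` is an illustrative helper wrapping any generator function with `itertools.islice`.

```python
import itertools

def capped(gen_fn, limit, *args):
    """Illustrative wrapper: yield at most `limit` samples from gen_fn(*args)."""
    yield from itertools.islice(gen_fn(*args), limit)

def max_n(x, n):
    for i in range(n):
        yield x

# The wrapped generator exhausts after `limit` samples, even though
# the underlying generator could produce many more.
samples = list(capped(max_n, 5, "a", 1000))
```

A streamer built from `capped` exhausts exactly when the cap is hit, which composes naturally with `mode="exhaustive"` on the mux.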

faroit commented 3 years ago

If you want a hard limit on the number of samples per streamer, you could build that into the streamer itself rather than relying on the mux to do it for you.

understood...

just out of interest, was this a feature that was meant to be part of the Streamer object (it's still in the docs as max_items)?

[screenshot: Streamer docstring referring to max_items]

bmcfee commented 3 years ago

Yup! It is implemented by Streamer, so you can do it that way too. Sometimes it's easier to write it directly into the underlying data generator, and sometimes it's easier to put hard limiting in as a streamer parameter. Both should work.

faroit commented 3 years ago

so it's max_iter, not max_items...

https://github.com/pescadores/pescador/blob/dff2c75e5cbfaa5b03c7fd94ccfc546658bed600/pescador/core.py#L184

I guess the docstring needs to be updated. I can add a PR.

bmcfee commented 3 years ago

whoops -- good catch! A PR would be most appreciated.

faroit commented 3 years ago

will do. Works fine now. Before closing this out: I still don't understand the value of the rate parameter in practice. Can you give a very brief example? Assume I have k infinite generators from which I want to sample equally (or by weight) at most max_iter=n samples, and no generator should be restarted once it is exhausted...

As I understand it, it's only useful for out-of-core sampling, right?

faroit commented 3 years ago

@bmcfee can you confirm my understanding? this can be closed then

bmcfee commented 3 years ago

Can you give a very brief example? Assume I have k infinite generators from which I want to sample equally (or by weight) at most max_iter=n samples, and no generator should be restarted once it is exhausted...

You probably wouldn't use the rate parameter in this context.

The rate parameter was originally intended for the case where you have a large set of potential streams, but only the resources to keep some small number k of streams active at a time. So you pick k streams at random, draw samples from them, and occasionally evict an active stream and pull in a new one. Now, you could do this at a fixed rate (e.g., evict a stream once it produces n samples), but we wanted to avoid bursts of many streams being replaced in a short span of time. The rate parameter defines a Poisson distribution, so each stream is bounded by a different (random) number of samples, and this helps spread out the evictions. (NB: the analysis of this is a real pain; I have a draft writeup of an analysis using a negative binomial distribution instead of a Poisson. That is much more tractable to make sense of, but the two distributions are practically pretty close anyway.)
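The staggering effect described above can be illustrated directly: each activated stream gets an independent Poisson sample budget, and because a Poisson distribution's variance equals its mean, the budgets spread out rather than all expiring at once. A minimal numpy sketch (the seed and sample count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 20

# Simulate the per-stream sample budgets that a Poisson rate would assign.
budgets = rng.poisson(rate, size=10_000)

# For a Poisson distribution, mean and variance both equal `rate`,
# so individual budgets vary widely and stream evictions stay staggered.
print(budgets.mean(), budgets.var())
```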

As I understand it, it's only useful for out-of-core sampling, right?

OOC is the main application for active-set sampling, but one could imagine others as well (e.g., curriculum learning).

faroit commented 3 years ago

@bmcfee Okay, I got it now. Thanks for the detailed explanation! 🎅