Closed faroit closed 3 years ago
I think you're seeing small-sample effects here: remember that `rate` gives you a random (Poisson) number of samples for each streamer. Since it's Poisson distributed, the variance equals the rate, so the higher the rate, the higher the variance, and it looks like the numbers you report are not atypical.
If you run it longer (say max_iter=6000) does it converge?
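To make the variance point concrete, here is a minimal sketch using only the standard library. It does not use pescador at all; `poisson_draw` and `summarize` are hypothetical helper names, and the only thing it illustrates is that Poisson counts spread out more as the rate grows.

```python
import math
import random
from statistics import mean, pvariance


def poisson_draw(rate, rng):
    """Draw one Poisson(rate) sample using Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def summarize(rate, n_streamers=20000, seed=0):
    """Return (mean, variance) of simulated per-streamer sample counts."""
    rng = random.Random(seed)
    counts = [poisson_draw(rate, rng) for _ in range(n_streamers)]
    return mean(counts), pvariance(counts)


for rate in (4, 16, 64):
    m, v = summarize(rate)
    # Both the mean and the variance should land close to the rate itself.
    print(f"rate={rate:3d}  mean={m:6.2f}  variance={v:6.2f}")
```

With only a handful of streamers and a short run, individual counts can sit well away from the rate, which is consistent with the small-sample explanation above.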
I guess what I was looking for is something different: I want the weights, plus, if one of the streamers hits a maximum number of samples, it should be exhausted. What would be the best way to accomplish that using pescador?
If you want a hard limit on the number of samples per streamer, you could build that into the streamer itself rather than relying on the mux to do it for you.
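A minimal sketch of building the hard limit into the data generator itself, in plain Python. `noise_stream` and `limited` are hypothetical names for illustration; the wrapped generator could then be handed to whatever consumes it.

```python
import itertools


def noise_stream(label):
    """Hypothetical infinite generator: yields dicts forever."""
    for i in itertools.count():
        yield {"label": label, "index": i}


def limited(generator, max_samples):
    """Wrap any generator so it is exhausted after max_samples items."""
    return itertools.islice(generator, max_samples)


samples = list(limited(noise_stream("a"), 5))
print([s["index"] for s in samples])  # prints [0, 1, 2, 3, 4]
```

Because the cap lives in the generator, whatever iterates over it simply sees the stream end after five items.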
Understood...
Just out of interest, was this a feature that was meant to be part of the `Streamer` object (still in the docs as `max_items`)?
Yup! It is implemented by Streamer, so you can do it that way too. Sometimes it's easier to write it directly into the underlying data generator, and sometimes it's easier to put hard limiting in as a streamer parameter. Both should work.
So it's `max_iter`, not `max_items`...
I guess the docstring needs to be updated. I can add a PR.
whoops -- good catch! A PR would be most appreciated.
Will do. Works fine now. Before closing this out: I still don't understand the value of the `rate` parameter in practice. Can you give a very brief example, assuming I have `k` infinite generators from which I want to sample equally (or weighted), up to a maximum of `max_iter=n` samples? Also, no generator should be restarted once it is exhausted...
As I understand it, it's only useful for out-of-core sampling, right?
@bmcfee Can you confirm my understanding? This can be closed then.
> Can you give a very brief example assuming i have `k` infinite generators from which I want to equally (or weighted) sample `max_iter=n` number of samples maximum. Also no generator should be restarted once it is exhausted...
You probably wouldn't use the rate parameter in this context.
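For the scenario in the question (a few infinite generators, weighted sampling, a per-generator cap, no restarts), the idea can be sketched in plain Python without any `rate`-style parameter. This is not pescador's mux; `constant_stream` and `weighted_capped_mux` are hypothetical names, and the sketch only shows the sampling logic.

```python
import itertools
import random
from collections import Counter


def constant_stream(label):
    """Hypothetical infinite generator that always yields its own label."""
    while True:
        yield label


def weighted_capped_mux(streams, weights, max_per_stream, seed=None):
    """Sample from iterators in proportion to weights.

    Each stream is capped at max_per_stream items and is never restarted.
    """
    rng = random.Random(seed)
    active = [itertools.islice(s, max_per_stream) for s in streams]
    active_weights = list(weights)
    while active:
        idx = rng.choices(range(len(active)), weights=active_weights)[0]
        try:
            yield next(active[idx])
        except StopIteration:
            # This stream hit its cap: drop it for good. The remaining
            # weights are used as-is, which renormalizes them implicitly.
            del active[idx]
            del active_weights[idx]


streams = [constant_stream(name) for name in "abc"]
out = list(weighted_capped_mux(streams, [0.6, 0.3, 0.1], max_per_stream=20, seed=0))
print(len(out), Counter(out))
```

Every stream eventually contributes exactly its cap, so the weights only shape the order in which samples arrive, not the final totals.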
The `rate` parameter was originally intended for the case where you have a large set of potential streams, but only enough resources to keep some small number `k` of streams active at a time. So you pick `k` streams at random, draw samples from them, and occasionally evict an active stream and pull in a new one. Now, you could do this at a fixed rate (e.g., evict a stream once it produces `n` samples), but we wanted to avoid bursts of many streams being replaced in a short span of time. The `rate` parameter defines a Poisson distribution, so that each stream is bounded by a different (random) number of samples, and this helps spread out the evictions. (NB: the analysis of this is a real pain; I have a draft writeup of an analysis if you use a negative binomial distribution instead of Poisson. This is much more tractable to make sense of, but the two distributions are practically pretty close anyway.)
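The active-set idea described above can be sketched as a small simulation. This is a simplified illustration, not pescador's implementation: streams are represented only by their integer ids, `active_set_sampler` is a hypothetical name, and the `1 +` in the budget (so that every activated stream yields at least one sample) is an assumption of this sketch.

```python
import math
import random


def poisson_draw(rate, rng):
    """Draw one Poisson(rate) sample using Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def active_set_sampler(n_streams, k, rate, n_samples, seed=0):
    """Keep k streams active; evict each after a Poisson-distributed budget."""
    rng = random.Random(seed)

    def activate():
        # Pick a stream at random and give it a random sample budget.
        stream_id = rng.randrange(n_streams)
        budget = 1 + poisson_draw(rate, rng)  # assumption: at least one sample
        return [stream_id, budget]

    active = [activate() for _ in range(k)]
    samples, evictions = [], []
    for step in range(n_samples):
        slot = rng.randrange(k)          # choose one active stream uniformly
        samples.append(active[slot][0])  # "draw" a sample from it
        active[slot][1] -= 1
        if active[slot][1] == 0:
            # Budget used up: evict this stream and pull in a new one.
            evictions.append(step)
            active[slot] = activate()
    return samples, evictions


samples, evictions = active_set_sampler(n_streams=100, k=5, rate=16, n_samples=1000)
print(f"{len(set(samples))} distinct streams seen, {len(evictions)} evictions")
```

Because each stream gets its own random budget, the recorded eviction steps are scattered through the run rather than arriving in synchronized bursts.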
> As i understand its only useful for out-of-core sampling, right?
OOC is the main application for active set sampling, but one could imagine others as well (e.g., curriculum learning).
@bmcfee Okay, I got it now. Thanks for the detailed explanation! 🎅
given a simple toy scenario of 3 generators...
getting all samples until exhaustion:
Now, when I want to add weighting to the sampling process, e.g.
results are as expected...
Now, if I wanted to limit the number of samples drawn from a streamer, I was assuming that's what the `rate` parameter would do (on average):
However, I get:
`max_iter=60` for the last two examples?
I'm not sure I understand the `rate` parameter's role in the sampling process correctly... I would be very grateful for a quick pointer, @bmcfee @cjacoby 🙏