scrapinghub / exporters

Exporters is an extensible export pipeline library that supports filtering, transformation, and several sources and destinations
BSD 3-Clause "New" or "Revised" License

Add a reservoir sampling filter #317

Closed (eliasdorneles closed this 7 years ago)

eliasdorneles commented 8 years ago

It would be nice to have an easy way of getting random samples from an unbounded stream of data, and I believe a filter that implements a reservoir sampling algorithm and keeps the samples in memory would be a good enough approach for most purposes.

Keeping the samples in memory limits the maximum number of samples, but this is probably okay for an initial implementation, and might even be okay for a long-lasting one. We can change it later to support persisting to disk if necessary, but I have the feeling it won't be needed. :)
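For reference, here is a minimal sketch of the in-memory idea using the classic "Algorithm R" form of reservoir sampling. It is only an illustration, not code from this library: the function name `reservoir_sample` and its parameters are made up for the example.

```python
import random


def reservoir_sample(items, k, rng=None):
    """Return up to k items picked uniformly at random from an iterable.

    The iterable can be of unknown length; only k items are held in memory.
    Hypothetical helper for illustration, not part of exporters.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item replaces a stored one
            # only if the slot falls inside the reservoir (probability k / (i + 1)).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1000000), 10)
```

Memory use is bounded by `k` regardless of how many items go through, which is why the in-memory limit only affects the sample size and not the input size.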

eliasdorneles commented 8 years ago

I was discussing this with @raphapassini and realized that it's not that simple to implement this as a filter. The filter would have to know when the input is finished or when the item limit has been reached in order to know when to "flush" the reservoir.
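To make the problem concrete, here is a rough sketch of what a batch-oriented reservoir filter might look like. Everything in it is an assumption for illustration: `ReservoirFilterSketch`, `filter_batch` and `flush` are hypothetical names and do not describe the actual exporters filter interface or what was eventually released.

```python
import random


class ReservoirFilterSketch:
    """Hypothetical batch filter that holds a reservoir in memory.

    Not the exporters API; it only shows why an end-of-input signal is needed.
    """

    def __init__(self, size, seed=None):
        self.size = size
        self.seen = 0          # number of items observed so far
        self.reservoir = []
        self._rng = random.Random(seed)

    def filter_batch(self, batch):
        for item in batch:
            if self.seen < self.size:
                self.reservoir.append(item)
            else:
                # Same replacement rule as plain reservoir sampling.
                j = self._rng.randint(0, self.seen)
                if j < self.size:
                    self.reservoir[j] = item
            self.seen += 1
        # Nothing can be emitted yet: any stored item may still be replaced.
        return []

    def flush(self):
        # Something outside the filter has to call this once the input ends
        # (or the item limit is hit); the filter cannot detect that by itself.
        sample, self.reservoir = self.reservoir, []
        self.seen = 0
        return sample
```

Every call to `filter_batch` returns an empty list, so without some hook that triggers `flush` at the end, nothing would ever reach the writer.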

So this needs a bit more discussion. Some possible approaches I can think of are:

The tradeoffs between the two aren't clear to me, so I'm not sure what's best.

kalessin commented 7 years ago

Available since version 0.6.13.