skuschel / generatorpipeline

Parallelize your data-processing pipelines with just a decorator.
GNU General Public License v3.0

Reservoir Sampling #37

Open skuschel opened 1 year ago

skuschel commented 1 year ago

Inspired by this article, reservoir sampling could be an interesting way to draw a representative sample from a stream and keep it up to date as new items arrive: https://towardsdatascience.com/introduction-to-streaming-algorithms-b71808de6d29 A more technical description: https://en.wikipedia.org/wiki/Reservoir_sampling

This should live in a new accumulator, RandomSample(length=1), where the parameter length sets the sample size. The implementation will probably be quite similar to CacheAccumulator(length).
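
A rough sketch of what the core of such an accumulator might look like, using the basic reservoir sampling scheme ("Algorithm R" in the Wikipedia article). Only the name RandomSample and the parameter length come from this issue; the method name accumulate, the attributes n and sample, and the seed parameter are assumptions made for illustration and do not reflect the existing accumulator interface in this package.

```python
import random


class RandomSample:
    """Hypothetical accumulator that keeps a uniform random sample of a stream."""

    def __init__(self, length=1, seed=None):
        if length < 1:
            raise ValueError("length must be at least 1")
        self.length = length  # sample size, as proposed in the issue
        self.n = 0            # number of items seen so far
        self.sample = []      # the reservoir
        self._rng = random.Random(seed)  # seed is an assumption, for reproducibility

    def accumulate(self, item):
        """Feed one item from the stream (method name is an assumption)."""
        self.n += 1
        if len(self.sample) < self.length:
            # The first `length` items fill the reservoir directly.
            self.sample.append(item)
            return
        # Item number n replaces a random slot with probability length / n.
        j = self._rng.randrange(self.n)
        if j < self.length:
            self.sample[j] = item


acc = RandomSample(length=3, seed=0)
for x in range(1000):
    acc.accumulate(x)
print(acc.sample)  # three items drawn uniformly from the 1000 seen
```

Memory stays at length items regardless of stream size, which is what would make this similar in shape to CacheAccumulator(length): both hold a bounded list, and only the replacement rule differs.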