Closed: nathanmarz closed this issue 12 years ago
More info for anyone interested in tackling this: http://blogs.msdn.com/b/spt/archive/2008/02/05/reservoir-sampling.aspx
And here's an example defparallelbuf:
https://github.com/nathanmarz/cascalog/blob/master/src/clj/cascalog/ops.clj#L91
And here are the vars used by limit:
https://github.com/nathanmarz/cascalog/blob/master/src/clj/cascalog/ops_impl.clj#L43
Java impl:
(Thanks, @sorenmacbeth!)
I've submitted pull request #61, which implements this.
Boom, added! Thanks to @sorenmacbeth and @nathanmarz.
http://en.wikipedia.org/wiki/Reservoir_sampling
This should be implemented as a parallelbuf. It will be useful for things like terasort: sample 10K tuples to determine partition boundaries, then do the sort in a subquery.
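For anyone picking this up, here is a rough Java sketch of one way a fixed-size sample can be made combinable, which is what a parallel buffer needs. It is not Cascalog's API and not the implementation from the pull request. It also swaps classic reservoir sampling (Algorithm R) for a "random key" variant: each tuple gets a random key and only the k smallest keys are kept, so two partial samples merge by simply keeping the k smallest keys of both. The class name `MergeableSample` and the `boundaries` helper are hypothetical names made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

/** Hypothetical sketch (not Cascalog code): a fixed-size uniform sample that can be merged. */
final class MergeableSample<T> {
    /** A value paired with the random key that decides whether it stays in the sample. */
    private static final class Tagged<V> {
        final double key;
        final V value;
        Tagged(double key, V value) { this.key = key; this.value = value; }
    }

    private final int capacity;
    private final Random rng;
    // Max-heap on key: the head is the entry to evict when a smaller key shows up.
    private final PriorityQueue<Tagged<T>> heap;

    MergeableSample(int capacity, Random rng) {
        if (capacity <= 0) throw new IllegalArgumentException("capacity must be positive");
        this.capacity = capacity;
        this.rng = rng;
        this.heap = new PriorityQueue<>(
            capacity, Comparator.comparingDouble((Tagged<T> t) -> t.key).reversed());
    }

    /** Map side: tag each incoming value with a fresh random key. */
    void add(T value) { offer(new Tagged<>(rng.nextDouble(), value)); }

    /** Combine step: fold another partial sample into this one, reusing its keys. */
    void merge(MergeableSample<T> other) {
        for (Tagged<T> t : other.heap) offer(t);
    }

    private void offer(Tagged<T> t) {
        if (heap.size() < capacity) {
            heap.add(t);
        } else if (t.key < heap.peek().key) {
            heap.poll();   // drop the entry with the largest key
            heap.add(t);
        }
    }

    /** Final step: the sampled values, in no particular order. */
    List<T> values() {
        List<T> out = new ArrayList<>(heap.size());
        for (Tagged<T> t : heap) out.add(t.value);
        return out;
    }

    /** Terasort-style use: pick (partitions - 1) cut points from a sample. */
    static <E extends Comparable<E>> List<E> boundaries(List<E> sample, int partitions) {
        List<E> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        List<E> cuts = new ArrayList<>();
        if (sorted.isEmpty()) return cuts;
        for (int i = 1; i < partitions; i++) {
            cuts.add(sorted.get(i * sorted.size() / partitions));
        }
        return cuts;
    }
}
```

Because the keys are independent of which mapper saw the tuple, merging partial samples in any order should give the same distribution as sampling the whole input in one pass. The `boundaries` helper only shows the idea of turning a sample into partition cut points; a real terasort would feed those into whatever partitioner the sort subquery uses.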