nathanmarz / cascalog

Data processing on Hadoop without the hassle.

Add reservoir sampling operator #47

Closed by nathanmarz 12 years ago

nathanmarz commented 12 years ago

http://en.wikipedia.org/wiki/Reservoir_sampling

This should be implemented as a parallelbuf. It would be useful for jobs like terasort: sample 10K tuples to determine partition boundaries, then do the sort in a subquery.
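For anyone unfamiliar with the technique, here is a minimal single-machine sketch in Python of classic reservoir sampling (Algorithm R), followed by the terasort-style step of turning a sample into partition boundaries. It is not Cascalog code: the names `reservoir_sample` and `partition_boundaries` are made up for illustration, and the even-quantile rule for picking boundaries is an assumption rather than anything specified in this issue.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return up to k items drawn uniformly at random from an iterable.

    Works in one pass and O(k) memory, so the stream length need not
    be known in advance.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def partition_boundaries(sample, num_partitions):
    """Pick num_partitions - 1 split points at even quantiles of the sample."""
    ordered = sorted(sample)
    n = len(ordered)
    if n == 0:
        return []
    return [ordered[(n * p) // num_partitions] for p in range(1, num_partitions)]


if __name__ == "__main__":
    sample = reservoir_sample(range(1_000_000), 10_000, random.Random(42))
    print(partition_boundaries(sample, 4))
```

The boundaries are only as good as the sample, which is why a uniform sample matters here: skewed sampling would give unevenly sized sort partitions.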

sritchie commented 12 years ago

More info for anyone interested in tackling this: http://blogs.msdn.com/b/spt/archive/2008/02/05/reservoir-sampling.aspx

And here's an example defparallelbuf:

https://github.com/nathanmarz/cascalog/blob/master/src/clj/cascalog/ops.clj#L91

And here are the vars used by limit:

https://github.com/nathanmarz/cascalog/blob/master/src/clj/cascalog/ops_impl.clj#L43
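One way to think about the parallel version, sketched in Python rather than Clojure: assuming a parallel buffer boils down to an init step per tuple, a combine step that merges two partial results, and a final extract step, the sampler needs partial results that merge in any order. Tagging each tuple with a random number and keeping the `k` smallest tags has that property and still yields a uniform sample. The function names below are hypothetical, and this is not necessarily how `limit` or the eventual operator is written.

```python
import heapq
import random


def init_sample(item, rng):
    """Wrap one tuple as a partial sample: a list of (random_tag, item) pairs."""
    return [(rng.random(), item)]


def combine_samples(a, b, k):
    """Merge two partial samples, keeping the k pairs with the smallest tags.

    Because the result depends only on the tags, merging is associative
    and commutative, so partial results can be combined in any order.
    """
    return heapq.nsmallest(k, a + b, key=lambda pair: pair[0])


def extract(sample):
    """Drop the random tags and return the sampled items."""
    return [item for _, item in sample]


if __name__ == "__main__":
    rng = random.Random(7)
    partials = [init_sample(x, rng) for x in range(1000)]
    merged = partials[0]
    for part in partials[1:]:
        merged = combine_samples(merged, part, 10)
    print(extract(merged))
```

The trade-off against Algorithm R is one extra random number stored per retained tuple, in exchange for a combine step that needs no knowledge of how many tuples each partial result has seen.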

sritchie commented 12 years ago

Java impl:

https://github.com/codahale/metrics/blob/master/metrics-core/src/main/java/com/yammer/metrics/stats/UniformSample.java

(Thanks, @sorenmacbeth!)

sorenmacbeth commented 12 years ago

I've submitted pull request #61, which implements this.

sritchie commented 12 years ago

Boom, added! Thanks to @sorenmacbeth and @nathanmarz.