spotify / scio

A Scala API for Apache Beam and Google Cloud Dataflow.
https://spotify.github.io/scio
Apache License 2.0
2.55k stars 512 forks source link

Approximate top-K frequent items #2161

Open nevillelyh opened 5 years ago

nevillelyh commented 5 years ago

@idreeskhan pointed me to Space Saving and other variants, approximate algorithms that can answer top K items & frequencies. Could be nice to have.

Right now we have Count-Min Sketch from Algebird that can only answer approx freq of item X, and exact top N with countByValue.top(k, Ordering.by(_._2)).

http://www.cse.ust.hk/~raywong/comp5331/References/EfficientComputationOfFrequentAndTop-kElementsInDataStreams.pdf https://zliu.org/post/topk-in-stream/ https://www.cs.rice.edu/~as143/Papers/topkapi.pdf

anish749 commented 5 years ago

It would be nice to ensure / warn about cardinality as well, because (last time I checked) the Space Saving algo and other top k approximations work for low cardinality data. The results can be quite misleading with high cardinality data.