twitter / algebird

Abstract Algebra for Scala
https://twitter.github.io/algebird
Apache License 2.0
2.29k stars 345 forks source link

implement AMS Sketch (great for join size estimation) #563

Open sritchie opened 7 years ago

sritchie commented 7 years ago

Paper: http://dimacs.rutgers.edu/~graham/pubs/papers/encalgs-ams.pdf

From the abstract:

The AMS Sketch is focused on approximating the sum of squared entries of a vector defined by a stream of updates. This quantity is naturally related to the Euclidean norm of the vector, and so has many applications in high-dimensional geometry, and in data mining and machine learning settings that use vector representations of data.

The section of this paper called "Using the AMS Sketch to estimate inner products" is the key use case here. We can use the AMS Sketch to estimate join sizes between tables in Scalding, SQL etc.

The section of that paper called "Comparing AMS and Count-Min sketches for join size estimation" makes the case that AMS sketches are better than CMS for this task.

sritchie commented 7 years ago

Seems related to #362 ?