cploonker opened this issue 6 years ago
Can you provide some example use cases? It's not clear to me what kind of use case would require an approximate top N, or how using approximation would help with the memory footprint. Thanks!
@rongrong thanks for your attention.
Let's say we have a huge table where each row represents a domain and the user visiting it. If I only want to know which domains are visited the most, and how many times each is visited, this function would make that kind of query easy. Now imagine wanting the same result, but broken down by country.
About the memory: as described in the logic above, because we use a count-min sketch the memory footprint is greatly reduced. Here is a link showing how roughly 40 MB of raw data can be summarized in about 48 KB with a count-min sketch: https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
Hope this answers your question, and please feel free to let me know if I can answer any more questions.
@rongrong - Similar to cploonker's feedback
max_by(,,N) is a reasonably common pattern. For datasets with very high cardinality (and usually a highly skewed distribution), being able to identify the top items is important (e.g. top IDs, words, URLs). The current implementation excludes many workloads from using Presto due to memory constraints. A small error rate on these datasets is an acceptable compromise, and this seems consistent with the other HLL-based approx functions.
I also think this would be useful. Suppose I have a table of errors and the products in which those errors were experienced; it is then very useful to see the top 4 or 5 most commonly occurring errors for each product. Generally speaking, this is a more advanced version of MODE, which is also not available out of the box. There are ways to obtain it, but the approximation dramatically improves speed. I would like this.
I would comment that I think the name should change: APPROX should come at the end of the function name. Something like MOST_FREQUENT_OCCURRING_APPROX has a more obvious meaning.
This is a much-needed feature and will be super useful for data engineers in the field who rely heavily on Presto.
@tompetrillo, I would let the Presto team decide the function name so it aligns with their naming conventions. I don't have a strong opinion about the name.
The above algorithm will work in specific cases where we know the incoming dataset and can carefully tune our epsilon and delta values. In general, however, it seems an undue burden to ask a user to supply good values for delta and epsilon. The cost of getting them wrong is steep: a saturated CMS will falsely report heavy hitters with high probability. The function could throw once the sketch is saturated past some point, but that increases the overhead and the burden on the user, and might also push people to use larger sketch sizes than necessary. The holy grail is an algorithm that adapts to the input while still preserving lossless merging of intermediate aggregate states (as Presto will require), without data-quality compromises for unskewed distributions.
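(As an aside, whichever sketch is chosen, intermediate states built with the same parameters do merge losslessly: merging two count-min sketches with identical width, depth, and hash functions is just an element-wise sum of their counter matrices. A minimal Python illustration, with invented names and not Presto code:)

```python
def merge_cms(counts_a, counts_b):
    """Element-wise sum of two depth x width count-min counter matrices.
    Assumes both sketches were built with the same width, depth, and hash
    functions; the merged sketch equals the one that would have been built
    over the combined input."""
    assert len(counts_a) == len(counts_b)
    merged = []
    for row_a, row_b in zip(counts_a, counts_b):
        assert len(row_a) == len(row_b)
        merged.append([a + b for a, b in zip(row_a, row_b)])
    return merged
```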
I just want to +1 this. This is a useful pattern that shouldn't require a subquery or additional CTE.
Presto aggregate function: APPROX_HEAVY_HITTERS(A, min_percent_share, ε, δ) -> MAP(K, V)
- A = column of the table, i.e. the entire array of values.
- n = total number of values (rows) in A.
- min_percent_share = user-provided parameter. The values returned must have at least this share of all values processed; e.g. min_percent_share = 10 means return only those heavy hitters whose occurrence is at least 10% of the overall volume.
- ε = error bound such that counts are overestimated by at most εn. Default value = 0.01, or 1/(2k), or min_percent_share/200.
- δ = probability that a count is overestimated by more than the error bound εn. Default value = 0.01.
- MAP(K, V) = map of heavy-hitter values as keys and their occurrence counts as values.
- k = variable used in the referenced paper; min_percent_share = 100/k.
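To make the parameter relationships concrete, here is a small, hypothetical Python sketch (not part of the proposal) showing how ε and δ could translate into count-min sketch dimensions using the standard sizing from the literature, and how k relates to min_percent_share; the function names are invented:

```python
import math

def cms_dimensions(eps, delta):
    """Standard count-min sketch sizing: additive error at most eps * n
    with probability at least 1 - delta."""
    width = math.ceil(math.e / eps)           # counters per row
    depth = math.ceil(math.log(1.0 / delta))  # rows, one hash function each
    return width, depth

def k_from_share(min_percent_share):
    """k as used in the referenced paper: min_percent_share = 100 / k."""
    return 100.0 / min_percent_share

# Example: min_percent_share = 10 gives eps = min_percent_share / 200 = 0.05;
# with delta = 0.01 this yields a 5 x 55 sketch (275 counters), independent of n.
print(cms_dimensions(0.05, 0.01))  # (55, 5)
print(k_from_share(10))            # 10.0
```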
Example use case
Let's say there is a table where each record represents a visitor and the domain they visited. This function can be used to get the top domains by visit count, along with their approximate counts. It can be even more valuable for finding the top domains per country.
Algorithm
For complete background on the algorithm, refer to the heavy hitters lecture notes at http://theory.stanford.edu/~tim/s17/l/l2.pdf
Data structures to hold the data
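As a rough, non-authoritative illustration of the count-min-sketch-plus-candidates approach from the notes linked above, the structures could look like the following (Python, all names invented; not the proposed Presto implementation):

```python
import hashlib

class CountMinSketch:
    """Illustrative count-min sketch: a depth x width matrix of counters
    with one hash function per row."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.counts = [[0] * width for _ in range(depth)]

    def _bucket(self, row, item):
        # Derive a per-row hash from a cryptographic digest (simple but slow;
        # a real implementation would use cheap pairwise-independent hashes).
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, amount=1):
        for row in range(self.depth):
            self.counts[row][self._bucket(row, item)] += amount

    def estimate(self, item):
        # Each row only ever over-counts, so take the minimum across rows.
        return min(self.counts[row][self._bucket(row, item)]
                   for row in range(self.depth))

# Alongside the sketch: the current candidate heavy hitters (value -> estimated
# count) and the number of values seen so far (n in the description above).
candidates = {}
total_seen = 0
```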
Logic to add elements into the above data structures:
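Continuing the illustrative Python sketch above (again, a sketch of the general approach rather than the proposed implementation), adding an element and producing the final MAP(K, V) could look roughly like this:

```python
def add_value(sketch, candidates, value, total_seen, min_percent_share):
    """Update the sketch with one value and track it as a candidate heavy
    hitter if its estimated count clears the current share threshold."""
    total_seen += 1
    sketch.add(value)
    threshold = total_seen * min_percent_share / 100.0
    estimate = sketch.estimate(value)
    if estimate >= threshold:
        candidates[value] = estimate
    return total_seen

def heavy_hitters(sketch, candidates, total_seen, min_percent_share):
    """Final output: drop candidates that no longer clear the threshold
    (they may have fallen below it as total_seen grew) and return the
    MAP(K, V) of heavy-hitter values to estimated counts."""
    threshold = total_seen * min_percent_share / 100.0
    return {v: sketch.estimate(v) for v in candidates
            if sketch.estimate(v) >= threshold}

# Tiny usage example:
sketch = CountMinSketch(width=55, depth=5)
candidates, total_seen = {}, 0
for domain in ["a.com", "a.com", "b.com", "a.com"]:
    total_seen = add_value(sketch, candidates, domain, total_seen,
                           min_percent_share=50)
print(heavy_hitters(sketch, candidates, total_seen, 50))
# {'a.com': 3} -- a.com is 3 of 4 values, i.e. 75% >= 50%
```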
Error bounds
Counts are overestimated by at most εn, except with a small failure probability δ.
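Stated more formally (this is the standard count-min sketch guarantee, not something specific to this proposal): for a value x with true count c(x) and estimated count ĉ(x) over n processed values,

```latex
\[
  c(x) \;\le\; \hat{c}(x)
  \qquad \text{and} \qquad
  \Pr\bigl[\hat{c}(x) > c(x) + \varepsilon n\bigr] \;\le\; \delta ,
\]
% achieved with a sketch of width $w = \lceil e/\varepsilon \rceil$ and
% depth $d = \lceil \ln(1/\delta) \rceil$, i.e. $w \cdot d$ counters in total,
% independent of the number of rows n.
```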
Why not top K elements
Resources: