prestodb / presto

APPROX_HEAVY_HITTERS: Values that have a percent_share of total rows above min_percent_share #11807

Open cploonker opened 5 years ago

cploonker commented 5 years ago

Presto aggregate function: APPROX_HEAVY_HITTERS(A, min_percent_share, ε, δ) -> MAP(K, V)

- A = column of the table; in other words, the entire array of values.
- n = total number of values (rows) in A.
- min_percent_share = user-provided parameter. The values returned must account for at least this share of all the values processed. In other words, min_percent_share = 10 means return only those heavy hitters whose occurrence count is at least 10% of the overall volume.
- ε = error bound such that counts are overestimated by at most εn. Default value = 0.01, or 1/2k, or min_percent_share/200.
- δ = probability that a count is overestimated by more than the error bound εn. Default value = 0.01.
- MAP(K, V) = map of heavy-hitter values as keys to their occurrence counts as values.
- k = variable used in the referenced paper; min_percent_share = 100/k.

  1. Every value that occurs at least n/k times is in the list. No false negatives.
  2. Worst case (for the default parameters), values that occur at least n/2k times but fewer than n/k times can make it into the list with probability δ. A chance of false positives.

Example use case

Let's say there is a table where each record represents a visitor and the corresponding domain visited. This function can be useful to get the top domains by visit count, along with approximate visit counts. It can be even more valuable for finding the top domains by country.
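
For illustration, assuming the proposed signature above, the per-country query might look like `SELECT country, APPROX_HEAVY_HITTERS(domain, 1) FROM visits GROUP BY country` (the table and column names here are hypothetical).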

Algorithm

For complete background on the algorithm, refer to the heavy-hitters lecture notes: http://theory.stanford.edu/~tim/s17/l/l2.pdf

Data structures to hold the data

  1. Maintain a standard count-min sketch during the scan of the data set and put all elements into it: https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/frequency/ConservativeAddSketch.java
     - b = e/ε: number of buckets (width) in the count-min sketch, where e ≈ 2.718 is the base of the natural log.
     - l ≥ ln(1/δ): number of hash functions (depth) in the count-min sketch.
  2. A counter m of the total number of elements processed so far.
  3. A heap of the heavy-hitter values, i.e. those which occur at least m/k times, where m is the number of values processed so far, e.g. backed by https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/execution/resourceGroups/IndexedPriorityQueue.java (a minimal sketch of these structures follows below).
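
A minimal, self-contained Java sketch of the count-min-sketch state, for illustration only. The class and method names are hypothetical; a real implementation could instead reuse stream-lib's ConservativeAddSketch linked above.

```java
import java.util.Random;

// Illustrative conservative-update count-min sketch; not Presto's API.
public class ConservativeCountMinSketch
{
    private final int width;      // b = ceil(e / eps) buckets per row
    private final int depth;      // l = ceil(ln(1 / delta)) hash rows
    private final long[][] counts;
    private final int[] seeds;    // one hash seed per row

    public ConservativeCountMinSketch(double eps, double delta)
    {
        this.width = (int) Math.ceil(Math.E / eps);
        this.depth = (int) Math.ceil(Math.log(1 / delta));
        this.counts = new long[depth][width];
        this.seeds = new Random(7).ints(depth).toArray(); // fixed seed so two sketches hash identically
    }

    // Toy per-row hash; a real sketch would use pairwise-independent hash functions.
    private int bucket(String item, int row)
    {
        return Math.floorMod(item.hashCode() * 31 + seeds[row], width);
    }

    // Inc(x) with the conservative update rule: raise only the counters that must
    // rise to keep Count(x) >= the true count, which reduces overestimation.
    public void add(String item)
    {
        long target = estimateCount(item) + 1;
        for (int row = 0; row < depth; row++) {
            int b = bucket(item, row);
            counts[row][b] = Math.max(counts[row][b], target);
        }
    }

    // Count(x): the minimum across rows is the tightest upper bound on the true count.
    public long estimateCount(String item)
    {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, counts[row][bucket(item, row)]);
        }
        return min;
    }
}
```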

Logic to add elements into the above data structures:

  1. Put the element into the sketch: Inc(x), followed by Count(x).
  2. If Count(x) >= m * min_percent_share / 100, then add/update x in the heavy-hitters heap. Here m = number of rows processed so far.
  3. Whenever m grows to the point that some object x stored in the heap falls below min_percent_share (checkable in O(1) time via Find-Min), delete x from the heap (via Extract-Min). After finishing the pass, output all of the objects in the heap (see the sketch after this list).
  4. If an element which was dropped from the heavy-hitters list re-appears and crosses min_percent_share, then it will make it into the heavy-hitters list again.
  5. The heap contains at most 2k elements at all times, as per the referenced paper. Note k = 100/min_percent_share.
  6. The “no large errors” assumption implies an approximate guarantee: every object x in the heap has true frequency at least n/k − εn = n/2k (with the default ε = 1/2k); other objects would have been deleted from the heap by the end of the pass. If the count-min sketch makes large errors on a few objects, those objects might erroneously appear in the final output.
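
A minimal Java sketch of this add/evict loop, building on the hypothetical ConservativeCountMinSketch above. Again, this is illustrative, not Presto's accumulator API; java.util.PriorityQueue stands in for the IndexedPriorityQueue linked earlier (its O(n) remove would be O(log n) with an indexed heap).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Illustrative accumulator for steps 1-6; all names are hypothetical.
public class ApproxHeavyHittersAccumulator
{
    private final ConservativeCountMinSketch sketch;
    private final double minPercentShare;
    private final Map<String, Long> candidates = new HashMap<>(); // value -> latest estimate
    private final PriorityQueue<String> minHeap;                  // min-first by latest estimate
    private long m;                                               // rows processed so far

    public ApproxHeavyHittersAccumulator(double minPercentShare, double eps, double delta)
    {
        this.minPercentShare = minPercentShare;
        this.sketch = new ConservativeCountMinSketch(eps, delta);
        this.minHeap = new PriorityQueue<>((a, b) -> Long.compare(candidates.get(a), candidates.get(b)));
    }

    public void add(String value)
    {
        m++;
        sketch.add(value);                                        // step 1: Inc(x)
        long count = sketch.estimateCount(value);                 // step 1: Count(x)
        long threshold = (long) Math.ceil(m * minPercentShare / 100);
        if (count >= threshold) {                                 // step 2: add/update x
            minHeap.remove(value);                                // re-insert under the new key
            candidates.put(value, count);
            minHeap.add(value);
        }
        // step 3: evict candidates that fell below the threshold as m grew;
        // since the heap min is the smallest stored estimate, this clears all of them
        while (!minHeap.isEmpty() && candidates.get(minHeap.peek()) < threshold) {
            candidates.remove(minHeap.poll());
        }
    }

    // After the pass: the surviving candidates with their approximate counts.
    public Map<String, Long> output()
    {
        return candidates;
    }
}
```

Note that step 4 falls out naturally: a dropped value that re-appears and crosses the threshold re-enters through the `count >= threshold` check.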

Error bounds

Counts are overestimated by at most εn, except with a small probability δ.
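
For illustration with the defaults ε = δ = 0.01: the sketch needs b = ⌈e/0.01⌉ = 272 buckets per row and l = ⌈ln(1/0.01)⌉ = 5 rows, i.e. 1360 counters in total regardless of how many rows are scanned, and the probability that a returned count exceeds the true count by more than 0.01·n is at most 0.01.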

Why not top K elements

  1. Since in Presto computations are distributed at the block level, maintaining a list of the top K elements risks missing values: a specific element may fail to make the top-K list within each individual block, yet belong in the top-K list once the blocks are put together. Such elements would be missed by a top_k_elements version of the function. By using min_percent_share, each individual block always retains all elements that are above min_percent_share. Moreover, when merging blocks, an element's percent_share can only go down relative to its highest percent_share in any one block, so there is no risk of missing an element that is above min_percent_share overall (see the merge sketch after this list).
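
A minimal sketch of a lossless merge, written as methods added to the two hypothetical classes above, assuming both partial states were built with identical ε and δ (hence identical sketch dimensions and hash seeds):

```java
// In ConservativeCountMinSketch: element-wise summation of the counter arrays
// preserves the upper-bound property, since each counter is >= the true count
// of everything hashed to it in each input.
public void mergeWith(ConservativeCountMinSketch other)
{
    for (int row = 0; row < depth; row++) {
        for (int b = 0; b < width; b++) {
            counts[row][b] += other.counts[row][b];
        }
    }
}

// In ApproxHeavyHittersAccumulator: union the candidate sets and re-filter
// against the combined threshold. Any value above min_percent_share overall is
// above it in at least one input, so it is guaranteed to be in the union.
public void mergeWith(ApproxHeavyHittersAccumulator other)
{
    sketch.mergeWith(other.sketch);
    m += other.m;
    long threshold = (long) Math.ceil(m * minPercentShare / 100);
    Map<String, Long> union = new HashMap<>(candidates);
    union.putAll(other.candidates);
    candidates.clear();
    minHeap.clear();
    for (String value : union.keySet()) {
        long count = sketch.estimateCount(value);
        if (count >= threshold) {
            candidates.put(value, count);
            minHeap.add(value);
        }
    }
}
```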

Resources:

  1. http://theory.stanford.edu/~tim/s17/l/l2.pdf
  2. This algorithm is also used by Redis to implement top hitters: https://redislabs.com/blog/count-min-sketch-the-art-and-science-of-estimating-stuff/
  3. https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
rongrong commented 5 years ago

Can you provide some example use cases? It's not clear to me what kind of use case would require approximation on top N, or how using approximation would help with the memory footprint. Thanks!

cploonker commented 5 years ago

@rongrong thanks for your attention.

Let us say we have a huge table with each row representing a domain and the user visiting it. If my interest is only in knowing the top domains that are visited the most, and how many times each is visited, this function would make that kind of query easy. Imagine if I want the same result but for each country.

About the memory: as described in the logic above, since we are using a count-min sketch the memory footprint is greatly reduced. Here is a link showing that 40 MB of data can be held in 48 KB with a count-min sketch: https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/

Hope this answers your question, and please feel free to let me know if I can answer any more questions.

jonw73330 commented 5 years ago

@rongrong - Similar to cploonker's feedback

max_by(,,N) is a reasonably common pattern. For datasets that are very high cardinality (and usually highly skewed), being able to identify the top items is important (e.g., top IDs, words, URLs). The current implementation excludes many workloads from using Presto due to memory constraints. Having a small error rate in these datasets is an acceptable compromise. This seems consistent with the other HLL approx functions.

tompetrillo commented 5 years ago

I also think this would be useful. Suppose I have a table of errors and the different products the errors were experienced in; then it is very useful to see the top 4 or 5 most commonly occurring errors for each product. Generally speaking, this is a more advanced version of MODE, which is also not available out of the box. There are methods to obtain it, but the approximation dramatically improves speed. I would like this.

tompetrillo commented 5 years ago

I would comment that I think the name should change: APPROX should go at the end of the function name. Something like MOST_FREQUENT_OCCURING_APPROX has a more obvious meaning.

abhrajitmukherjee commented 5 years ago

This is a much-needed feature and will be super useful for the data engineers in the field who heavily use Presto.

cploonker commented 5 years ago

> I would comment that I think the name should change: APPROX should go at the end of the function name. Something like MOST_FREQUENT_OCCURING_APPROX has a more obvious meaning.

@tompetrillo, I would let the Presto team decide the function name to align with their naming conventions. I don't have any strong opinion about the name.

tdcmeehan commented 5 years ago

The above algorithm will work in specific cases where we know the incoming dataset and can carefully tune our epsilon and delta values. In general, however, it seems to be an undue burden to ask that a user give good values for delta and epsilon. The costs of getting it wrong are steep: a saturated CMS will falsely report heavy hitters with high probability. The function could throw once it's saturated to a point, but this increases the overhead and burden of the function. It might also push people to use larger sizes than necessary. The holy grail is an algorithm that adapts to the input, while still preserving lossless merging of intermediate aggregate states as will be necessary in Presto, without data quality compromises for unskewed distributions.

joelbecker commented 4 years ago

I just want to +1 this. This is a useful pattern that shouldn't require a subquery or additional CTE.