rockset / community

Rockset community content
Apache License 2.0
16 stars 6 forks source link

Percentile Aggregate Functions #13

Open bjovanovic84 opened 4 years ago

bjovanovic84 commented 4 years ago

Is your feature request related to a problem? Please describe. It would be very useful to add a percentile aggregation function, in addition to the standard min, max, avg. When performing analysis it's often a requirement to report p95, p99 values, it's not easy to do so right now.

Describe the solution you'd like Adding something like [MIN|MAX]_PCT(<percentile>, <column>) would be ideal, to be used like MAX_PCT(99, value), MIN_PCT(20, value) etc.

larskoole commented 3 years ago

Definitely something we are also looking forward too. We use percentile for so many of our statistics and charts. This is the only thing that is stopping us from using Rockset.

benhannel commented 3 years ago

You can emulate this functionality with the NTH_VALUE window function.

luketn commented 2 years ago

This worked for me:

SELECT
    v.attributeId, 
    MIN(v.value) "min",
    ELEMENT_AT(ARRAY_SORT(ARRAY_AGG(v.value)), LENGTH(ARRAY_AGG(v.value))*0.25) twenty_fifth,
    ELEMENT_AT(ARRAY_SORT(ARRAY_AGG(v.value)), LENGTH(ARRAY_AGG(v.value))*0.5) median,
    ELEMENT_AT(ARRAY_SORT(ARRAY_AGG(v.value)), LENGTH(ARRAY_AGG(v.value))*0.75) seventy_fifth,
    ELEMENT_AT(ARRAY_SORT(ARRAY_AGG(v.value)), LENGTH(ARRAY_AGG(v.value))*0.95) ninety_fifth,
    ELEMENT_AT(ARRAY_SORT(ARRAY_AGG(v.value)), LENGTH(ARRAY_AGG(v.value))*0.99) ninety_ninth,
    MAX(v.value) "max"
FROM
    commons.attributeValues v
group by v.attributeId

@benhannel can you see any issue with that approach? Would there be any way to avoid the repetition of:

ARRAY_AGG(v.value)

throughout the query?

benhannel commented 2 years ago

Yes, the approach you have there will work. You can also use a subquery to avoid some duplication of ARRAY_SORT(ARRAY_AGG(v.value)). There are also some window functions which could be useful to you: https://rockset.com/docs/window-functions/#ntile

It's worth noting that any mechanism to compute an exact percentile value will take memory which is linear in the number of input rows. Max and standard deviation are much cheaper to compute, if sufficient for your use case.