Closed MichaelChirico closed 3 years ago
I think it's that way for historical reasons, and the conversation in the linked issue above confirms it. We'll be happy to make the change here once it's available in airlift.
airlift is now done
This issue has been automatically marked as stale because it has not had any activity in the last 2 years. If you feel that this issue is important, just comment and the stale tag will be removed; otherwise it will be closed in 7 days. This is an attempt to ensure that our open issues remain valuable and relevant so that we can keep track of what needs to be done and prioritize the right things.
From `approx_percentile(x, w, percent)`:

I don't fully understand the implementation of `approx_percentile` when weights are used. IIUC it's driven by the Q-digest data type from this paper: http://www.cs.virginia.edu/~son/cs851/papers/ucsb.sensys04.pdf

But I don't see any mention of incorporating weights there; I poked around in the `presto` source code, but as near as I can tell the work is being sent upstream to `airlift`, and I don't understand what's going on there.

It's very common to have double weights (e.g. I guess most users are accustomed to normalizing weights to sum to one; the case that brought me to post this was about using an exponential weighting kernel in a geospatial setting, where weights are between 0 and 1 but can sum to anything). Is there anything preventing the implementation from using double weights?
It's of course possible to kludge this by doing `cast(pow(10, k)*double_weight, bigint)` for some `k` of your choosing (or `pow(2, k)`); this introduces some noise to the estimation, but the function is already called `approx_percentile`.

I did the following to explore this approach over a variety of data sizes in terms of (1) accuracy as a function of `k` and (2) impact of `k` on timing. Basically I found that a larger `k` can increase timing a fair amount, but the impact on accuracy is hard to notice in this extremely toy example:

CSV as TXT
Took quite a while to run so saving the timings here for future reference
approx_percentiles_timings.txt
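For anyone who wants to reason about the rounding kludge without a Presto cluster, here is a minimal Python sketch of the idea. It is not the Q-digest algorithm and not what `airlift` does: `weighted_percentile` and `scale_weights` are hypothetical helper names, the percentile is computed exactly by sorting, and the sketch assumes the `bigint` cast rounds to the nearest integer. It only isolates the error that comes from turning double weights into integers for a given `k`.

```python
import math
import random


def weighted_percentile(values, weights, percent):
    """Exact weighted percentile: the smallest value whose cumulative
    weight reaches percent * total weight. Accepts int or float weights."""
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    threshold = percent * total
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= threshold:
            return value
    return pairs[-1][0]


def scale_weights(weights, k):
    """Stand-in for cast(pow(10, k) * double_weight, bigint).
    Assumption: the cast rounds to nearest. Weights that round to 0
    contribute nothing in this sketch."""
    return [int(math.floor(10 ** k * w + 0.5)) for w in weights]


if __name__ == "__main__":
    random.seed(0)
    values = [random.gauss(0, 1) for _ in range(10_000)]
    # Weights in (0, 1] that do not sum to anything in particular,
    # loosely imitating an exponential kernel.
    weights = [math.exp(-random.uniform(0, 5)) for _ in values]

    exact = weighted_percentile(values, weights, 0.5)
    for k in range(1, 7):
        rounded = weighted_percentile(values, scale_weights(weights, k), 0.5)
        print(k, abs(rounded - exact))
```

With a small `k`, many small weights collapse to 0 or 1, so the comparison against the double-weight result gives a rough feel for how much precision a given `k` buys before any sketching error from the digest itself is added on top.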
I suppose it'd be appropriate to do a more formal statistical analysis of the nature of the error introduced by this rounding approach before building it into `approx_percentile`...

Long story short: (1) Is it possible to adjust the current algorithm for weighted percentiles to accept `double` weights? (2) If not, is it possible to kludge the function into doing so, either by internally applying some rounding or by exposing a new parameter to dictate the degree of rounding?

(PS, I recognize the way of generating dummy tables of a given size with `cross join () cross join ()` is a bit clunky... any suggestion for improving this?)