Closed anntzer closed 1 year ago
The linked implementation uses a sort to do weighted quantiles. From a performance perspective, that is definitely not ideal -- I'm pretty sure this can be done in time O(n) instead of O(n log(n)).
It is pretty clear that (a trivial adaptation of) quickselect can achieve O(n) performance for extracting a single percentile (not that I'd really want to write one).
However (while this is probably a separate issue), I believe that a basic approach will mean (even in the unweighted case) a O(kn) cost for computing k percentiles. Indeed, a basic timing of np.percentile(np.arange(n), np.linspace(0, 100, n))
for various n suggests a O(n²) cost for extracting n percentiles off a size n array, where the "naïve" sorting approach trivially achieves O(k + n log n).
@anntzer I believe the P-square algorithm can find n percentiles in linear time in the size of the data and maybe logarithmic in the size of the buckets?
A quick looks seems to suggest that this is an approximate algorithm, which is too bad.
Investigating a bit more, I see that for small arrays (100 elements), full sorting is faster that introselect even for a single percentile! For 1000 elements, full sorting is faster as soon as we request at least 5 (equally spaced) percentile points. (All timings were done with interpolation="higher"
to simplify the sorting version of the code, but this in fact helps the introselect version more, as the default interpolation="linear"
effectively doubles the number of percentiles that are needed).
I have been doing some research on this. The best I have been able to determine so far is O(log^2(n))
for partitioning with arbitrary weights. This is not a trivial problem.
FWIW, https://www.mpi-inf.mpg.de/~mehlhorn/ftp/multiselection.ps claims to present a "near optimal" algorithm for multiple selection.
Related to #8935
Could this issue be closed as predecessing duplicate of #8935?
Yeah, sure, why not. If anyone seriously pushes this again they probably find if there is anything interesting here. Thanks for the note.
Support for weights in
percentile
would be nice to have. A quick look suggests https://github.com/nudomarinero/wquantiles; I'd be happy to make a PR out of this implementation if there's interest.