numpy / numpy

The fundamental package for scientific computing with Python.
https://numpy.org
Other
28.02k stars 10.1k forks source link

weighted percentile #6326

Closed anntzer closed 1 year ago

anntzer commented 9 years ago

Support for weights in percentile would be nice to have. A quick look suggests https://github.com/nudomarinero/wquantiles; I'd be happy to make a PR out of this implementation if there's interest.

shoyer commented 9 years ago

The linked implementation uses a sort to do weighted quantiles. From a performance perspective, that is definitely not ideal -- I'm pretty sure this can be done in time O(n) instead of O(n log(n)).

anntzer commented 9 years ago

It is pretty clear that (a trivial adaptation of) quickselect can achieve O(n) performance for extracting a single percentile (not that I'd really want to write one).

However (while this is probably a separate issue), I believe that a basic approach will mean (even in the unweighted case) a O(kn) cost for computing k percentiles. Indeed, a basic timing of np.percentile(np.arange(n), np.linspace(0, 100, n)) for various n suggests a O(n²) cost for extracting n percentiles off a size n array, where the "naïve" sorting approach trivially achieves O(k + n log n).

NeilGirdhar commented 9 years ago

@anntzer I believe the P-square algorithm can find n percentiles in linear time in the size of the data and maybe logarithmic in the size of the buckets?

anntzer commented 9 years ago

A quick looks seems to suggest that this is an approximate algorithm, which is too bad.

Investigating a bit more, I see that for small arrays (100 elements), full sorting is faster that introselect even for a single percentile! For 1000 elements, full sorting is faster as soon as we request at least 5 (equally spaced) percentile points. (All timings were done with interpolation="higher" to simplify the sorting version of the code, but this in fact helps the introselect version more, as the default interpolation="linear" effectively doubles the number of percentiles that are needed).

madphysicist commented 8 years ago

I have been doing some research on this. The best I have been able to determine so far is O(log^2(n)) for partitioning with arbitrary weights. This is not a trivial problem.

anntzer commented 8 years ago

FWIW, https://www.mpi-inf.mpg.de/~mehlhorn/ftp/multiselection.ps claims to present a "near optimal" algorithm for multiple selection.

eric-wieser commented 4 years ago

Related to #8935

lorentzenchr commented 1 year ago

Could this issue be closed as predecessing duplicate of #8935?

seberg commented 1 year ago

Yeah, sure, why not. If anyone seriously pushes this again they probably find if there is anything interesting here. Thanks for the note.