pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0
3.62k stars 1.08k forks source link

Weighted quantile #1371

Closed chunweiyuan closed 4 months ago

chunweiyuan commented 7 years ago

For our work we frequently need to compute weighted quantiles. This is especially important when we need to weigh data from recent years more heavily in making predictions.

I've put together a function (called weighted_quantile) largely based on the source code of np.percentile. It allows one to input weights along a single dimension, as a dict w_dict. Below are some manual tests:

When all weights = 1, it's identical to using np.nanpercentile:

>>> ar0
<xarray.DataArray (x: 3, y: 4)>
array([[3, 4, 8, 1],
       [5, 3, 7, 9],
       [4, 9, 6, 2]])
Coordinates:
  * x        (x) |S1 'a' 'b' 'c'
  * y        (y) int64 0 1 2 3
>>> ar0.quantile(q=[0.25, 0.5, 0.75], dim='y')
<xarray.DataArray (quantile: 3, x: 3)>
array([[ 2.5 ,  4.5 ,  3.5 ],
       [ 3.5 ,  6.  ,  5.  ],
       [ 5.  ,  7.5 ,  6.75]])
Coordinates:
  * x         (x) |S1 'a' 'b' 'c'
  * quantile  (quantile) float64 0.25 0.5 0.75
>>> weighted_quantile(da=ar0, q=[0.25, 0.5, 0.75], dim='y', w_dict={'y': [1,1,1,1]})
<xarray.DataArray (quantile: 3, x: 3)>
array([[ 2.5 ,  4.5 ,  3.5 ],
       [ 3.5 ,  6.  ,  5.  ],
       [ 5.  ,  7.5 ,  6.75]])
Coordinates:
  * x         (x) |S1 'a' 'b' 'c'
  * quantile  (quantile) float64 0.25 0.5 0.75

Now different weights:

>>> weighted_quantile(da=ar0, q=[0.25, 0.5, 0.75], dim='y', w_dict={'y': [1,2,3,4.0]})
<xarray.DataArray (quantile: 3, x: 3)>
array([[ 3.25    ,  5.666667,  4.333333],
       [ 4.      ,  7.      ,  5.333333],
       [ 6.      ,  8.      ,  6.75    ]])
Coordinates:
  * x         (x) |S1 'a' 'b' 'c'
  * quantile  (quantile) float64 0.25 0.5 0.75

Also handles nan values like np.nanpercentile:

>>> ar
<xarray.DataArray (x: 2, y: 2, z: 2)>
array([[[ nan,   3.],
        [ nan,   5.]],

       [[  8.,   1.],
        [ nan,   0.]]])
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 0 1
  * z        (z) int64 8 9
>>> da_stacked = ar.stack(mi=['x', 'y'])
>>> out = weighted_quantile(da=ar, q=[0.25, 0.5, 0.75], dim=['x', 'y'], w_dict={'x': [1, 1]})
>>> out
<xarray.DataArray (quantile: 3, z: 2)>
array([[ 8.  ,  0.75],
       [ 8.  ,  2.  ],
       [ 8.  ,  3.5 ]])
Coordinates:
  * z         (z) int64 8 9
  * quantile  (quantile) float64 0.25 0.5 0.75
>>> da_stacked.quantile(q=[0.25, 0.5, 0.75], dim='mi')
<xarray.DataArray (quantile: 3, z: 2)>
array([[ 8.  ,  0.75],
       [ 8.  ,  2.  ],
       [ 8.  ,  3.5 ]])
Coordinates:
  * z         (z) int64 8 9
  * quantile  (quantile) float64 0.25 0.5 0.75

Lastly, different interpolation schemes are consistent:

>>> out = weighted_quantile(da=ar, q=[0.25, 0.5, 0.75], dim=['x', 'y'], w_dict={'x': [1, 1]}, interpolation='nearest')
>>> out
<xarray.DataArray (quantile: 3, z: 2)>
array([[ 8.,  1.],
       [ 8.,  3.],
       [ 8.,  3.]])
Coordinates:
  * z         (z) int64 8 9
  * quantile  (quantile) float64 0.25 0.5 0.75
>>> da_stacked.quantile(q=[0.25, 0.5, 0.75], dim='mi', interpolation='nearest')
<xarray.DataArray (quantile: 3, z: 2)>
array([[ 8.,  1.],
       [ 8.,  3.],
       [ 8.,  3.]])
Coordinates:
  * z         (z) int64 8 9
  * quantile  (quantile) float64 0.25 0.5 0.75

We wonder if it's ok to make this part of xarray. If so, the most logical place to implement it would seem to be in Variable.quantile(). Another option is to make it a utility function, to be called as xr.weighted_quantile().

shoyer commented 7 years ago

I'm sure this is useful, but we try to avoid putting new numeric methods in xarray itself. Would the underlying weighted quantile method (on NumPy arrays) be appropriate for numpy or scipy? Then we might consider adding a wrapper function in xarray (though again, we have to be cautious to avoid overloading xarray with too many methods).

chunweiyuan commented 7 years ago

Cool, let me toss it out on the numpy board and see what they think.

stale[bot] commented 5 years ago

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically

dcherian commented 5 years ago

We could also add things like this to a cookbook section in the docs

chunweiyuan commented 5 years ago

So, as per suggestion by Stephan, I submitted the PR to numpy instead: https://github.com/numpy/numpy/pull/9211

Despite having possibly the highest number of comments of all active numpy PRs and passing all the tests, it's been sitting in limbo for the last few months. There seems to be quite some uncertainty about what to do with it, and the discussion has gone off tangent a bit to more upstream issues.

My hope is still to have it eventually merged into numpy, so that it could be easily ported to xarray. It's proven to be quite useful to my coworkers, as well as some of the numpy users. I believe it'll also serve xarray well. Thank you all for your patience.

max-sixty commented 5 years ago

Would you like to leave this open @chunweiyuan ?

That's quite a thread! I'm impressed by your persistence

chunweiyuan commented 5 years ago

My personal hope is to keep this thread open just for the record. But given the non-activity on the numpy end, I honestly can't promise any resolution to this issue in the near future. Thanks!

PS I persist because some people do seem to appreciate that PR and have forked it for their own use :)

shoyer commented 5 years ago

NumPy does have a pretty bad review back-log :(

On Fri, Mar 15, 2019 at 11:01 AM chunweiyuan notifications@github.com wrote:

My personal hope is to keep this thread open just for the record. But given the non-activity on the numpy end, I honestly can't promise any resolution to this issue in the near future. Thanks!

PS I persist because some people do seem to appreciate that PR and have forked it for their own use :)

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/pydata/xarray/issues/1371#issuecomment-473387146, or mute the thread https://github.com/notifications/unsubscribe-auth/ABKS1gzAFmpKOTSi0ljefC46A79vp8m1ks5vW9_8gaJpZM4M73c4 .

max-sixty commented 4 months ago

Closing as stale but please reopen if sufficiently pressing.

FYI there are quantile routines in numbagg — I'd be happy to accept a PR adding a weights arg to that....