Weighted quantiles - Githubissues

tmct commented 1 year ago

Problem description

Hi - I would find it very useful to be able to perform the "quantile" methods on DataFrames (and ideally LazyFrames) with an optional sample weight column please.

Something like 'df.quantile(0.5, weights="w")'. (I think I would expect this to "drop" the weights column in the result.)

While this is a slightly more complex weighted statistic than those suggested in #7499 (which I may turn my attention to in future!), I believe it's more useful to fix as I don't believe there is an obvious performant workaround. I hope that this operation being somewhat simple and well-defined would justify its presence as not too much of a maintenance burden.

If this feature were acceptable in principle, I would be happy to implement it myself, given a little direction!

I wonder how much integration might be worthwhile. Perhaps functionality could be added in (non-breaking) stages:

1. Add this to DataFrame only, accepting only an optional column name
2. Same for LazyFrame
3. Allow more general inputs than a column-name string (such as Exprs), and potentially add this to Series, Expr, etc.

Thoughts much appreciated! Thanks, Tom

tmct commented 1 year ago

On reflection, this might not be the most useful feature for me and my team right now, and is probably a little tricky - I might attempt an easier PR first as a first contribution. But it would still be nice to have.

stinodego commented 1 year ago

Could you add an example with some small sample data? I'm not sure what you're looking for exactly.

Possibly, the redesign discussed in #10468 may address this.

s-banach commented 1 year ago

The standard qcut divides a column into buckets containing (approximately) equal counts. Weighted qcut would divide a column into buckets containing (approximately) equal weights.
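
To make the distinction concrete, here is a minimal pure-Python sketch of a weighted qcut. The function name `weighted_qcut` and the bucketing convention (assign each row by the fraction of total weight that precedes it in sorted order) are assumptions chosen for illustration, not an existing Polars API.

```python
def weighted_qcut(values, weights, n_buckets):
    """Assign each value a bucket id so buckets carry roughly equal total weight.

    Hypothetical helper: a row's bucket is determined by the share of total
    weight that lies strictly before it when rows are sorted by value.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    total = sum(weights)
    buckets = [0] * len(values)
    running = 0.0  # weight of all rows strictly before the current one
    for i in order:
        buckets[i] = min(int(running / total * n_buckets), n_buckets - 1)
        running += weights[i]
    return buckets


# Four rows, but the last one carries half of the total weight,
# so it gets a bucket of its own when cutting into two buckets.
print(weighted_qcut([1, 2, 3, 4], [1, 1, 1, 3], 2))  # [0, 0, 0, 1]
```

An unweighted qcut on the same four values would split them two and two; with weights, each bucket holds a total weight of 3.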

zundertj commented 1 year ago

Small example of weighted quantile:

>>> df = pl.DataFrame({"value": [1,2,3,4,-1], "weight":[0.1, 0.1, 0.1, 2, 0.2]})
shape: (5, 2)
┌───────┬────────┐
│ value ┆ weight │
│ ---   ┆ ---    │
│ i64   ┆ f64    │
╞═══════╪════════╡
│ 1     ┆ 0.1    │
│ 2     ┆ 0.1    │
│ 3     ┆ 0.1    │
│ 4     ┆ 2.0    │
│ -1    ┆ 0.2    │
└───────┴────────┘

Implementation using the current Python API, no interpolation:

quantile = 0.5
dfs = df.sort("value").with_columns(cumw=(pl.col("weight").cumsum() - 0.5*pl.col("weight"))/pl.col("weight").sum())
dfs.select(pl.col("value").sort_by((pl.col("cumw")-quantile).abs()).first()).item()  # returns 4
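
For readers without Polars at hand, the same computation can be sketched in plain Python. This is an illustrative restatement of the snippet above (sort by value, take midpoint cumulative weights, pick the row nearest to the requested quantile); the function name is made up for this example.

```python
from itertools import accumulate


def weighted_quantile_nearest_midpoint(values, weights, q):
    """Return the value whose midpoint cumulative weight fraction is nearest to q."""
    pairs = sorted(zip(values, weights))  # sort rows by value
    total = sum(w for _, w in pairs)
    cum = accumulate(w for _, w in pairs)
    # Midpoint of each row's weight interval, as a fraction of total weight.
    mid = [(c - 0.5 * w) / total for c, (_, w) in zip(cum, pairs)]
    idx = min(range(len(mid)), key=lambda i: abs(mid[i] - q))
    return pairs[idx][0]


values = [1, 2, 3, 4, -1]
weights = [0.1, 0.1, 0.1, 2, 0.2]
print(weighted_quantile_nearest_midpoint(values, weights, 0.5))  # 4
```

The midpoint fractions for the sorted rows are roughly 0.04, 0.10, 0.14, 0.18 and 0.60, so 0.5 is nearest to the last row, whose value is 4.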

s-banach commented 1 year ago

> Implementation using the current Python API, no interpolation:
>
> quantile = 0.5
> dfs = df.sort("value").with_columns(cumw=(pl.col("weight").cumsum() - 0.5*pl.col("weight"))/pl.col("weight").sum())
> dfs.select(pl.col("value").sort_by((pl.col("cumw")-quantile).abs()).first()).item()  # returns 4

I believe the following two methods are equivalent:

1. Let cw = w.cumsum() / w.sum(). To find the q quantile, choose the nearest value of cw.
2. Let cw = (w.cumsum() - 0.5 * w) / w.sum(). To find the q quantile, do backward search-sorted on cw.

I think doing both methods (subtracting 0.5 * w and searching for the nearest value) is putting a hat on a hat.
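
A quick way to check the claimed equivalence is to run both methods side by side. The sketch below is plain Python with made-up function names; it assumes "backward search-sorted" means taking the last row whose cw is at most q, falling back to the first row when q lies below all of them. Exact ties between neighbouring rows are not handled specially.

```python
from bisect import bisect_right
from itertools import accumulate


def nearest_on_cumulative(values, weights, q):
    """Method 1: row whose normalised cumulative weight is nearest to q."""
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    cw = [c / total for c in accumulate(w for _, w in pairs)]
    idx = min(range(len(cw)), key=lambda i: abs(cw[i] - q))
    return pairs[idx][0]


def backward_on_midpoints(values, weights, q):
    """Method 2: last row whose midpoint cumulative weight is <= q."""
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    cum = accumulate(w for _, w in pairs)
    mid = [(c - 0.5 * w) / total for c, (_, w) in zip(cum, pairs)]
    idx = max(bisect_right(mid, q) - 1, 0)  # clamp to the first row
    return pairs[idx][0]


values = [1, 2, 3, 4, -1]
weights = [0.1, 0.1, 0.1, 2, 0.2]
for q in (0.05, 0.15, 0.3, 0.5, 0.7):
    print(q, nearest_on_cumulative(values, weights, q),
          backward_on_midpoints(values, weights, q))
```

On the sample data from earlier in the thread the two methods agree at every q tried here. Note that at q = 0.5 both return 3, whereas the combined approach (midpoints plus nearest) returns 4, so combining the two steps does change the result.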

I believe you can do something like this:

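import polars as pl

def get_weighted_quantiles(df: pl.DataFrame, q: list[float]):
    # Assumes df is sorted by value, weights are in column "w", and q is ascending.
    # Note: newer Polars releases spell cumsum() as cum_sum().
    quantiles = pl.DataFrame({"q": q}).set_sorted("q")
    df = df.with_columns(pl.col("w").cumsum() / pl.sum("w")).set_sorted("w")
    return quantiles.join_asof(df, left_on="q", right_on="w", strategy="nearest").drop("w")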

P.S. Could be wrong, not a statistician.

lorentzenchr commented 1 year ago

Before implementing weighted quantiles, I would suggest starting with the weighted mean! Note that weighted quantiles can turn out to be a rabbit hole as to what the interpretation of weights should be. Even numpy does not (yet!) have it, see https://github.com/numpy/numpy/pull/24254.
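
For comparison, the weighted mean has a single unambiguous definition, sum(w * x) / sum(w), which is part of why it is the easier starting point. A minimal plain-Python sketch (the function name is illustrative only):

```python
def weighted_mean(values, weights):
    """Weighted mean: sum of weight * value divided by the sum of weights."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total


# Sample data from earlier in the thread; the result is approximately 3.36.
print(weighted_mean([1, 2, 3, 4, -1], [0.1, 0.1, 0.1, 2, 0.2]))
```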

pola-rs / polars

Weighted quantiles #10726
