pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Weighted quantiles #10726

Open tmct opened 1 year ago

tmct commented 1 year ago

Problem description

Hi - I would find it very useful to be able to perform the "quantile" methods on DataFrames (and ideally LazyFrames) with an optional sample weight column please.

Something like 'df.quantile(0.5, weights="w")'. (I think I would expect this to "drop" the weights column in the result.)

While this is a slightly more complex weighted statistic than those suggested in #7499 (which I may turn my attention to in future!), I believe it's more useful to fix as I don't believe there is an obvious performant workaround. I hope that this operation being somewhat simple and well-defined would justify its presence as not too much of a maintenance burden.

If this feature were acceptable in principle, I would be happy to implement it myself, given a little direction!

I wonder how much integration might be worthwhile. Perhaps functionality could be added in (non-breaking) stages:

  1. Add this to DataFrame only: only accepting optional column name
  2. Same for LazyFrame
  3. Allowing more general inputs than string Exprs, and potentially adding to Series, Expr etc?

Thoughts much appreciated! Thanks, Tom

tmct commented 1 year ago

On reflection, this might not be the most useful feature for me and my team right now, and is probably a little tricky - I might attempt an easier PR first as a first contribution. But it would still be nice to have.

stinodego commented 1 year ago

Could you add an example with some small sample data? I'm not sure what you're looking for exactly.

Possibly, the redesign discussed in #10468 may address this.

s-banach commented 1 year ago

The standard qcut divides a column into buckets containing (approximately) equal counts. Weighted qcut would divide a column into buckets containing (approximately) equal weights.
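
To make the distinction concrete, here is a minimal plain-Python sketch (not Polars code) of weighted bucketing. The function name `weighted_bucket_ids` and the choice to place each row at the midpoint of its weight span are assumptions made for illustration; other conventions are possible.

```python
def weighted_bucket_ids(values, weights, n_buckets):
    """Assign each value a bucket id so buckets hold roughly equal total weight."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    total = sum(weights)
    ids = [0] * len(values)
    running = 0.0
    for i in order:
        # Position of this row in [0, 1): midpoint of its span of cumulative weight.
        position = (running + 0.5 * weights[i]) / total
        ids[i] = min(int(position * n_buckets), n_buckets - 1)
        running += weights[i]
    return ids


# Equal weights: behaves like a count-based qcut, two values per bucket.
equal = weighted_bucket_ids([10, 20, 30, 40], [1, 1, 1, 1], 2)
print(equal)  # [0, 0, 1, 1]

# Heavy weight on the smallest value: it fills the first bucket on its own.
heavy = weighted_bucket_ids([10, 20, 30, 40], [3, 1, 1, 1], 2)
print(heavy)  # [0, 1, 1, 1]
```

In the second call each bucket carries a total weight of 3, even though the row counts are 1 and 3.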

zundertj commented 1 year ago

Small example of weighted quantile:

>>> df = pl.DataFrame({"value": [1,2,3,4,-1], "weight":[0.1, 0.1, 0.1, 2, 0.2]})
>>> df
shape: (5, 2)
┌───────┬────────┐
│ value ┆ weight │
│ ---   ┆ ---    │
│ i64   ┆ f64    │
╞═══════╪════════╡
│ 1     ┆ 0.1    │
│ 2     ┆ 0.1    │
│ 3     ┆ 0.1    │
│ 4     ┆ 2.0    │
│ -1    ┆ 0.2    │
└───────┴────────┘

Implementation using current Python api, no interpolations:

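quantile = 0.5
# Midpoint of each row's cumulative weight share, in value order (cum_sum was named cumsum in older Polars).
dfs = df.sort("value").with_columns(cumw=(pl.col("weight").cum_sum() - 0.5*pl.col("weight"))/pl.col("weight").sum())
# Pick the value whose midpoint share is closest to the requested quantile.
dfs.select(pl.col("value").sort_by((pl.col("cumw")-quantile).abs()).first()).item()  # returns 4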

s-banach commented 1 year ago

I believe the following two methods are equivalent:

I think doing both methods (subtracting 0.5 * w and searching for the nearest value) is putting a hat on a hat.

I believe you can do something like this:

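def get_weighted_quantiles(df: pl.DataFrame, q: list[float]) -> pl.DataFrame:
    # Assumes q is ascending, df is already sorted by value, and the weight column is named "w".
    quantiles = pl.DataFrame({"q": q}).set_sorted("q")
    # Replace "w" with its cumulative weight share (cum_sum was named cumsum in older Polars).
    df = df.with_columns(pl.col("w").cum_sum() / pl.sum("w")).set_sorted("w")
    # For each q, take the row whose cumulative share is nearest.
    return quantiles.join_asof(df, left_on="q", right_on="w", strategy="nearest").drop("w")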

P.S. Could be wrong, not a statistician.
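
As a rough check, the two conventions discussed in this thread can be compared on the sample data in plain Python (no Polars). The helper names `cumulative_shares` and `nearest_quantile` and the `shift` parameter are made up for this sketch: `shift=0.5` mirrors the "subtract 0.5 * weight" variant, and `shift=0.0` mirrors a nearest match on the plain cumulative share, which is how I read the join_asof version.

```python
def cumulative_shares(values, weights, shift):
    """Return (value, share) pairs in value order.

    share is the cumulative weight fraction up to and including the value,
    minus shift times the value's own weight fraction.
    """
    total = sum(weights)
    running = 0.0
    shares = []
    for value, weight in sorted(zip(values, weights)):
        running += weight
        shares.append((value, (running - shift * weight) / total))
    return shares


def nearest_quantile(values, weights, q, shift):
    """Return the value whose (shifted) cumulative share is closest to q."""
    shares = cumulative_shares(values, weights, shift)
    return min(shares, key=lambda pair: abs(pair[1] - q))[0]


values = [1, 2, 3, 4, -1]
weights = [0.1, 0.1, 0.1, 2, 0.2]

midpoint = nearest_quantile(values, weights, 0.5, shift=0.5)
endpoint = nearest_quantile(values, weights, 0.5, shift=0.0)
print(midpoint, endpoint)  # 4 3
```

On this data the two variants return different medians (4 versus 3), so the choice of convention matters when one value carries most of the weight.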

lorentzenchr commented 1 year ago

Before implementing weighted quantiles, I would suggest starting with the weighted mean! Note that weighted quantiles can turn out to be a rabbit hole as to what the interpretation of weights should be. Even numpy does not (yet!) have it, see https://github.com/numpy/numpy/pull/24254.