Open tmct opened 1 year ago
On reflection, this might not be the most useful feature for me and my team right now, and is probably a little tricky - I might attempt an easier PR first as a first contribution. But it would still be nice to have.
Could you add an example with some small sample data? I'm not sure what you're looking for exactly.
The redesign discussed in #10468 may address this.
The standard qcut divides a column into buckets containing (approximately) equal counts. Weighted qcut would divide a column into buckets containing (approximately) equal weights.
Small example of weighted quantile:
>>> df = pl.DataFrame({"value": [1,2,3,4,-1], "weight":[0.1, 0.1, 0.1, 2, 0.2]})
shape: (5, 2)
┌───────┬────────┐
│ value ┆ weight │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═══════╪════════╡
│ 1 ┆ 0.1 │
│ 2 ┆ 0.1 │
│ 3 ┆ 0.1 │
│ 4 ┆ 2.0 │
│ -1 ┆ 0.2 │
└───────┴────────┘
Implementation using the current Python API, no interpolation (`cum_sum` is the current spelling of `cumsum`):
quantile = 0.5
dfs = df.sort("value").with_columns(cumw=(pl.col("weight").cum_sum() - 0.5*pl.col("weight"))/pl.col("weight").sum())
dfs.select(pl.col("value").sort_by((pl.col("cumw")-quantile).abs()).first()).item()  # returns 4
I believe the following two methods are equivalent:

1. `cw = w.cumsum() / w.sum()`. To find the q-th quantile, choose the value whose `cw` is nearest to q.
2. `cw = (w.cumsum() - 0.5 * w) / w.sum()`. To find the q-th quantile, do a backward search-sorted on `cw`.

I think doing both (subtracting `0.5 * w` and searching for the nearest value) is putting a hat on a hat.
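A quick numeric spot-check of that equivalence claim (a sketch with NumPy on made-up random data; the two indices should coincide because each entry of the second `cw` is the midpoint of consecutive entries of the first, ties at exact midpoints aside):

```python
import numpy as np

rng = np.random.default_rng(0)
v = np.sort(rng.normal(size=100))      # sorted values
w = rng.uniform(0.1, 2.0, size=100)    # positive weights
q = 0.5

# Method 1: nearest value of cw = cumsum(w) / sum(w)
cw1 = np.cumsum(w) / w.sum()
i1 = int(np.argmin(np.abs(cw1 - q)))

# Method 2: backward search-sorted on cw = (cumsum(w) - 0.5*w) / sum(w)
cw2 = (np.cumsum(w) - 0.5 * w) / w.sum()
i2 = int(np.searchsorted(cw2, q, side="right")) - 1
```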
I believe you can do something like this:
def get_weighted_quantiles(df: pl.DataFrame, q: list[float]):
    quantiles = pl.DataFrame({"q": q}).set_sorted("q")
    df = df.with_columns(pl.col("w").cum_sum() / pl.sum("w")).set_sorted("w")
    return quantiles.join_asof(df, left_on="q", right_on="w", strategy="nearest").drop("w")
P.S. Could be wrong, not a statistician.
Before implementing weighted quantiles, I would suggest to start with weighted mean first! Note that weighted quantiles can turn out to be a rabbit hole as to what the interpretation of weights should be. Even numpy does not (yet!) have it, see https://github.com/numpy/numpy/pull/24254.
Problem description
Hi - I would find it very useful to be able to perform the "quantile" methods on DataFrames (and ideally LazyFrames) with an optional sample-weight column, please.
Something like `df.quantile(0.5, weights="w")`. (I would expect this to drop the weights column in the result.)
While this is a slightly more complex weighted statistic than those suggested in #7499 (which I may turn my attention to in future!), I believe it's the more useful one to add, as I don't see an obvious performant workaround. I hope that this operation is simple and well-defined enough that it wouldn't be much of a maintenance burden.
If this feature were acceptable in principle, I would be happy to implement it myself, given a little direction!
I wonder how much integration might be worthwhile. Perhaps functionality could be added in (non-breaking) stages:
Thoughts much appreciated! Thanks, Tom