pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

qcut to make equally-sized bins (just like R's ntile) #9696

Open AliPearson opened 1 year ago

AliPearson commented 1 year ago

Problem description

In Pandas, qcut can create equally-sized bins

[screenshot: pandas qcut output showing equally-sized bins]

Polars' qcut seems to behave more like np.quantile.

Would it be worth the effort to implement something like the following (a super basic finance example)?

df.select(pl.col('credit_score').ntile(10).alias('credit_score_bucket'))

I would actually be willing to make the PR—just querying the community before I make the investment

avimallu commented 1 year ago

Since you mentioned you might be willing to help, here's a PR that is working on adding cut and qcut as expressions: https://github.com/pola-rs/polars/pull/9580. Perhaps you can work off of, or in conjunction with, that?

magarick commented 1 year ago

@AliPearson Are you looking for a version of qcut that bins by evenly spaced quantiles, or one that creates evenly sized bins the way dplyr::ntile does? The former I'm planning to do soon; the latter... I don't get. According to ntile's documentation, "it will create evenly sized buckets even if the same value of x ends up in different buckets", which will either do the same thing as even quantiles or arbitrarily bucket your data if you use it on a column with a lot of repeated values.

mkleinbort-ic commented 1 year ago

Just jumping in here, but I found that implementing .percentile() using the existing .rank() and .count() was very straightforward, and arguably the more general method; it can then be filtered down to ntiles.

Here is some code:

import polars as pl

def _percentile(self: pl.Expr, method: str = 'average') -> pl.Expr:
    # Fraction of values ranked at or above this one: descending rank / count
    return self.rank(descending=True, method=method) / self.count()

# Monkey-patch onto pl.Expr so it reads like a built-in expression
pl.Expr.percentile = _percentile

ans = df.with_columns(
    credit_score_percentile_by_region=pl.col('credit_score').percentile().over('region')
)
AliPearson commented 1 year ago

> which will either do the same thing as even quantiles

@magarick Not sure what you mean here...evenly spaced quantiles != evenly sized bins

> or arbitrarily bucket your data if you use it on a column with a lot of repeated values.

True but I personally rarely find myself in this situation—plus if they're heavily repeating then they're already somewhat bucketed :)

magarick commented 1 year ago

> which will either do the same thing as even quantiles
>
> @magarick Not sure what you mean here...evenly spaced quantiles != evenly sized bins

If you don't have repeated elements, the bins should be evenly sized up to a difference of 1, like with ntile (though you're not guaranteed to have all the bigger bins at the front). Even if you do have repeated elements, you'll still be fine as long as no run of them crosses one of the relevant quantile indices. And even if a run does cross an index, you'll be close to correct unless there are a lot of them. There may be slight variations depending on how you estimate the actual quantiles when they fall between values, but it's at least very, very close.

> or arbitrarily bucket your data if you use it on a column with a lot of repeated values.
>
> True but I personally rarely find myself in this situation—plus if they're heavily repeating then they're already somewhat bucketed :)

Right! That's why I'm wondering about the use case. I can't imagine a situation where you'd want to evenly bin your data by a value that has enough repetition that you'll get meaningfully different bin sizes than using quantiles. I get wanting to partition data, but if you need to arbitrarily assign enough values due to repetitions that it makes a difference, maybe that's a bad thing?

It's easy enough to do and I'm happy to do it, but I think it's better to only add if there's a compelling reason as opposed to just having it.

EDIT: From the Pandas documentation you mentioned, it looks like passing an integer to qcut is doing evenly spaced quantiles, which is different from what ntile does.
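To make the distinction above concrete, here is a small plain-Python sketch (made-up data) of how the two approaches diverge when a run of ties crosses a bin boundary:

```python
# Sketch illustrating the point above: with heavy repetition, quantile-based
# bins become unequal, while an ntile-style split stays equal by arbitrarily
# assigning tied values to different buckets. Data is made up.
import math

xs = [1, 1, 1, 1, 1, 2, 3, 4]  # a long run of ties crossing the median
n = 2

# Quantile-style: everything <= the median lands in the low bin
median = sorted(xs)[len(xs) // 2 - 1]  # lower median for an even count
quantile_bins = [1 if x <= median else 2 for x in xs]

# ntile-style: split by sorted position, ties broken arbitrarily (by order)
order = sorted(range(len(xs)), key=lambda i: xs[i])
ntile_bins = [0] * len(xs)
for pos, i in enumerate(order):
    ntile_bins[i] = math.ceil((pos + 1) * n / len(xs))

# quantile_bins has sizes 5 and 3; ntile_bins has sizes 4 and 4, but the
# repeated value 1 is split across both buckets.
```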

deanm0000 commented 1 year ago

It seems this is what you're trying to do, at least by the example. Am I missing something else?

import polars as pl

df = pl.DataFrame({
    'age': [2, 67, 40, 32, 4, 15, 82, 99, 26, 30, 50, 78]
})
# Series.qcut API at the time returned a DataFrame with the breakpoints alongside
df['age'].qcut([(1 + x) / 3 for x in range(3)], maintain_order=True, category_label="age_group").drop('break_point')
shape: (12, 2)
┌──────┬───────────────────────────────────┐
│ age  ┆ age_group                         │
│ ---  ┆ ---                               │
│ f64  ┆ cat                               │
╞══════╪═══════════════════════════════════╡
│ 2.0  ┆ (-inf, 28.666666666666664]        │
│ 67.0 ┆ (55.666666666666664, 99.0]        │
│ 40.0 ┆ (28.666666666666664, 55.66666666… │
│ 32.0 ┆ (28.666666666666664, 55.66666666… │
│ …    ┆ …                                 │
│ 26.0 ┆ (-inf, 28.666666666666664]        │
│ 30.0 ┆ (28.666666666666664, 55.66666666… │
│ 50.0 ┆ (28.666666666666664, 55.66666666… │
│ 78.0 ┆ (55.666666666666664, 99.0]        │
└──────┴───────────────────────────────────┘

It seems the only difference is that pandas uses the lower bound of the series as the first breakpoint, while polars uses -inf.

randomgambit commented 1 year ago

Hello! Great post: ntile is really useful (and we can find the same function in Stata, too). Another big difference (in practice) is that ntile and pandas automatically replace the category labels with integers from 1 to the number of categories. This is very useful in finance settings as what we care about is the decile in itself (and not the exact boundaries that define it). Right now this labeling process has to be done manually... Do you think this could be added as a feature?