Open AliPearson opened 1 year ago
Since you mentioned you might be willing to help, here's a PR: https://github.com/pola-rs/polars/pull/9580 that is working on adding use of cut
and qcut
to expressions. Perhaps you can work off of or in conjunction with that?
@AliPearson Are you looking for a version of qcut that bins by evenly spaced quantiles or that creates evenly sized bins the way dplyr::ntile
does? The former, I'm planning to do soon, the latter... I don't get. According tontile
's documentation "it will create evenly sized buckets even if the same value of x ends up in different buckets" which will either do the same thing as even quantiles or arbitrarily bucket your data if you use it on a column with a lot of repeated values.
Just jumping in here, but I found that implementing .percentile()
using the existing .rank()
and .count()
was very straight forward and arguably the more general method, and then it can be filtered to ntiles.
here is some code:
def _percentile(self:pl.Expr, method='average')->pl.Expr:
return self.rank(descending=True, method=method) / self.count()
pl.Expr.percentile = _percentile
ans = (df
.with_columns(credit_score_percentile_by_region = pl.col('credit_score').percentile().over('region'))
)
which will either do the same thing as even quantiles
@magarick Not sure what you mean here...evenly spaced quantiles != evenly sized bins
or arbitrarily bucket your data if you use it on a column with a lot of repeated values.
True but I personally rarely find myself in this situation—plus if they're heavily repeating then they're already somewhat bucketed :)
which will either do the same thing as even quantiles
@magarick Not sure what you mean here...evenly spaced quantiles != evenly sized bins
If you don't have repeated elements the bins should be evenly sized up to a difference of 1 like with ntile (though you're not guaranteed to have all the bigger bins at the front). Even if you do have repeated elements, if there isn't a run of them crossing one of the relevant quantile indices you'll still be fine. And even if a run does cross an index, unless there are a lot you'll be close to correct. There may be some slight variations up to how you estimate the actual quantiles when they fall between values but it's at least very very close.
or arbitrarily bucket your data if you use it on a column with a lot of repeated values.
True but I personally rarely find myself in this situation—plus if they're heavily repeating then they're already somewhat bucketed :)
Right! That's why I'm wondering about the use case. I can't imagine a situation where you'd want to evenly bin your data by a value that has enough repetition that you'll get meaningfully different bin sizes than using quantiles. I get wanting to partition data, but if you need to arbitrarily assign enough values due to repetitions that it makes a difference, maybe that's a bad thing?
It's easy enough to do and I'm happy to do it, but I think it's better to only add if there's a compelling reason as opposed to just having it.
EDIT: From the Pandas documentation you mentioned, it looks like passing an integer to qcut
is doing evenly spaced quantiles, which is different from what ntile
does.
It seems this is what you're trying to do, at least by the example. Am I missing something else?
df=pl.DataFrame({
'age':[2,67,40,32,4,15,82, 99,26,30,50,78]
})
df['age'].qcut([(1+x)/3 for x in range(3)],maintain_order=True, category_label="age_group").drop('break_point')
shape: (12, 2)
┌──────┬───────────────────────────────────┐
│ age ┆ age_group │
│ --- ┆ --- │
│ f64 ┆ cat │
╞══════╪═══════════════════════════════════╡
│ 2.0 ┆ (-inf, 28.666666666666664] │
│ 67.0 ┆ (55.666666666666664, 99.0] │
│ 40.0 ┆ (28.666666666666664, 55.66666666… │
│ 32.0 ┆ (28.666666666666664, 55.66666666… │
│ … ┆ … │
│ 26.0 ┆ (-inf, 28.666666666666664] │
│ 30.0 ┆ (28.666666666666664, 55.66666666… │
│ 50.0 ┆ (28.666666666666664, 55.66666666… │
│ 78.0 ┆ (55.666666666666664, 99.0] │
└──────┴───────────────────────────────────┘
It seems the only difference is that pandas takes the lower bound of the series and polars takes -Inf
Hello! Great post: ntile
is really useful (and we can find the same function in Stata, too). Another big difference (in practice) is that ntile
and pandas
automatically replace the category labels with integers from 1 to the number of categories. This is very useful in finance settings as what we care about is the decile in itself (and not the exact boundaries that define it). Right now this labeling process has to be done manually... Do you think this could be added as a feature?
Problem description
In Pandas,
qcut
can create equally-sized binsPolars'
qcut
qcut seems to behave more likenp.quantile
.Would it be worth the effort to implement something like, giving a super basic finance example:
I would actually be willing to make the PR—just querying the community before I make the investment