Open Teamon9161 opened 2 years ago
Can you make a stackoverlfow question? Then others can learn from the answer.
Sorry to bother you, I'm confused about two things:
(1) Ignoring the issue of null values: I also want to get the rank of each value
within its group
. The only way I can see to do this is by partitioning the dataframe by group
and applying .rank()
separately on each piece. Can you explain how this can be done with .groupby()
? I thought groupby always aggregates each group to a single row.
(2) Since the rank of a null value is clearly supposed to be null, shouldn't it be regarded as a bug that rank
fills nulls without raising a value error?
(1) use over
or (groupby
+ explode
).
import numpy as np
import polars as pl
df = pl.DataFrame({
'group': ['a'] * 3 + ['b'] * 3,
'value': [2, 1, None, 4, 5, 6],
})
df.select(pl.col('value').rank().over('group'))
┌───────┬───────┬──────┐
│ group ┆ value ┆ rank │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f32 │
╞═══════╪═══════╪══════╡
│ a ┆ 2 ┆ 3.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ a ┆ 1 ┆ 2.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ a ┆ null ┆ 1.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b ┆ 4 ┆ 1.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b ┆ 5 ┆ 2.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b ┆ 6 ┆ 3.0 │
└───────┴───────┴──────┘
(2) It should, but if there is an expression that can set masked values within each group in groupby context, we can easily handle NaN values in other functions.
On Stack Overflow, I've prototyped an approach (with working code) that will hopefully strike an acceptable balance among:
It's a long post. I would have included it here, but the code requires no changes to the Polars API.
And thank you, @ritchie46 for tolerating all the issues I logged recently as I developed this. (#4492,#4444, #4360, #4433 ... to name a few)
Since
rank
does not handle null values, I want to write arank
function that can handle null values.It works well if I didn't use
groupby
since I can usewhen-then-otherwise
to set values.However, in groupby context, the predicate
col("value").is_not_null()
inwhen->then->otherwise
is not an aggregation so I will getUsually I have to make some calculations within each group after
rank
and I am worried about performance if I usepartition_by
to split the DataFrame. So I hope that Polars can have expressions likenp.putmask
or similar functions that can set values within each group.