pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
29.11k stars 1.83k forks source link

Adding `descending` parameter to `Expr.over` #17674

Open mcrumiller opened 1 month ago

mcrumiller commented 1 month ago

Description

Expr.over() has an order_by parameter, but I cannot see an easy way to use a descending sort.

cmdlineluser commented 1 month ago

~Would it be the same as order_by=expr.reverse()?~

Silly me, sorry for the noise.

Arengard commented 1 month ago
import polars as pl
from polars import col as _

data = pl.DataFrame(
[
pl.Series('row_id', [3, 2, 6, 1], dtype=pl.Int64),
pl.Series('city', ['Henderson', 'Henderson', 'Los Angeles', 'Los Angeles'], dtype=pl.String),
pl.Series('sales', [261.96, 48.86, 14.62, 731.94], dtype=pl.Float64),
]
)

print(
data.with_columns(cum_sum = _.sales.cum_sum().over(partition_by=_.city, order_by=[_.city,_.row_id.reverse()]))
)
#same as
print(
data.with_columns(cum_sum = _.sales.cum_sum().over(partition_by=_.city, order_by=[_.city,_.row_id]))
)
mcrumiller commented 1 month ago

That is not obvious at all, sorting by a reversed column? See the following:

import polars as pl
from polars import col

df = pl.DataFrame({"a": [1, 4, 2, 3]})
print(df.sort(by=col("a").reverse())
shape: (4, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 3   │
│ 4   │
│ 1   │
│ 2   │
└─────┘

This is most definitely not a descending sort.

ritchie46 commented 1 month ago

I believe that won't work as it should be reversed within the groups.

mcrumiller commented 1 month ago

@ritchie46 --right. Is there a way to do a descending sort within groups via over, or is this issue justified?

tpaxman commented 1 month ago

I ran into this same issue today so +1 for this.

A use case for me is finding the row associated with the maximum value for each group.

Because there are cases where a group's maximum value occurs on multiple rows, using a this approach is not sufficient if I want only one row preserved per group:

.filter(value_col=pl.col('value_col').max().over('group_id')

Instead, I would want to do something the following (noting beforehand that the syntax for the order_by argument is not valid:

.filter(row_id=pl.col('row_id').first().over('group_id', order_by=('value_col', 'desc'))

In my case, this can be solved by doing this instead, where reversing sorting is not needed:

.filter(row_id=pl.col('row_id').last().over('group_id', order_by='value_col')

However, this would not necessarily be a possible solution for all scenarios, such as cases where needing to order by multiple columns in some combination of ascending and descending manners.