pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
30.28k stars 1.96k forks source link

Ability to set masked values within each group in groupby context #4281

Open Teamon9161 opened 2 years ago

Teamon9161 commented 2 years ago

Since rank does not handle null values, I want to write a rank function that can handle null values.

import numpy as np
import polars as pl

df = pl.DataFrame({
    'group': ['a'] * 3 + ['b'] * 3,
    'value': [2, 1, None, 4, 5, 6],
})
df
shape: (6, 2)
┌───────┬───────┐
│ group ┆ value │
│ ---   ┆ ---   │
│ str   ┆ i64   │
╞═══════╪═══════╡
│ a     ┆ 2     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a     ┆ 1     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a     ┆ null  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b     ┆ 4     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b     ┆ 5     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b     ┆ 6     │
└───────┴───────┘

It works well if I didn't use groupby since I can use when-then-otherwise to set values.

def valid_rank(expr: pl.Expr, reverse=False):
    """handle null values when rank"""
    FLOAT_MAX, FLOAT_MIN = np.finfo(float).max, np.finfo(float).min
    mask = expr.is_null()
    expr = expr.fill_null(FLOAT_MIN) if reverse else expr.fill_null(FLOAT_MAX) 
    return pl.when(~mask).then(expr.rank(reverse=reverse)).otherwise(None)

df.with_column(valid_rank(pl.col('value')))
shape: (6, 2)
┌───────┬───────┐
│ group ┆ value │
│ ---   ┆ ---   │
│ str   ┆ f32   │
╞═══════╪═══════╡
│ a     ┆ 2.0   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a     ┆ 1.0   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a     ┆ null  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b     ┆ 3.0   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b     ┆ 4.0   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b     ┆ 5.0   │
└───────┴───────┘

However, in groupby context, the predicate col("value").is_not_null() in when->then->otherwise is not an aggregation so I will get

ComputeError: the predicate 'not(col("value").is_null())' in 'when->then->otherwise' is not a valid aggregation and might produce a different number of rows than the groupby operation would

Usually I have to make some calculations within each group after rank and I am worried about performance if I use partition_by to split the DataFrame. So I hope that Polars can have expressions like np.putmask or similar functions that can set values within each group.

def valid_rank(expr: pl.Expr, reverse=False):
    """handle null values when rank"""
    FLOAT_MAX, FLOAT_MIN = np.finfo(float).max, np.finfo(float).min
    mask = expr.is_null()
    expr = expr.fill_null(FLOAT_MIN) if reverse else expr.fill_null(FLOAT_MAX)
    # return pl.putmask(expr.rank(reverse=reverse), mask, None)  # hope
    # return expr.rank(reverse=reverse).set(mask, None)  # hope
ritchie46 commented 2 years ago

Can you make a stackoverlfow question? Then others can learn from the answer.

Teamon9161 commented 2 years ago

How to set masked values within each group in groupby context using py-polars

s-banach commented 2 years ago

Sorry to bother you, I'm confused about two things:

(1) Ignoring the issue of null values: I also want to get the rank of each value within its group. The only way I can see to do this is by partitioning the dataframe by group and applying .rank() separately on each piece. Can you explain how this can be done with .groupby()? I thought groupby always aggregates each group to a single row.

(2) Since the rank of a null value is clearly supposed to be null, shouldn't it be regarded as a bug that rank fills nulls without raising a value error?

Teamon9161 commented 2 years ago

(1) use over or (groupby + explode).

import numpy as np
import polars as pl

df = pl.DataFrame({
    'group': ['a'] * 3 + ['b'] * 3,
    'value': [2, 1, None, 4, 5, 6],
})
df.select(pl.col('value').rank().over('group'))
┌───────┬───────┬──────┐
│ group ┆ value ┆ rank │
│ ---   ┆ ---   ┆ ---  │
│ str   ┆ i64   ┆ f32  │
╞═══════╪═══════╪══════╡
│ a     ┆ 2     ┆ 3.0  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ a     ┆ 1     ┆ 2.0  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ a     ┆ null  ┆ 1.0  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b     ┆ 4     ┆ 1.0  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b     ┆ 5     ┆ 2.0  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b     ┆ 6     ┆ 3.0  │
└───────┴───────┴──────┘

(2) It should, but if there is an expression that can set masked values within each group in groupby context, we can easily handle NaN values in other functions.

cbilot commented 2 years ago

On Stack Overflow, I've prototyped an approach (with working code) that will hopefully strike an acceptable balance among:

It's a long post. I would have included it here, but the code requires no changes to the Polars API.

And thank you, @ritchie46 for tolerating all the issues I logged recently as I developed this. (#4492,#4444, #4360, #4433 ... to name a few)