pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Add `drop_nulls` param for `group_by()` and `expr.over()` #11030

Open · stevenlis opened this issue 1 year ago

stevenlis commented 1 year ago

Problem description

As of Polars 0.19.1, `group_by()` and `expr.over()` treat a null value as a separate, valid group key.

import polars as pl
import pandas as pd

data = {
    'a': ['a', 'a', 'b', 'b', 'c', 'c'],
    'b': ['1', '1', '2', '2', None, None],  # 'b' contains null group keys
    'c': [1, 2, 1, 2, 1, 1]
}
df_pl = pl.DataFrame(data=data)
df_pd = pd.DataFrame(data=data)

# group_by: one aggregated row per ('a', 'b') group
df_pl.group_by(['a', 'b']).agg(pl.col('c').sum().alias('c_sum'))

# over: broadcast the per-group sum back onto every row
df_pl.with_columns(
    pl.col('c').sum().over(['a', 'b']).alias('c_sum')
)

"Group" null is included in the results:

shape: (3, 3)
┌─────┬──────┬───────┐
│ a   ┆ b    ┆ c_sum │
│ --- ┆ ---  ┆ ---   │
│ str ┆ str  ┆ i64   │
╞═════╪══════╪═══════╡
│ b   ┆ 2    ┆ 3     │
│ c   ┆ null ┆ 2     │
│ a   ┆ 1    ┆ 3     │
└─────┴──────┴───────┘

shape: (6, 4)
┌─────┬──────┬─────┬───────┐
│ a   ┆ b    ┆ c   ┆ c_sum │
│ --- ┆ ---  ┆ --- ┆ ---   │
│ str ┆ str  ┆ i64 ┆ i64   │
╞═════╪══════╪═════╪═══════╡
│ a   ┆ 1    ┆ 1   ┆ 3     │
│ a   ┆ 1    ┆ 2   ┆ 3     │
│ b   ┆ 2    ┆ 1   ┆ 3     │
│ b   ┆ 2    ┆ 2   ┆ 3     │
│ c   ┆ null ┆ 1   ┆ 2     │
│ c   ┆ null ┆ 1   ┆ 2     │
└─────┴──────┴─────┴───────┘

I suggest returning null when the group key is null, which gives a more desirable outcome when using `.over()`:

┌─────┬──────┬─────┬───────┐
│ a   ┆ b    ┆ c   ┆ c_sum │
│ --- ┆ ---  ┆ --- ┆ ---   │
│ str ┆ str  ┆ i64 ┆ i64   │
╞═════╪══════╪═════╪═══════╡
│ a   ┆ 1    ┆ 1   ┆ 3     │
│ a   ┆ 1    ┆ 2   ┆ 3     │
│ b   ┆ 2    ┆ 1   ┆ 3     │
│ b   ┆ 2    ┆ 2   ┆ 3     │
│ c   ┆ null ┆ 1   ┆ null  │
│ c   ┆ null ┆ 1   ┆ null  │
└─────┴──────┴─────┴───────┘

Currently, adding a filter returns 0 instead of null:

df_pl.with_columns(
    pl.col('c').filter(pl.col('b').is_not_null()).sum()
    .over(['a', 'b']).alias('c_sum')
)
shape: (6, 4)
┌─────┬──────┬─────┬───────┐
│ a   ┆ b    ┆ c   ┆ c_sum │
│ --- ┆ ---  ┆ --- ┆ ---   │
│ str ┆ str  ┆ i64 ┆ i64   │
╞═════╪══════╪═════╪═══════╡
│ a   ┆ 1    ┆ 1   ┆ 3     │
│ a   ┆ 1    ┆ 2   ┆ 3     │
│ b   ┆ 2    ┆ 1   ┆ 3     │
│ b   ┆ 2    ┆ 2   ┆ 3     │
│ c   ┆ null ┆ 1   ┆ 0     │
│ c   ┆ null ┆ 1   ┆ 0     │
└─────┴──────┴─────┴───────┘
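
A when/then mask, by contrast, does produce the suggested null result (a minimal sketch; it relies on when/then defaulting to null when no otherwise branch is given):

df_pl.with_columns(
    # compute the window sum as usual, then null it out for null-key rows;
    # when/then with no otherwise yields null for the unmatched rows
    pl.when(pl.col('b').is_not_null())
    .then(pl.col('c').sum().over(['a', 'b']))
    .alias('c_sum')
)
# c_sum: 3, 3, 3, 3, null, null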

pandas' default behavior is to drop null keys:

# per-group sums; the null group is dropped entirely
df_pd.groupby(['a', 'b']).c.sum()

# broadcast the per-group sum back onto every row (the analogue of over)
df_pd.assign(
    c_sum=df_pd.groupby(['a', 'b']).c.transform('sum')
)
a  b
a  1    3
b  2    3
Name: c, dtype: int64

   a     b  c  c_sum
0  a     1  1    3.0
1  a     1  2    3.0
2  b     2  1    3.0
3  b     2  2    3.0
4  c  None  1    NaN
5  c  None  1    NaN
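
For completeness: pandas 1.1+ exposes this choice as a parameter, which is essentially what this issue asks Polars to add. A sketch:

# dropna=False keeps the null key as its own group
# (it shows up as NaN in the resulting index)
df_pd.groupby(['a', 'b'], dropna=False).c.sum()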

The bottom line is that Polars currently gives us no way to control how null group keys are handled at the `group_by()` level.
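
The closest stopgap is to drop the null-key rows before aggregating, which mimics pandas' default but removes those rows from the output entirely (a sketch):

# pre-filter: null-key rows never reach the aggregation,
# so they are simply absent from the result rather than null
df_pl.filter(pl.col('b').is_not_null()).group_by(['a', 'b']).agg(
    pl.col('c').sum().alias('c_sum')
)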

ion-elgreco commented 1 year ago

This ties into the broader discussion on null handling across all Polars APIs: https://github.com/pola-rs/polars/issues/10016

I would say it's the biggest inconsistency I've seen while using Polars.

alexander-beedie commented 1 year ago

> pandas' default behavior is to drop null keys

We should definitely not do the same ;)

The fact that a key is null doesn't mean it's reasonable for the rest of the data associated with that key to vanish after a group operation. If you look at SQL databases, for example, you'll find that they all treat null as a real key during GROUP BY operations and associate data accordingly; pandas is actually something of an outlier here.
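
This is easy to verify with the standard-library sqlite3 module (a minimal sketch; SQLite, like other SQL engines, keeps NULL keys as a real group):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a TEXT, b TEXT, c INTEGER)")
con.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [('a', '1', 1), ('a', '1', 2), ('b', '2', 1),
     ('b', '2', 2), ('c', None, 1), ('c', None, 1)],
)
# the NULL keys form one group, exactly as Polars does today
for row in con.execute("SELECT a, b, SUM(c) FROM t GROUP BY a, b ORDER BY a"):
    print(row)
# ('a', '1', 3)
# ('b', '2', 3)
# ('c', None, 2)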

For some bonus context: when I worked at JPMorgan, this pandas behaviour was identified as a serious risk as soon as people realised it wasn't following database norms, and various internal data APIs were actively sanitised against it, alongside expressions of disbelief that (a) this was the default, and (b) there appeared to be no way to disable the behaviour (this was a few years ago; the `dropna` param didn't exist yet).

stevenlis commented 1 year ago

@alexander-beedie Thanks for the detailed explanation. In my opinion, "null" indicates that the group is simply unknown or missing. In other words, results aggregated into the "null" group are not valid, because it is not an actual group. For instance, while cleaning wage data, I found that some industry (NAICS) codes were missing, which does not imply that those records all belong to the same industry. From what I observe, pandas returns the other non-null groups as expected, and I don't see any problem with that. Even if dropping nulls shouldn't be the default, there should at least be a parameter letting users choose.
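
To make the request concrete, the parameter might look something like this (purely hypothetical; `drop_nulls` is the name from the issue title, not an existing argument in either API):

# Hypothetical signatures sketching this proposal; not implemented in Polars:
df_pl.group_by(['a', 'b'], drop_nulls=True).agg(pl.col('c').sum())
df_pl.with_columns(
    pl.col('c').sum().over(['a', 'b'], drop_nulls=True).alias('c_sum')
)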