Open stevenlis opened 1 year ago
This ties into broader discussion on null handling across all polars APIs: https://github.com/pola-rs/polars/issues/10016
I would say it's the biggest inconsistency I'm seeing while using polars.
Pandas default behavior would drop nulls
We should definitely not do the same ;)
The fact that a key is null doesn't mean that it's reasonable for the rest of the data associated with that key to vanish following a group operation. If you take a look at SQL databases, for example, you'll find that they all treat `null` as a real key during `group by` operations and associate data accordingly - pandas is actually something of an outlier here.
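To make the database comparison concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `wages` table and its rows are made up for illustration, and SQLite stands in for "SQL databases" generally; other engines are not shown here.

```python
import sqlite3

# In-memory database with a hypothetical wages table (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wages (industry TEXT, wage REAL)")
conn.executemany(
    "INSERT INTO wages VALUES (?, ?)",
    [("retail", 10.0), ("retail", 20.0), (None, 5.0), (None, 7.0)],
)

# GROUP BY puts every NULL key into one group instead of dropping the rows.
rows = conn.execute(
    "SELECT industry, SUM(wage) FROM wages GROUP BY industry"
).fetchall()
conn.close()

totals = dict(rows)
print(totals)
```

The rows with a `NULL` industry come back as one group with their own sum, rather than disappearing from the result.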
For some bonus context: when I worked at JPMorgan this pandas behaviour was actually identified as a serious risk as soon as people realised that it wasn't following database norms, and various internal data APIs were actively sanitised against it, alongside expressions of disbelief that (a) this was the default, and (b) there appeared to be no way to disable that behaviour (this was a few years ago, the `dropna` param didn't exist yet).
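For readers who haven't run into it, the pandas behaviour and the `dropna` parameter mentioned above can be sketched like this. The frame and the column names `key` and `value` are invented for the example.

```python
import pandas as pd

# Two rows have a missing group key (illustrative data only).
df = pd.DataFrame({"key": ["a", "a", None, None], "value": [1, 2, 3, 4]})

# Default (dropna=True): rows whose key is missing are silently excluded.
dropped = df.groupby("key")["value"].sum()

# dropna=False keeps the missing key as a group of its own.
kept = df.groupby("key", dropna=False)["value"].sum()

print(dropped)
print(kept)
```

With the default, the values 3 and 4 contribute to no group at all; with `dropna=False` they are summed under the missing key.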
@alexander-beedie Thanks for the detailed explanation. In my opinion, "null" indicates that the group is simply unknown or missing. In other words, results aggregated into the "null" group are not valid because it is not an actual group. For instance, while cleaning wage data, I found that some industry/NAICS codes were missing, which does not imply that those rows all belong to the same industry. From what I observe, pandas returns the other non-null groups as expected, so I don't see any problem there. Whether or not dropping nulls is the default, there should at least be a parameter that lets users choose.
Problem description
As of polars `'0.19.1'`, `group_by` and `expr.over()` treat a `null` value as a separate and valid group key. The `null` "group" is included in the results.
I suggest returning the results as `null` when the group key is `null`, for more desirable outcomes when using `.over()`.
Currently, adding a filter will return `0` instead of `null`.
Pandas's default behavior would drop `nulls`.
The bottom line is that currently in Polars, we don't have a way to control the group key at the `group_by` level.