Open rebeccafair opened 2 months ago
I think this is a side effect of the struct reimplementation and the previous behaviour was considered incorrect?
Before:
pl.Series([{"a": 1}, None, {"a": None}]).is_null()
# shape: (3,)
# Series: '' [bool]
# [
# false
# true
# true
# ]
After:
pl.Series([{"a": 1}, None, {"a": None}]).is_null()
# shape: (3,)
# Series: '' [bool]
# [
# false
# true
# false
# ]
Which does raise an interesting question:
How does one test if all values in a struct are null?
We ran into this while updating Polars. I understand the logic behind the change (a non-null struct/object is technically not null, even if it contains only null data), but it breaks a lot of our workflows, which were built on the assumption that a struct full of nulls evaluates to null.
Is this an intended side effect? What is the proposed workaround? Any advice or tips would be greatly appreciated.
We're downgrading to v1.1 in the meantime. Thank you!
As an aside, I'm surprised this was not considered a breaking change.
@whom: We apply the following function to struct columns before those structs are put into lists, and only later call drop_nulls. (Original code from @rebeccafair.)
import polars as pl


def set_struct_with_all_null_fields_to_null(frame: pl.DataFrame, struct_col: str) -> pl.DataFrame:
    """
    Set any structs to null that have all null fields.

    WARNING
    -------
    The function only checks for null in the current struct fields. It doesn't do
    recursive checks on structs inside the struct that could also have all null fields.

    Parameters
    ----------
    frame: pl.DataFrame
        The frame to modify.
    struct_col: str
        The name of the struct column to modify.

    Returns
    -------
    pl.DataFrame
        Modified DataFrame.
    """
    # If any struct field is non-null, then keep the struct, otherwise replace it by null.
    result = frame.with_columns(
        pl.when(pl.any_horizontal(pl.col(struct_col).struct.field('*').is_not_null()))
        .then(pl.col(struct_col))
        .alias(struct_col)
    )
    return result
Thank you so much for sharing this! I wanted to pay it forward by sharing a slight tweak so that this function can handle nested structs. I made it controllable via parameter since it can lead to some surprises.
I'm also a newbie at working with Data Frames and Polars, so I'm sure there's a far better way to implement this.
import polars


def set_null_struct_to_null(frame: polars.DataFrame, struct_col: str, recursive: bool) -> polars.DataFrame:
    """Set any structs to null that have all null fields.

    Args:
        frame (polars.DataFrame): The Polars DataFrame to modify.
        struct_col (str): The name of the column to evaluate.
        recursive (bool): Whether to recursively check for null fields in structs inside the struct.

    Returns:
        polars.DataFrame: Modified Polars DataFrame.
    """
    # If the type of the column is not a struct, just return the frame.
    if frame[struct_col].dtype != polars.Struct:
        return frame

    # Iterate across all fields in the struct. Recursively call if the field itself is a struct.
    if recursive:
        unnested_struct = frame[struct_col].struct.unnest()
        for column in unnested_struct.columns:
            if unnested_struct[column].dtype == polars.Struct:
                # Reassign unnested_struct so that cleaning one nested field
                # is not undone when a later nested field is processed.
                unnested_struct = unnested_struct.with_columns(
                    set_null_struct_to_null(unnested_struct[column].to_frame(), column, recursive).to_series()
                )
        # Rebuild the outer struct once from the cleaned fields.
        frame = frame.with_columns(unnested_struct.to_struct(struct_col))

    # Finally, convert the struct to null if all fields are null.
    return frame.with_columns(
        polars.when(polars.any_horizontal(polars.col(struct_col).struct.field("*").is_not_null()))
        .then(polars.col(struct_col))
        .alias(struct_col)
    )
Reproducible example
Input
Output (Polars 1.1.0)
Output (Polars 1.2.1 - 1.5.0)
Issue description
Between Polars 1.1.0 and 1.2.1, the behaviour of Expr.list.drop_nulls appears to have changed. Previously, a struct inside a list was dropped if all of its fields were null; from 1.2.1 onward, it appears to be dropped only if the struct itself is null.
Is this a bug in Polars >= 1.2.1 or was the < 1.2.1 behaviour incorrect? The documentation for this function hasn't changed to indicate this as far as I can see.