pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Behaviour change of Expr.list.drop_nulls for structs with all None fields #18230

Open rebeccafair opened 2 months ago

rebeccafair commented 2 months ago

Reproducible example

Input

import polars as pl

frame = pl.DataFrame([
    {'a': [{'b': 1}, {'b': 2}, {'b': None}]},
    {'a': [{'b': 1}, {'b': 2}, None]}
])

print(frame.select(pl.col('a').list.drop_nulls().list.len()))

Output (Polars 1.1.0)

shape: (2, 1)
┌─────┐
│ a   │
│ --- │
│ u32 │
╞═════╡
│ 2   │
│ 2   │
└─────┘

Output (Polars 1.2.1 - 1.5.0)

shape: (2, 1)
┌─────┐
│ a   │
│ --- │
│ u32 │
╞═════╡
│ 3   │
│ 2   │
└─────┘

Issue description

Between Polars 1.1.0 and 1.2.1, the behaviour of Expr.list.drop_nulls appears to have changed. Previously, a struct in the list whose fields were all null was dropped; from 1.2.1 onwards, an element is only dropped if the entire struct is null.

Is this a bug in Polars >= 1.2.1 or was the < 1.2.1 behaviour incorrect? The documentation for this function hasn't changed to indicate this as far as I can see.

cmdlineluser commented 2 months ago

I think this is a side effect of the struct reimplementation and the previous behaviour was considered incorrect?

Before:

pl.Series([{"a": 1}, None, {"a": None}]).is_null()
# shape: (3,)
# Series: '' [bool]
# [
#   false
#   true
#   true
# ]

After:

pl.Series([{"a": 1}, None, {"a": None}]).is_null()
# shape: (3,)
# Series: '' [bool]
# [
#   false
#   true
#   false
# ]

Which does raise an interesting question:

How does one test if all values in a struct are null?

whom commented 2 months ago

We were updating Polars and while I understand the logic behind this change (a non-null struct/object is technically not null, even if it's just full of null data), this is breaking a lot of our workflows which were built on the assumption that a struct full of nulls would be evaluated to null.

Is this an intended side effect? What is the proposed workaround? Any advice or tips would be greatly appreciated.

We're downgrading to v1.1 in the meantime. Thank you!

As an aside, I'm surprised this was not considered a breaking change.

jcmuel commented 2 months ago

@whom: We apply the following function to struct columns before the structs are put into lists, and call drop_nulls afterwards. (Original code from @rebeccafair.)

def set_struct_with_all_null_fields_to_null(frame: pl.DataFrame, struct_col: str) -> pl.DataFrame:
    """
    Set any structs to null that have all null fields.

    WARNING
    -------
    The function only checks for null in the current struct fields. It doesn't do
    recursive checks on structs inside the struct that could also have all null fields.

    Parameters
    ----------
    frame: pl.DataFrame
        The frame to modify.
    struct_col: str
        The name of the struct column to modify.

    Returns
    -------
    pl.DataFrame
        Modified DataFrame.
    """

    # If any struct field is non-null, then keep the struct, otherwise replace it by null.
    result = frame.with_columns(
        pl.when(pl.any_horizontal(pl.col(struct_col).struct.field('*').is_not_null()))
        .then(pl.col(struct_col))
        .alias(struct_col)
    )

    return result

whom commented 1 month ago

Thank you so much for sharing this! I wanted to pay it forward by sharing a slight tweak so that this function can handle nested structs. I made it controllable via parameter since it can lead to some surprises.

I'm also a newbie at working with Data Frames and Polars, so I'm sure there's a far better way to implement this.

import polars


def set_null_struct_to_null(frame: polars.DataFrame, struct_col: str, recursive: bool) -> polars.DataFrame:
    """Set any structs to null that have all null fields.

    Args:
        frame (polars.DataFrame): The Polars DataFrame to modify.
        struct_col (str): The name of the column to evaluate.
        recursive (bool): Whether to recursively check for null fields in structs inside the struct.

    Returns:
        polars.DataFrame: Modified Polars DataFrame.
    """
    # If the type of the column is not a struct, just return the frame.
    if frame[struct_col].dtype != polars.Struct:
        return frame

    # Iterate across all fields in the struct. Recursively call if the field itself is a struct.
    if recursive:
        unnested_struct = frame[struct_col].struct.unnest()
        for column in unnested_struct.columns:
            if unnested_struct[column].dtype == polars.Struct:
                # Update unnested_struct itself so that fixes to earlier nested fields are kept.
                unnested_struct = set_null_struct_to_null(unnested_struct, column, recursive)
        # Rebuild the struct column from the cleaned fields.
        frame = frame.with_columns(unnested_struct.to_struct(struct_col))

    # Otherwise, convert the struct to null if all fields are null.
    return frame.with_columns(
        polars.when(polars.any_horizontal(polars.col(struct_col).struct.field("*").is_not_null()))
        .then(polars.col(struct_col))
        .alias(struct_col)
    )