pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Filter-and-aggregate error with Object column #19085

Open nglvnt opened 1 month ago

nglvnt commented 1 month ago

Reproducible example

from enum import Enum
import polars as pl

example_df = pl.DataFrame(
    {
        "col_0": [0] * 1002,
        "col_1": [0] * 1002,
        "col_2": [0] * 1002,
        "col_3": [0] * 1002,
        "col_4": [0] * 1002,
        "col_5": [0] * 1002,
        "col_6": ["X", "X"] + ["Y"] * 1000,
        "col_7": ["A", "B"] + ["B"] * 1000,
    }
)

class Example(Enum):
    A = 0
    B = 1

(
    example_df
    .with_columns(pl.col("col_7").map_elements(lambda x: Example[x], return_dtype=pl.Object))
    .filter(pl.col("col_6") == "Y")
    .group_by("col_0") # throws error with all columns, even col_6 and col_7
    .len()
)

Log output

dataframe filtered
estimated unique values: 1
run PARTITIONED HASH AGGREGATION
thread '<unnamed>' panicked at crates\polars-core\src\series\from.rs:117:22:
called `Option::unwrap()` on a `None` value
Traceback (most recent call last):
  File "<stdin>", line 6, in <module>
  File "C:\projects\IIM\.venv\lib\site-packages\polars\dataframe\group_by.py", line 469, in len
    return self.agg(len_expr)
  File "C:\projects\IIM\.venv\lib\site-packages\polars\dataframe\group_by.py", line 229, in agg
    self.df.lazy()
  File "C:\projects\IIM\.venv\lib\site-packages\polars\lazyframe\frame.py", line 2050, in collect
    return wrap_df(ldf.collect(callback))
pyo3_runtime.PanicException: called `Option::unwrap()` on a `None` value

Issue description

Filtering and then aggregating panics when a column contains Python Enum objects (with polars.Object as the column's datatype). My guess is that the problem arises because the Enum has 2 members, both initially present in the dataframe, and one of them gets filtered out, which later causes trouble during aggregation.

However, it also does not work if we change the order of enum conversion and filtering:

(
    example_df
    .filter(pl.col("col_6") == "Y")
    .with_columns(pl.col("col_7").map_elements(lambda x: Example[x], return_dtype=pl.Object))
    .group_by("col_0") # throws error with all columns, even col_6 and col_7
    .len()
)

This gives the same error as in the log output.

It may also be useful to know that the example above is minimal in terms of row and column count. With one fewer row or one fewer column, the query runs:

(
    example_df[0:1001]
    .with_columns(pl.col("col_7").map_elements(lambda x: Example[x], return_dtype=pl.Object))
    .filter(pl.col("col_6") == "Y")
    .group_by("col_0")
    .len()
)

dataframe filtered
DATAFRAME < 1000 rows: running default HASH AGGREGATION
shape: (1, 2)
┌───────┬─────┐
│ col_0 ┆ len │
│ ---   ┆ --- │
│ i64   ┆ u32 │
╞═══════╪═════╡
│ 0     ┆ 999 │
└───────┴─────┘

(
    example_df
    .drop("col_0")
    .with_columns(pl.col("col_7").map_elements(lambda x: Example[x], return_dtype=pl.Object))
    .filter(pl.col("col_6") == "Y")
    .group_by("col_1")
    .len()
)

dataframe filtered
estimated unique values: 1
run PARTITIONED HASH AGGREGATION
group_by keys are sorted; running sorted key fast path
shape: (1, 2)
┌───────┬──────┐
│ col_1 ┆ len  │
│ ---   ┆ ---  │
│ i64   ┆ u32  │
╞═══════╪══════╡
│ 0     ┆ 1000 │
└───────┴──────┘

Expected behavior

The query just runs without an error.

Installed versions

```
--------Version info---------
Polars:              1.9.0
Index type:          UInt32
Platform:            Windows-10-10.0.19045-SP0
Python:              3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager
altair               4.2.2
cloudpickle
connectorx           0.3.2
deltalake
fastexcel
fsspec               2023.10.0
gevent
great_tables
matplotlib           3.8.1
nest_asyncio         1.5.8
numpy                1.26.4
openpyxl             3.1.2
pandas               2.2.1
pyarrow              12.0.1
pydantic             1.10.13
pyiceberg
sqlalchemy           2.0.23
torch
xlsx2csv
xlsxwriter
```

cmdlineluser commented 1 month ago

Can reproduce.

It seems it's just pl.Object in general.

import polars as pl

example_df = pl.DataFrame(
    {
        "col_0": [0] * 1002,
        "col_1": [0] * 1002,
        "col_2": [0] * 1002,
        "col_3": [0] * 1002,
        "col_4": [0] * 1002,
        "col_5": [0] * 1002,
        "col_6": ["X", "X"] + ["Y"] * 1000,
        "col_7": ["A", "B"] + ["B"] * 1000,
    },
    schema_overrides={"col_7": pl.Object}
)

(
    example_df
    .filter(pl.first() == pl.first())
    .group_by(pl.first())
    .len()
)

thread '<unnamed>' panicked at crates/polars-core/src/series/from.rs:117:22:
called `Option::unwrap()` on a `None` value
PanicException: called `Option::unwrap()` on a `None` value
nglvnt commented 1 month ago

True! I was so deep in Enum that I forgot to check it more generally. With this, the reprex can be simplified further:

import polars as pl

example_df = pl.DataFrame(
    {
        "col_0": [0] * 1000,
        "col_1": [0] * 1000,
        "col_2": [0] * 1000,
        "col_3": [0] * 1000,
        "col_4": [0] * 1000,
        "col_5": [0] * 1000,
        "col_6": [0] * 1000,
        "col_7": [0] * 1000,
    },
    schema_overrides={"col_7": pl.Object}
)

(
    example_df
    .filter(pl.first() == pl.first())
    .group_by(pl.first())
    .len()
)

The same holds as before: with one fewer row, it works; with one fewer column, it works.

Also, a column has to be referenced in both the filter and the group_by for the panic to occur, as the following two examples run successfully:

(
    example_df
    .filter(True) # or 0 == 0
    .group_by(pl.first())
    .len()
)

(
    example_df
    .filter(pl.first() == pl.first())
    .group_by(0)
    .len()
)
cmdlineluser commented 1 month ago

Just on the topic of Enums, is there no way to go from enum.Enum -> pl.Enum directly?

This seems to "work", but I thought there was a simpler way.

pl.col("col_7").cast(pl.Enum(Example.__members__.keys()))
gab23r commented 1 week ago

Seems to work fine on main

nglvnt commented 1 week ago

> Seems to work fine on main

I have updated Polars to 1.13.1, but still get the same error.

Edit: Also tried out with 1.14.0, still got the same error.

nglvnt commented 1 week ago

One more comment: I can make it run if I rechunk the dataframe after filtering.

example_df = pl.DataFrame(
    {
        "col_0": [0] * 1000,
        "col_1": [0] * 1000,
        "col_2": [0] * 1000,
        "col_3": [0] * 1000,
        "col_4": [0] * 1000,
        "col_5": [0] * 1000,
        "col_6": [0] * 1000,
        "col_7": [0] * 1000,
    },
    schema_overrides={"col_7": pl.Object}
)

(
    example_df
    .filter(pl.first() == pl.first())
    .rechunk()
    .group_by(pl.first())
    .len()
)

shape: (1, 2)
┌───────┬──────┐
│ col_0 ┆ len  │
│ ---   ┆ ---  │
│ i64   ┆ u32  │
╞═══════╪══════╡
│ 0     ┆ 1000 │
└───────┴──────┘
nglvnt commented 1 week ago

> Just on the topic of Enums, is there no way to go from enum.Enum -> pl.Enum directly?
>
> This seems to "work", but I thought there was a simpler way.
>
> pl.col("col_7").cast(pl.Enum(Example.__members__.keys()))

I am not sure I understand which part of the conversion could be simplified further. Getting the member names can be achieved in different ways ([c.name for c in Example] or with the undocumented _member_names_ attribute), but for me, the conversion is as simple as it can be.
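
For reference, the different ways of getting the member names can be compared using only the standard library. A minimal sketch, reusing the Example enum from the original report:

```python
from enum import Enum


class Example(Enum):
    A = 0
    B = 1


# Three ways to list the member names, in definition order.
names_from_members = list(Example.__members__.keys())
names_from_iteration = [c.name for c in Example]
names_from_private = list(Example._member_names_)  # undocumented attribute

print(names_from_members)
print(names_from_iteration)
print(names_from_private)
```

One difference to keep in mind: `__members__` also contains aliases (names that share a value with an earlier member), whereas iterating over the enum skips them. For Example, which has no aliases, all three give the same list.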

cmdlineluser commented 2 days ago

I thought maybe .cast(pl.Enum(Example)) could work.

It seems it was added for string enums: https://github.com/pola-rs/polars/issues/19724

(Also, can confirm it still panics on main for me.)