Open nglvnt opened 1 month ago
Can reproduce.
It seems it's just pl.Object in general.
import polars as pl

example_df = pl.DataFrame(
    {
        "col_0": [0] * 1002,
        "col_1": [0] * 1002,
        "col_2": [0] * 1002,
        "col_3": [0] * 1002,
        "col_4": [0] * 1002,
        "col_5": [0] * 1002,
        "col_6": ["X", "X"] + ["Y"] * 1000,
        "col_7": ["A", "B"] + ["B"] * 1000,
    },
    schema_overrides={"col_7": pl.Object},
)
(
    example_df
    .filter(pl.first() == pl.first())
    .group_by(pl.first())
    .len()
)
thread '<unnamed>' panicked at crates/polars-core/src/series/from.rs:117:22:
called `Option::unwrap()` on a `None` value
PanicException: called `Option::unwrap()` on a `None` value
True! I was so deep in Enum that I forgot to check it more generally. With this, the reprex can be simplified further:
import polars as pl

example_df = pl.DataFrame(
    {
        "col_0": [0] * 1000,
        "col_1": [0] * 1000,
        "col_2": [0] * 1000,
        "col_3": [0] * 1000,
        "col_4": [0] * 1000,
        "col_5": [0] * 1000,
        "col_6": [0] * 1000,
        "col_7": [0] * 1000,
    },
    schema_overrides={"col_7": pl.Object},
)
(
    example_df
    .filter(pl.first() == pl.first())
    .group_by(pl.first())
    .len()
)
The same holds as before: with one fewer row, it works; with one fewer column, it works.
Also, a column has to be referenced in both the filter and the group_by for the panic to occur, as the following two examples run successfully:
(
    example_df
    .filter(True)  # or 0 == 0
    .group_by(pl.first())
    .len()
)

(
    example_df
    .filter(pl.first() == pl.first())
    .group_by(0)
    .len()
)
Just on the topic of Enums, is there no way to go from enum.Enum -> pl.Enum directly?
This seems to "work", but I thought there was a simpler way.
pl.col("col_7").cast(pl.Enum(Example.__members__.keys()))
Seems to work fine on main
I have updated Polars to 1.13.1, but still get the same error.
Edit: Also tried out with 1.14.0, still got the same error.
One more comment: I can make it run if I rechunk the dataframe after filtering.
import polars as pl

example_df = pl.DataFrame(
    {
        "col_0": [0] * 1000,
        "col_1": [0] * 1000,
        "col_2": [0] * 1000,
        "col_3": [0] * 1000,
        "col_4": [0] * 1000,
        "col_5": [0] * 1000,
        "col_6": [0] * 1000,
        "col_7": [0] * 1000,
    },
    schema_overrides={"col_7": pl.Object},
)
(
    example_df
    .filter(pl.first() == pl.first())
    .rechunk()  # workaround: merge chunks before grouping
    .group_by(pl.first())
    .len()
)
shape: (1, 2)
┌───────┬──────┐
│ col_0 ┆ len │
│ --- ┆ --- │
│ i64 ┆ u32 │
╞═══════╪══════╡
│ 0 ┆ 1000 │
└───────┴──────┘
Just on the topic of Enums, is there no way to go from enum.Enum -> pl.Enum directly?
This seems to "work", but I thought there was a simpler way.
pl.col("col_7").cast(pl.Enum(Example.__members__.keys()))
I am not sure I understand which part of the conversion could be simplified further. Getting the member names can be achieved in different ways ([c.name for c in Example] or the undocumented _member_names_ attribute), but to me the conversion is already as simple as it can be.
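For reference, the ways of collecting member names mentioned above can be compared using only the standard library. This is a small sketch: the `Example` enum here is a hypothetical stand-in with two members, `A` and `B`, since the original class definition is not shown in this thread.

```python
import enum


class Example(enum.Enum):
    # Hypothetical stand-in for the enum used in the issue.
    A = 1
    B = 2


# Three ways to obtain the member names, in definition order.
names_from_iteration = [member.name for member in Example]
names_from_members = list(Example.__members__.keys())
names_from_sunder = list(Example._member_names_)  # private attribute, may change

print(names_from_iteration, names_from_members, names_from_sunder)
```

Any of these lists of strings could then be passed to `pl.Enum(...)` as in the cast shown earlier.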
I thought maybe .cast(pl.Enum(Example)) could work.
It seems it was added for string enums: https://github.com/pola-rs/polars/issues/19724
(Also, can confirm it still panics on main for me.)
Checks
Reproducible example
Log output
Issue description
There seems to be an error when filtering and then aggregating a dataframe that has a column containing Python Enum objects (with polars.Object as the datatype of the column). My guess is that the problem arises because the Enum has 2 members, both of which are initially present in the dataframe, and one of them gets filtered out, which later causes trouble when aggregating.
However, it also does not work if we change the order of enum conversion and filtering:
This gives the same error as in the log output.
It may also be useful to know that the example above is minimal with respect to the number of rows and columns:
Expected behavior
The query just runs without an error.
Installed versions