pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`join` + `group_by` segfault with Enum type #19950

Closed cmdlineluser closed 5 hours ago

cmdlineluser commented 23 hours ago

Reproducible example

import polars as pl

dtype = pl.Enum(categories=["a", "b", "c"])

l = pl.DataFrame({"x": "a"}).cast(dtype)
r = pl.DataFrame({"x": "a", "y": "b"}).cast(dtype)

l.join(r, on="x").group_by("y").first()

Log output

join parallel: true
inner join: keys are sorted: use sorted merge join
INNER join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
grouping categoricals, run perfect hash function
zsh: segmentation fault

Issue description

Query causes a segfault.

Expected behavior

Run query without error.
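
For anyone checking whether a given Polars build is affected, here is a minimal sketch that runs the reproducer above in a child interpreter, so a crash does not take down the calling session. It uses only the standard library in the parent process. `run_repro` and `describe` are hypothetical helper names, not part of Polars, and the `print(...)` around the query is added only so the child produces output. The negative-return-code check assumes a POSIX platform such as the macOS machine in this report.

```python
import signal
import subprocess
import sys
import textwrap

# The reproducer from this issue, run in a child interpreter so that a
# crash cannot take down the calling process.
REPRO = textwrap.dedent(
    """
    import polars as pl

    dtype = pl.Enum(categories=["a", "b", "c"])
    l = pl.DataFrame({"x": "a"}).cast(dtype)
    r = pl.DataFrame({"x": "a", "y": "b"}).cast(dtype)
    print(l.join(r, on="x").group_by("y").first())
    """
)


def run_repro(code: str = REPRO) -> subprocess.CompletedProcess:
    """Run `code` in a fresh Python process and capture its output."""
    return subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
    )


def describe(returncode: int) -> str:
    """Map a child exit status to a short label (hypothetical helper)."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative return code means the child was killed
        # by the signal with that number.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    result = run_repro()
    print(describe(result.returncode))
    print(result.stdout or result.stderr)
```

On an affected build the label should report a signal rather than "ok"; a non-zero positive status more likely means the child failed for another reason, such as Polars not being importable.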

Installed versions

```
--------Version info---------
Polars:              1.14.0
Index type:          UInt32
Platform:            macOS-13.6.1-arm64-arm-64bit-Mach-O
Python:              3.13.0 (main, Oct 7 2024, 05:02:14) [Clang 15.0.0 (clang-1500.1.0.2.5)]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager
altair
boto3
cloudpickle
connectorx
deltalake
fastexcel            0.12.0
fsspec
gevent
google.auth
great_tables         0.14.0
matplotlib
nest_asyncio
numpy                2.1.3
openpyxl             3.1.5
pandas               2.2.3
pyarrow              18.0.0
pydantic
pyiceberg
sqlalchemy
torch
xlsx2csv
xlsxwriter           3.2.0
```