pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Crashes and false values from `CategoricalSeries.unique()` #19409

Open s-banach opened 4 hours ago

s-banach commented 4 hours ago

Reproducible example

import polars as pl

# 50 distinct, non-null strings cast to Categorical
df = pl.DataFrame({"x": [str(n) for n in range(50)]}).cast(pl.Categorical)
print(df.unique().null_count())

shape: (1, 1)
┌─────┐
│ x   │
│ --- │
│ u32 │
╞═════╡
│ 2   │
└─────┘

Log output

No response

Issue description

Nulls appear in the result of `unique()` even though the input series contains none. Depending on the categorical series I test with, Python sometimes exits without printing anything. The bug does not seem to be present in version 1.10.

Expected behavior

Don't add nulls, don't crash.

Installed versions

```
--------Version info---------
Polars:              1.11.0
Index type:          UInt32
Platform:            Windows-11-10.0.22631-SP0
Python:              3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:42:31) [MSC v.1937 64 bit (AMD64)]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager
altair
cloudpickle
connectorx
deltalake
fastexcel            0.10.4
fsspec               2024.6.1
gevent
great_tables         0.9.0
matplotlib           3.8.4
nest_asyncio         1.6.0
numpy                2.1.0
openpyxl             3.1.5
pandas               2.2.2
pyarrow              17.0.0
pydantic             2.9.2
pyiceberg
sqlalchemy           2.0.35
torch
xlsx2csv
xlsxwriter           3.2.0
```

cmdlineluser commented 2 hours ago

I haven't been able to get a crash, but do get a null_count() of 2 each time.

Just some debugging notes:

It seems to have changed after https://github.com/pola-rs/polars/pull/19359

It also seems to go away if I set POLARS_MAX_THREADS=1
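
For anyone who wants to compare the two settings side by side, here is a rough sketch that runs the reproduction from the report in a fresh interpreter, once with the default thread count and once with `POLARS_MAX_THREADS=1`. The helper names (`env_with_threads`, `run_repro`) are made up for this note and are not part of polars. It assumes polars is installed in the interpreter that runs the script. A child process is used so that the variable is set before polars is imported, and so that a hard crash shows up as a non-zero exit code instead of taking down the parent.

```python
import os
import subprocess
import sys

# The reproduction from the original report, run in a child process so the
# environment variable is in place before polars is imported.
REPRO = """
import polars as pl
df = pl.DataFrame({"x": [str(n) for n in range(50)]}).cast(pl.Categorical)
print(df.unique().null_count().item())
"""


def env_with_threads(max_threads=None):
    """Copy the current environment, optionally pinning POLARS_MAX_THREADS (hypothetical helper)."""
    env = dict(os.environ)
    if max_threads is None:
        # Leave the thread count at the polars default.
        env.pop("POLARS_MAX_THREADS", None)
    else:
        env["POLARS_MAX_THREADS"] = str(max_threads)
    return env


def run_repro(max_threads=None):
    """Run REPRO in a fresh interpreter and return (exit code, stdout, stderr)."""
    proc = subprocess.run(
        [sys.executable, "-c", REPRO],
        env=env_with_threads(max_threads),
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout.strip(), proc.stderr.strip()


if __name__ == "__main__":
    for threads in (None, 1):
        code, out, _err = run_repro(threads)
        label = "default" if threads is None else str(threads)
        print(f"POLARS_MAX_THREADS={label}: exit code {code}, null_count {out!r}")
```

If the single-threaded run reports a null count of 0 while the default run does not, that would be consistent with the observation above, though it does not by itself show where the problem lies.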