pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Broken and inconsistent API for dealing with Categorical variables #17576

Closed: u3Izx9ql7vW4 closed 1 month ago

u3Izx9ql7vW4 commented 1 month ago

Reproducible example

import polars as pl
pl.enable_string_cache()

df = pl.DataFrame({'foo': ['a', 'b']}, schema={'foo': pl.Categorical})
df2 = df.filter(
    pl.col('foo').str.contains('a')
)

Log output

Traceback (most recent call last):
  File "/Users/xyz/Library/Application Support/JetBrains/PyCharmCE2024.1/scratches/scratch_31.py", line 5, in <module>
    df2 = df.filter(
          ^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/polars/dataframe/frame.py", line 4092, in filter
    return self.lazy().filter(*predicates, **constraints).collect(_eager=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/polars/utils/deprecation.py", line 100, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 1788, in collect
    return wrap_df(ldf.collect())
                   ^^^^^^^^^^^^^
polars.exceptions.SchemaError: invalid series dtype: expected `Utf8`, got `cat`

Issue description

I'm not able to filter a Categorical column by its string value. I've tried each of the following ways of enabling the string cache, without success:

# Attempt 1
pl.toggle_string_cache(True)

# Attempt 2
with pl.StringCache():
    df2 = df.filter(
        pl.col('foo').str.contains('a')
    )

# Attempt 3
pl.Config.set_global_string_cache()

It seems like the API changes every few months. It's a bit humorous that Stack Overflow has comments saying it's X, and then only a couple of months later it's Y. Can we just pick one and stick with it?

Expected behavior

Filter a column based on its string value.

Installed versions

```
Platform: macOS-14.5-arm64-arm-64bit
Python: 3.11.9 (main, Apr 2 2024, 08:25:04) [Clang 15.0.0 (clang-1500.3.9.4)]

----Optional dependencies----
adbc_driver_manager:
cloudpickle: 3.0.0
connectorx:
deltalake:
fsspec: 2024.2.0
gevent:
matplotlib: 3.8.0
numpy: 1.26.4
openpyxl: 3.1.2
pandas: 2.2.2
pyarrow: 16.1.0
pydantic: 2.5.3
pyiceberg:
pyxlsb:
sqlalchemy: 2.0.30
xlsx2csv:
xlsxwriter: None
```

s-banach commented 1 month ago

Did it ever work to use the Expr.str namespace on categorical columns?

cmdlineluser commented 1 month ago

From what I understand, it has yet to be implemented: