pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

all_horizontal and any_horizontal allow Categoricals and Enums #14247

Open Wainberg opened 9 months ago

Wainberg commented 9 months ago

Reproducible example

>>> pl.DataFrame({'a': ['a'], 'b': ['b']}, schema={'a': pl.Categorical, 'b': pl.Categorical}).select(pl.all_horizontal('a', 'b'))
shape: (1, 1)
┌───────┐
│ a     │
│ ---   │
│ bool  │
╞═══════╡
│ false │
└───────┘
>>> pl.DataFrame({'a': ['a'], 'b': ['b']}, schema={'a': pl.Enum(['a']), 'b': pl.Enum(['b'])}).select(pl.all_horizontal('a', 'b'))
shape: (1, 1)
┌───────┐
│ a     │
│ ---   │
│ bool  │
╞═══════╡
│ false │
└───────┘
>>> pl.DataFrame({'a': ['a'], 'b': ['b']}, schema={'a': pl.Enum(['q', 'a']), 'b': pl.Enum(['q', 'b'])}).select(pl.all_horizontal('a', 'b'))
shape: (1, 1)
┌───────┐
│ a     │
│ ---   │
│ bool  │
╞═══════╡
│ true  │
└───────┘

Log output

No response

Issue description

all_horizontal and any_horizontal allow Categoricals and Enums, interpreting them as booleans based on whether their underlying integer physical representation is equal to 0.
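
To make the mechanism concrete, here is a rough pure-Python model of what seems to be happening. It is only a sketch and does not use Polars internals: `physical_codes` and `all_horizontal_on_codes` are hypothetical names, and the assumption that a Categorical without a global string cache numbers its categories in order of first appearance is mine.

```python
def physical_codes(values, categories=None):
    """Map strings to integer codes (a stand-in for the physical representation).

    With explicit categories (Enum-like), the code is the position in that list.
    Without them (Categorical-like, assumed per-column mapping), codes are
    assigned in order of first appearance.
    """
    if categories is None:
        categories = list(dict.fromkeys(values))
    index = {cat: i for i, cat in enumerate(categories)}
    return [index[v] for v in values]


def all_horizontal_on_codes(*columns):
    """Row-wise AND over columns of integer codes, treating code 0 as false."""
    return [all(code != 0 for code in row) for row in zip(*columns)]


# Categorical: 'a' and 'b' are each the first category of their own column (code 0).
print(all_horizontal_on_codes(physical_codes(["a"]), physical_codes(["b"])))  # [False]

# Enum(['a']) and Enum(['b']): both values are again code 0.
print(all_horizontal_on_codes(physical_codes(["a"], ["a"]), physical_codes(["b"], ["b"])))  # [False]

# Enum(['q', 'a']) and Enum(['q', 'b']): both values are code 1, hence "true".
print(all_horizontal_on_codes(physical_codes(["a"], ["q", "a"]), physical_codes(["b"], ["q", "b"])))  # [True]
```

The three calls mirror the three cases in the reproducible example: the result depends only on where a value sits in its category list, not on the value itself.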

Expected behavior

This should be an error.

Installed versions

```
--------Version info---------
Polars:               0.20.6
Index type:           UInt32
Platform:             Linux-4.4.0-22621-Microsoft-x86_64-with-glibc2.35
Python:               3.12.1 | packaged by conda-forge | (main, Dec 23 2023, 08:03:24) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fsspec:
gevent:
hvplot:
matplotlib:           3.8.2
numpy:                1.26.3
openpyxl:             3.1.2
pandas:               2.2.0
pyarrow:              14.0.2
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:             0.8.1
xlsxwriter:           3.1.9
```

ritchie46 commented 9 months ago

Yes, this should raise indeed.

reswqa commented 9 months ago

Furthermore, why do we allow casting from Categorical to Boolean in the first place, given that casting from String to Boolean is not allowed? The result is always false for items equal to the first category (which is physically 0) and true for the rest, which seems pointless.

mcrumiller commented 9 months ago

FYI this seems to be true for all logical dtypes, not just cats/enums:

import polars as pl
from datetime import date, datetime, time, timedelta

df = pl.DataFrame({
    "date":     [date(2024, 1, 1)],
    "datetime": [datetime(2024, 1, 1)],
    "duration": [timedelta(1)],
    "time":     [time(1)],
    "cat":      pl.Series(["a"], dtype=pl.Categorical),
    "enum":     pl.Series(["a"], dtype=pl.Enum(["a"]))
})
df.select(pl.all_horizontal(pl.all()))
shape: (1, 1)
┌───────┐
│ date  │
│ ---   │
│ bool  │
╞═══════╡
│ false │
└───────┘