unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Columns converted to `str` if there is one `Category` column and `coerce = True` #1806

Open antonioalegria opened 2 months ago

antonioalegria commented 2 months ago

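Describe the bug

When a polars schema contains a `Category` column and `coerce = True`, validation converts the other columns to `str`, so their dtype checks then fail.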

Code Sample, a copy-pastable example

import datetime

import polars as pl
from pandera.engines.polars_engine import Category
from pandera.polars import DataFrameModel, Field  # type: ignore

class MyModel(DataFrameModel):
    a: datetime.datetime = Field(description="some description")
    b: Category = Field(description="some description", dtype_kwargs={"categories": ["a", "b", "c"]})

df = pl.DataFrame({"a": [datetime.datetime.now(), datetime.datetime.now()], "b": ["a", "b"]})
schema = MyModel.to_schema()
schema.strict = True
schema.coerce = True  # enabling coercion is what triggers the conversion of `a` to string
print(schema.validate(df))  # raises pandera.errors.SchemaError

Exception: pandera.errors.SchemaError: expected column 'a' to have type Datetime(time_unit='us', time_zone=None), got String

Stepping through validation in a debugger shows that column a is converted to str immediately after column b is coerced.

Expected behavior

The dataframe should pass validation with column a left as Datetime. Only column b should be coerced, to a categorical type.

Desktop

OS: macOS 14.6.1
Python: 3.12.4
polars-lts-cpu: 1.6.0
pandera: 0.20.3

antonioalegria commented 2 months ago

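Workaround: declare the column as the polars categorical dtype (`pl.Categorical`) in the schema and restrict its values with the `isin` check, instead of using pandera's `Category` type.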

Ideally we would be able to use Literal['a', 'b', 'c'] as the column type in the model, but that doesn't seem to be supported (another error is thrown when converting the model to schema).
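Until something like that is supported, the allowed values can at least be kept in a single `Literal` and read back with `typing.get_args`, for example to feed an `isin` check. The sketch below uses only the standard library; the alias `Letter` and both helper functions are hypothetical names, not part of pandera.

```python
from typing import Literal, get_args

# Hypothetical alias holding the allowed categories in one place.
Letter = Literal["a", "b", "c"]


def allowed_values(literal_type) -> list:
    """Return the values listed in a typing.Literal annotation."""
    return list(get_args(literal_type))


def invalid_values(values, literal_type) -> list:
    """Return the entries of `values` that the Literal does not allow."""
    allowed = set(allowed_values(literal_type))
    return [v for v in values if v not in allowed]


print(allowed_values(Letter))                   # ['a', 'b', 'c']
print(invalid_values(["a", "b", "z"], Letter))  # ['z']
```

The list returned by `allowed_values(Letter)` could then be passed wherever the categories are currently written out by hand.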