unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Optional doesn't flag column as nullable, when other constraints are added to Field #1800

Open antonioalegria opened 2 months ago

antonioalegria commented 2 months ago

Code Sample, a copy-pastable example

from pandera.polars import Field # type: ignore
from pandera.polars import DataFrameModel # type: ignore

from typing import Optional

import polars as pl

class MyModel(DataFrameModel):
    a: Optional[str] = Field(description="some description", nullable=True)
    b: Optional[str] = Field(description="some description") # BOOM: nulls rejected without nullable=True
    c: Optional[str] = Field(description="some description", str_contains=".", nullable=True)
    d: Optional[str] = Field(description="some description", str_contains=".") # BOOM: same as b, nullable=True is missing

df = pl.DataFrame({"a": ["a", None], "b": ["b.com", None], "c": ["c.com", None], "d": ["d.com", None]})
MyModel.validate(df) # ==> pandera.errors.SchemaError: non-nullable column 'b' contains null values

Expected behavior

The dataframe should have passed validation.

Desktop:

OS: macOS 14.6.1
Python: 3.12.4
polars-lts-cpu: 1.6.0
pandera: 0.20.3

cosmicBboy commented 2 months ago

Hi @antonioalegria, this is the intended behavior. Optional marks a column as not required to be present in the dataframe (see docs). You still have to set nullable=True explicitly in the Field; these are two different behaviors.

antonioalegria commented 2 months ago

I see. Then str | None should be equivalent to nullable=True, no? In any case, if Optional means the column can be missing, shouldn't it also imply that the column is nullable?

I have a workaround that dynamically marks all my Optional columns as nullable as well, but I am wondering whether there is a more natural (i.e. least surprising) behavior.

Thanks!