unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Support PEP 593 Annotated #1333

Open alanhdu opened 12 months ago

alanhdu commented 12 months ago

Right now, pandera's type-checking support is done using a special pandera.typing.DataFrame that is generic over a schema. This is a problem, since it causes lots of confusion with type-checkers that can't use the special mypy plugin (e.g. if you use pyright or pyre).

One way to make things work in a way that is compatible with other type-checkers is to use the PEP 593 Annotated type hint, which IMO is exactly built for this kind of thing.

Instead of

import pandas as pd
from pandera.typing import DataFrame
def fn(df: DataFrame[Schema]) -> DataFrame[SchemaOut]:
    return df.assign(age=30).pipe(DataFrame[SchemaOut])  # mypy okay

You could do

from typing import Annotated
def fn(df: Annotated[pd.DataFrame, Schema]) -> Annotated[pd.DataFrame, SchemaOut]:
    ...

This "attaches" the Schema to the type annotation (in a standard way for things like typeguard or the mypy plugin), but explicitly allows other type-checkers to ignore the attached Schema metadata if they don't use it. Even if the other type-checkers can't statically check the schema, this at least helps document the expected schema in the function signature (which is already a big win IMO).
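To illustrate how the metadata rides along with the annotation, here is a minimal sketch using only the standard library. `Frame` and `Schema` are placeholder classes standing in for pandas.DataFrame and a pandera schema; they are assumptions made so the snippet runs without third-party packages, not anything from pandera itself.

```python
from typing import Annotated, get_args, get_origin, get_type_hints


class Frame:
    """Placeholder for pandas.DataFrame (assumption, keeps the sketch stdlib-only)."""


class Schema:
    """Placeholder for a pandera schema class (assumption)."""


def fn(df: Annotated[Frame, Schema]) -> Annotated[Frame, Schema]:
    return df


# Default behaviour strips the metadata, leaving only the base type.
# This is roughly the view of a tool that does not care about the schema.
plain_hints = get_type_hints(fn)

# include_extras=True keeps the Annotated wrapper and its metadata,
# which is what a runtime validator would need to read.
rich_hints = get_type_hints(fn, include_extras=True)

# get_args on an Annotated hint yields the base type followed by the metadata.
base_type, *metadata = get_args(rich_hints["df"])
```

A tool that understands the schema can read `metadata`; one that doesn't simply treats the parameter as `Frame`.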

It'd be really nice if the mypy plugin and @check_types decorator both supported this format.
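As a rough idea of what decorator support could look like, here is a sketch of a hypothetical `check_annotated` decorator. It is not pandera's @check_types and makes no claim about how pandera would implement this; `Frame`, `HasAge`, and `check_annotated` are all invented for illustration, and the only assumption about a schema is that it exposes a `validate` callable.

```python
import functools
import inspect
from typing import Annotated, Any, Callable, get_args, get_origin, get_type_hints


class Frame(dict):
    """Placeholder dataframe: a dict mapping column names to lists (assumption)."""


class HasAge:
    """Hypothetical schema: requires an 'age' column."""

    @classmethod
    def validate(cls, frame: Frame) -> Frame:
        if "age" not in frame:
            raise ValueError("missing required column 'age'")
        return frame


def _validate(hint: Any, value: Any) -> Any:
    """Apply every Annotated metadata item that exposes a callable `validate`."""
    if get_origin(hint) is not Annotated:
        return value  # plain hints (or no hint at all) are passed through
    _base, *metadata = get_args(hint)
    for meta in metadata:
        validator = getattr(meta, "validate", None)
        if callable(validator):
            value = validator(value)
    return value


def check_annotated(func: Callable[..., Any]) -> Callable[..., Any]:
    """Hypothetical decorator: validates annotated arguments and the return value.

    Simplification: *args/**kwargs parameters are not treated specially.
    """
    hints = get_type_hints(func, include_extras=True)
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        bound = signature.bind(*args, **kwargs)
        for name, value in list(bound.arguments.items()):
            if name in hints:
                bound.arguments[name] = _validate(hints[name], value)
        result = func(*bound.args, **bound.kwargs)
        return _validate(hints.get("return"), result)

    return wrapper


@check_annotated
def add_age(df: Frame) -> Annotated[Frame, HasAge]:
    # Output is validated against HasAge after the function returns.
    return Frame({**df, "age": [30] * len(df.get("name", []))})


@check_annotated
def needs_age(df: Annotated[Frame, HasAge]) -> Frame:
    # Input is validated against HasAge before the function body runs.
    return df
```

Calling `needs_age` with a frame lacking an "age" column raises `ValueError` before the body runs, while type-checkers that ignore the metadata still see an ordinary `Frame` parameter.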

cosmicBboy commented 10 months ago

Yeah, this would be ideal and probably avoid a lot of the mypy-related linting errors that come from the pandera-specific generics.

This issue has my 🙏 blessing for whoever wants to open up a PR for it!

cswartzvi commented 6 months ago

@cosmicBboy I came up with a quick pydantic "solution" that overrides DataFrameModel.__get_pydantic_core_schema__ and registers DataFrameModel.validate as a validator of the annotated source type. Currently, this only works with pydantic v2 and it requires you to set arbitrary_types_allowed=True (because pandas.DataFrame is not a pydantic type), but it does successfully allow you to validate a DataFrameModel using Annotated with both pydantic.validate_call and pydantic.BaseModel.

I implemented this in a fork and it passes the testing suite. Some other work would have to be done in pandera.check_types and I am not sure if this would ever work in pydantic v1, but I just wanted to throw it out there and see if you were interested in exploring it further - I would be happy to submit a PR. Thanks!

from typing import Annotated, Any

import pandas as pd
import pandera as pa
import pydantic
import pydantic_core
from pandera.typing import Series

class DataFrameModel(pa.DataFrameModel):
    @classmethod
    def __get_pydantic_core_schema__(
        cls, _source_type: Any, _handler: pydantic.GetCoreSchemaHandler
    ) -> pydantic_core.core_schema.CoreSchema:
        if issubclass(_source_type, cls):
            # Allows for using DataFrameModel outside of Annotated
            return super().__get_pydantic_core_schema__(_source_type, _handler)

        # Registers a validator on the source type, run after inner validation has been performed
        return pydantic_core.core_schema.no_info_after_validator_function(
            cls.validate, _handler(_source_type)
        )

class People(DataFrameModel):
    name: Series[str]
    age: Series[int]

# arbitrary_types_allowed=True required to apply validators to the inner
# pd.DataFrame, this is essentially an isinstance check
config_dict = pydantic.ConfigDict(arbitrary_types_allowed=True)

@pydantic.validate_call(validate_return=True, config=config_dict)
def read() -> Annotated[pd.DataFrame, People]:
    df = pd.DataFrame({"name": ["bob", "alice "], "age": [40, 35]})
    return df

@pydantic.validate_call(validate_return=True, config=config_dict)
def transform(df: Annotated[pd.DataFrame, People]) -> Annotated[pd.DataFrame, People]:
    return df.assign(people=df.name.str.upper())

people = transform(read())

# Can also be used in BaseModel
class Model(pydantic.BaseModel):
    model_config = config_dict

    df: Annotated[pd.DataFrame, People]

model = Model(df=people)

cosmicBboy commented 5 months ago

Thanks for the prototype solution @cswartzvi! Definitely interested in making Annotated the de facto solution: the current way of typing with pandera.typing.DataFrame has a lot of typing issues. Will have to think about whether we want to deprecate pandera.typing.DataFrame[Schema].

I'm currently overhauling some of the pandera backend for the polars integration. Let me ping this channel once those changes are merged so that you can make a contribution.

y2kbugger commented 4 months ago

FastAPI would be one example of a widely used library that made/is making this switch. Maybe you could look to that project for ideas/support.

cswartzvi commented 3 months ago

@cosmicBboy now that the polars integration is out of beta, would you like to explore enabling the use of Annotated with DataFrameModel? Like I said, I would be happy to open a PR. I figured one could start with changes to DataFrameModel.__get_pydantic_core_schema__ (in my fork) and check_types (not currently in my fork) to make them compatible with both pandera.typing.DataFrame[Schema] and Annotated[pandas.DataFrame, Schema] (along with other dataframe libraries).