unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Use pydantic object for column validation #1068

Open alejandro-yousef opened 1 year ago

alejandro-yousef commented 1 year ago

Question about pandera

There are cases in which it is convenient to use a pydantic model to validate a pd.DataFrame. One way to do this is to create a class inheriting from pa.SchemaModel that holds the external object in an excluded attribute, as follows:

import pandas as pd
import pandera as pa
from pandera.typing import Series
from pydantic import BaseModel, PositiveFloat

class ConfigParams(BaseModel):
    max_value: PositiveFloat
    min_value: PositiveFloat

class MyBaseSchema(pa.SchemaModel):
    _config: ConfigParams

    def __new__(cls, check_obj: pd.DataFrame, config: ConfigParams):
        cls._config = config
        return super().__new__(cls, check_obj)

class MySchema(MyBaseSchema):
    col1: Series[float]

    @pa.check("col1")
    def custom_check(cls, col1: Series[float]) -> Series[bool]:
        return col1.between(cls._config.min_value, cls._config.max_value)

Is there a better way to achieve this? It seems like a rather common use case.

A major disadvantage of this implementation is that calling MySchema.to_schema() results in the following error:

Traceback (most recent call last):
  File "C:\...\scratch_6.py", line 28, in <module>
    MySchema.to_schema()
  File "C:\...\site-packages\pandera\model.py", line 196, in to_schema
    cls.__fields__ = cls._collect_fields()
  File "C:\...\site-packages\pandera\model.py", line 409, in _collect_fields
    field = attrs[field_name]  # __init_subclass__ guarantees existence
KeyError: '_config'

Being able to call to_schema() matters because it is needed to leverage the DataFrameSchema transformations.

Also, all the classes inheriting from MyBaseSchema share the same _config attribute, which seems risky even if it is overridden for each instance.
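To make the shared-state concern concrete, here is a minimal plain-Python sketch. It does not use pandera: BaseValidator, SchemaA and SchemaB are hypothetical stand-ins, and the config is a simple (low, high) tuple. It only illustrates what assigning to cls inside __new__ does in ordinary Python, under the assumption that the pattern above behaves the same way.

```python
class BaseValidator:
    """Hypothetical stand-in for a schema base class; this is not pandera."""

    _config = None

    def __new__(cls, values, config):
        # This assignment lands on the class being called, not on an instance.
        cls._config = config
        low, high = config
        # Returning a non-instance mirrors "calling the class validates data".
        return [low <= v <= high for v in values]


class SchemaA(BaseValidator):
    pass


class SchemaB(BaseValidator):
    pass


first = SchemaA([1, 5, 20], (0, 10))
second = SchemaB([1, 5, 20], (0, 100))
config_a_before = SchemaA._config

# A later call on the same class silently replaces the stored config.
third = SchemaA([1, 5, 20], (4, 6))
config_a_after = SchemaA._config

print(config_a_before, config_a_after, SchemaB._config, BaseValidator._config)
```

In this sketch each subclass ends up with its own _config, but the value is class-level state that every call overwrites, so two callers using the same class with different configs can interfere with each other.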

Relevant versions

python: 3.9
pandera: 0.13.4

tkaraouzene commented 1 year ago

Hi everyone and thanks for this library!

I'm also interested in this question. Is there an answer?

cosmicBboy commented 1 year ago

hey @alejandro-yousef @tkaraouzene this is probably a common use case... here's the solution: gotta use __init_subclass__ instead of __new__:

import pandas as pd
import pandera as pa
from pandera.typing import Series
from pydantic import BaseModel, PositiveFloat

class ConfigParams(BaseModel):
    min_value: PositiveFloat
    max_value: PositiveFloat

class MyBaseSchema(pa.SchemaModel):
    _custom_config: ConfigParams

    def __init_subclass__(cls, custom_config: ConfigParams, **kwargs):
        super().__init_subclass__(**kwargs)
        cls._custom_config = custom_config

class MySchema(MyBaseSchema, custom_config=ConfigParams(min_value=1, max_value=10)):
    col1: Series[float]

    @pa.check("col1")
    def custom_check(cls, col1: Series[float]) -> Series[bool]:
        return col1.between(cls._custom_config.min_value, cls._custom_config.max_value)

print(MySchema.to_schema())

output:

min_value=1.0 max_value=10.0
<Schema DataFrameSchema(
    columns={
        'col1': <Schema Column(name=col1, type=DataType(float64))>
    },
    checks=[],
    coerce=False,
    dtype=None,
    index=None,
    strict=False
    name=MySchema,
    ordered=False,
    unique_column_names=False
)>

error (raised when the validated data contains a col1 value outside the configured range, here -1.0):

pandera.errors.SchemaError: <Schema Column(name=col1, type=DataType(float64))> failed element-wise validator 0:
<Check custom_check>
failure cases:
   index  failure_case
0      0          -1.0
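The mechanism the snippet above relies on, keyword arguments in the class statement being forwarded to __init_subclass__, can be seen in isolation with plain Python. The sketch below does not use pandera or pydantic: Bounds, ConfiguredBase, Narrow, Wide and in_range are hypothetical names standing in for ConfigParams, MyBaseSchema, the schema subclasses and the custom check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bounds:
    """Hypothetical stand-in for the pydantic ConfigParams model."""

    min_value: float
    max_value: float


class ConfiguredBase:
    """Hypothetical base class; this is not pandera's SchemaModel."""

    def __init_subclass__(cls, custom_config: Bounds, **kwargs):
        # Runs once per subclass, at class-definition time.
        super().__init_subclass__(**kwargs)
        cls._custom_config = custom_config

    @classmethod
    def in_range(cls, values):
        cfg = cls._custom_config
        return [cfg.min_value <= v <= cfg.max_value for v in values]


class Narrow(ConfiguredBase, custom_config=Bounds(1, 10)):
    pass


class Wide(ConfiguredBase, custom_config=Bounds(0, 100)):
    pass


narrow_result = Narrow.in_range([-1.0, 5.0, 50.0])
wide_result = Wide.in_range([-1.0, 5.0, 50.0])
print(narrow_result, wide_result)
```

Because the config is bound when each subclass is defined, Narrow and Wide keep separate configs and nothing is rewritten at call time.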

This would actually be a great addition to the tutorials: would one of you be able to add it somewhere on this page? https://pandera.readthedocs.io/en/stable/schema_models.html

tkaraouzene commented 1 year ago

Hi @cosmicBboy !

Thanks for this answer!

alejandro-yousef commented 1 year ago

thanks @cosmicBboy

alejandro-yousef commented 1 year ago

@cosmicBboy looking more closely, I realized that it will not always be possible to instantiate ConfigParams when defining MySchema, because the values of the parameters min_value and max_value are only known at runtime.

In other words, is there a way to avoid the ConfigParams instantiation in the line below?

class MySchema(MyBaseSchema, custom_config=ConfigParams(min_value=1, max_value=10)):

Thank you for your answers.

cosmicBboy commented 1 year ago

hi @alejandro-yousef, seems like we've already discussed a solution for this here: https://github.com/unionai-oss/pandera/discussions/1067 :)

is there something missing in that solution?
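For the runtime-parameter question above, one general Python pattern is to define the subclass inside a function, so that the class keyword argument is evaluated only once the values are known. This is a plain-Python sketch and is not taken from the linked discussion. ConfiguredBase, build_checker and RuntimeChecker are hypothetical names, the config is a simple (min, max) tuple, and whether pandera schema classes behave well when created this way is an assumption that is not verified here.

```python
class ConfiguredBase:
    """Hypothetical base class; this is not pandera's SchemaModel."""

    def __init_subclass__(cls, custom_config, **kwargs):
        super().__init_subclass__(**kwargs)
        cls._custom_config = custom_config

    @classmethod
    def in_range(cls, values):
        low, high = cls._custom_config
        return [low <= v <= high for v in values]


def build_checker(min_value, max_value):
    """Hypothetical factory: the subclass is defined when the bounds are known."""

    class RuntimeChecker(ConfiguredBase, custom_config=(min_value, max_value)):
        pass

    return RuntimeChecker


# Bounds that would only be known at runtime, e.g. read from user input.
checker_small = build_checker(1, 10)
checker_large = build_checker(1, 1000)

small_result = checker_small.in_range([5, 500])
large_result = checker_large.in_range([5, 500])
print(small_result, large_result)
```

Each call to build_checker returns a distinct class, so configs created at runtime do not overwrite one another.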