unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Pydantic-style validators #542

Closed (d-chambers closed 3 years ago)

d-chambers commented 3 years ago

Hi,

I just found pandera and I am very pleased with its pydantic-style support for dataframes. However, one feature that is missing is validators. In pydantic, these differ from pandera's checks in that they can change or correct values as well as raise validation errors. From my understanding, pandera's checks must always return a boolean indicating whether each row is correct, so they cannot make corrections where desirable.

If validators are not supported, and there is not a better way to do it that I am missing, would you consider a PR adding them? I am thinking something like this:

import pandera as pa
from pandera.typing import Series

class Schema(pa.SchemaModel):

    column1: Series[int] = pa.Field(le=10)
    column2: Series[float] = pa.Field(lt=-1.2)
    column3: Series[str] = pa.Field()

    # Proposed API: pa.validator is the feature being requested, not an existing decorator.
    @pa.validator("column3")
    def column3_validator(cls, series: Series[str]) -> Series[str]:
        """Ensure each row starts with 'value_', adding the prefix where it is missing."""
        missing_prefix = ~series.str.startswith("value_")
        series[missing_prefix] = "value_" + series[missing_prefix]
        return series

Thanks for working on this great library.

cosmicBboy commented 3 years ago

hi @d-chambers, I'm glad you're finding pandera useful!

Yes, completely agreed that the parsing functionality of pydantic is a super useful feature to have.

See #252 for previous discussion about this topic. pydantic and pandera's designs do sort of differ in that pydantic is a parser first, a validation library second, while pandera is primarily a validation tool (with type coercion being the only supported parsing functionality).

I do want to add native support for parsing soon, but I want to design it carefully so that it suits dataframes and pandera's schema specification model.

IMO pydantic's validator decorator is a bit of a misnomer, as what it's doing is (i) parsing raw data values and (ii) emitting an error in the case of invalid ones. I'm going to discuss a more fleshed-out proposal in #252, so feel free to add your thoughts there!
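To make the parse-versus-validate distinction concrete, here is a rough sketch in plain pandas. It is not pandera's API and not the design proposed in #252; `parse_column3` and `check_column3` are hypothetical helper names used only for illustration.

```python
import pandas as pd

PREFIX = "value_"

def parse_column3(series: pd.Series) -> pd.Series:
    """Parsing step: return a corrected copy, adding the prefix where missing."""
    has_prefix = series.str.startswith(PREFIX)
    # Series.where keeps values where the condition is True and
    # substitutes the second argument everywhere else.
    return series.where(has_prefix, PREFIX + series)

def check_column3(series: pd.Series) -> pd.Series:
    """Validation step: one boolean per row, no mutation."""
    return series.str.startswith(PREFIX)

raw = pd.DataFrame({"column3": ["value_a", "b", "value_c"]})

# Parse first, then validate the parsed result.
parsed = raw.assign(column3=parse_column3(raw["column3"]))
if not check_column3(parsed["column3"]).all():
    raise ValueError("column3 contains values without the expected prefix")
```

Keeping the two steps separate means the check stays a pure boolean test, while any correction happens explicitly before validation and leaves the input dataframe untouched.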

cosmicBboy commented 3 years ago

closing this, merging with #252