unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

DataFrameSchema.__init__ validate_arguments #1016

Closed: the-matt-morris closed this issue 1 year ago

the-matt-morris commented 2 years ago

Describe the solution you'd like

Use pydantic.validate_arguments on the DataFrameSchema.__init__ function signature. This would validate the strict argument, removing the need for this statement:

https://github.com/unionai-oss/pandera/blob/6b6c9d5eea12b1b3640b4ba69178ae392132fcac/pandera/schemas.py#L190-L198

It would also add validation for report_duplicates, ensuring the value is one of ["exclude_first", "exclude_last", "all"].

Would have to play around and make sure it doesn't do anything unintended to any of the other arguments.

from pydantic import validate_arguments

class DataFrameSchema:

    @validate_arguments
    def __init__(
        self,
    ...

Additional context

This is not an earth-shattering proposal, but it removes the need to maintain the validation separately from the type annotation, which mostly pays off if the definition of the type changes in the future.
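
The idea of keeping the validation next to the type can be illustrated without pydantic. Below is a minimal standard-library sketch, not pandera's or pydantic's actual code: the ReportDuplicates alias and the check_report_duplicates helper are hypothetical names, and ValueError stands in for whatever error the library would raise. The permitted values are the ones listed above, and they appear only in the annotation.

```python
from typing import Literal, get_args

# Hypothetical alias: the permitted values are declared once, in the type.
ReportDuplicates = Literal["exclude_first", "exclude_last", "all"]


def check_report_duplicates(value: str) -> str:
    """Validate `value` against the Literal annotation (illustrative helper)."""
    # get_args() returns the tuple of values declared in the Literal, so the
    # check follows the annotation automatically if the annotation changes.
    permitted = get_args(ReportDuplicates)
    if value not in permitted:
        raise ValueError(
            f"report_duplicates must be one of {permitted}, got {value!r}"
        )
    return value
```

If a new option were added to the Literal, the check would pick it up without a second list to update, which is the maintenance benefit the proposal is after.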

cosmicBboy commented 2 years ago

go for it @the-matt-morris !

the-matt-morris commented 1 year ago

After trying this out locally, I'm thinking it's not going to be worth it:

  1. This line is fine for mypy, but pydantic.validate_arguments actually needs access to Column here, and it can't be imported in pandera.schemas without circular import errors.
  2. Even if that can be solved, creating a schema with an invalid strict value yields a less useful error message than the one already provided:

import pandera as pa

class MySchema(pa.SchemaModel):
    class Config:
        strict = "yep"

    str_col: pa.typing.Series[str]
    int_col: pa.typing.Series[int]

dataframe_schema = MySchema.to_schema()

Traceback (most recent call last):
...
strict
  value could not be parsed to a boolean (type=type_error.bool)
strict
  unexpected value; permitted: 'filter' (type=value_error.const; given=yep; permitted=('filter',))

That message is more confusing than the existing SchemaInitError:

Traceback (most recent call last):
...
pandera.errors.SchemaInitError: strict parameter must equal either `True`, `False`, or `'filter'`.
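
The circular-import problem in point 1 can be sketched with the standard library alone. This assumes the line in question is an import that only static checkers see (for example one guarded by typing.TYPE_CHECKING); the thread does not show the line, so that is a guess. The module name schema_components_stub and the make_schema function are hypothetical stand-ins, and typing.get_type_hints is used here as a proxy for the annotation resolution a runtime validator has to perform.

```python
from typing import TYPE_CHECKING, Dict, get_type_hints

if TYPE_CHECKING:
    # Only static checkers such as mypy follow this import; it never executes.
    # `schema_components_stub` is a hypothetical module name for illustration.
    from schema_components_stub import Column


def make_schema(columns: "Dict[str, Column]") -> None:
    """Stand-in for an __init__ whose annotation mentions Column."""


def resolve_annotations() -> str:
    """Try to resolve the annotations at runtime, as a validator would need to."""
    try:
        get_type_hints(make_schema)
    except NameError as exc:
        # Column was never imported at runtime, so the lookup fails.
        return f"cannot resolve: {exc}"
    return "resolved"
```

Calling resolve_annotations() reports that the name Column cannot be found, while mypy would accept the same file. Making the name available at runtime means a real import, which is where the circular import comes in.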

cosmicBboy commented 1 year ago

okay, let's close this issue in that case. it was worth a try!