unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License
3.27k stars 305 forks source link

check_types decorator does not work on methods #523

Closed internetofjames closed 3 years ago

internetofjames commented 3 years ago

I am using the latest pandera version 0.6.4. I noticed while using SchemaModels that the pandera.check_types decorator does not validate annotated arguments of a method, whether it be a regular method, static method, or class method. Below is a simple SchemaModel, and a simple class with a method that simply prints the columns of the DataFrame.

import pandas as pd
import pandera
from pandera import SchemaModel
from pandera.typing import Series, String, Int, DataFrame

#### Code Sample

class Schema(SchemaModel):
    string_col: Series[String]
    int_col: Series[Int]

    class Config:
        strict = "filter"

class SomeClass:

    @pandera.check_types(lazy=True)
    def print_cols(self, df: DataFrame[Schema]) -> None:
        print(df.columns)

@pandera.check_types(lazy=True)
def print_cols(df: DataFrame[Schema]) -> None:
    print(df.columns)

if __name__ == "__main__":
    some_class = SomeClass()
    df = pd.DataFrame({"string_col": ["foo", "bar"], "int_col": [1, 2], "float_col": [0.1, 0.2]})
    some_class.print_cols(df)
    print_cols(df)

Expected behavior

Since the schema's Config has strict = "filter", I should see the extra float_col of the DataFrame dropped automatically before the columns are printed. Compared to the identical function below, that is not the case.

Output:

Index(['string_col', 'int_col', 'float_col'], dtype='object')
Index(['string_col', 'int_col'], dtype='object')
cristianmatache commented 3 years ago

I found a temporary work-around - use keyword arguments.

import pandera as pa
import pandera.typing as pat

class A(pa.SchemaModel):
    a: pat.Series[int]

    class Config:
        strict = True

class B(A):
    b: pat.Series[int]

    class Config:
        strict = True

class X:

    @pa.check_types
    def f(self, df: pat.DataFrame[A], df2: pat.DataFrame[B]) -> None:
        pass

    def g(self, df: pat.DataFrame[A], df2: pat.DataFrame[B]) -> None:
        A.validate(df)
        B.validate(df2)

a = pat.DataFrame({'a': [1, 2, 3]})
b = a.copy()
b[B.b] = 10
X().f(df=a, df2=b)  # OK - expected -> temporary work-around
X().g(a, b)         # OK - expected
X().f(a, b)         # NOT OK - not expected
jeffzi commented 3 years ago

Thanks for the bug report and clear examples. The issue seems to be that the implicit positional argument self or cls is not re-injected when calling the method after having validated its inputs. https://github.com/pandera-dev/pandera/blob/a0813b7d85c5564fa98554196d459477ca2d272f/pandera/decorators.py#L506

@cosmicBboy I can take care of that.

cristianmatache commented 3 years ago

First, a few quick reminders:

The problem here is that when we re-bind argument values (that come via wrapt) to argument names (that come via inspect.signature) there is a mismatch when dealing with "self".

@jeffzi @cosmicBboy I already fixed it, so I will just open a pull request. But below is the main part of the fix. This is similar to how wrapt itself checks whether the instance is None: e.g. https://github.com/GrahamDumpleton/wrapt/blob/118cdf51bb3d43e22886fea91112361144ed3ee0/src/wrapt/decorators.py#L248

    @wrapt.decorator
    def _wrapper(
        wrapped: Callable,
        instance: Optional[Any],
        args: Tuple[Any, ...],
        kwargs: Dict[str, Any],
    ):
        def validate_args(arguments: OrderedDict[str, Any]) -> Dict[str, Any]:
            return {arg_name: _check_arg(arg_name, arg_value) for arg_name, arg_value in arguments.items()}

        if instance is not None:
            # If the wrapped function is a method -> add "self" as the first positional arg
            args = (instance, *args)

        validated_pos = validate_args(sig.bind_partial(*args).arguments)
        validated_kwd = validate_args(sig.bind_partial(**kwargs).arguments)

        if instance is not None:
            # If the decorated func is a method, wrapped is already a bound method -> remove the "self" argument
            first_pos_arg = list(sig.parameters)[0]
            del validated_pos[first_pos_arg]

        out = wrapped(*validated_pos.values(), **validated_kwd)
        return _check_arg("return", out)
internetofjames commented 3 years ago

Thanks all -- I appreciate the good discussion and quick work! pandera looks to be a very promising tool and I am excited to see where the future takes it! 😎

cosmicBboy commented 3 years ago

fixed by #530