unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License
3.37k stars 310 forks source link

Mixins with only dataframe_checks cause error in SchemaModel #433

Closed khwilson closed 3 years ago

khwilson commented 3 years ago

Describe the bug When creating a SchemaModel, if it only contains dataframe_checks then the to_schema function fails when it tries to take vars(config).

Code Sample, a copy-pastable example

This fails

from typing import List

import pandas as pd
import pandera as pa

def primary_key_mixin(columns: List[str]):
    class _PrimaryKeyMixin(pa.SchemaModel):
        @pa.dataframe_check
        @classmethod
        def check_primary_key(cls, df: pd.DataFrame) -> bool:
            return (df.groupby(columns).size() <= 1).all()

    return _PrimaryKeyMixin

class TestSchema(primary_key_mixin(['primary_key'])):
    primary_key: pa.typing.Series[int]
    other_value: pa.typing.Series[str]

df = pd.DataFrame({
    'primary_key': [1, 2, 3],
    'other_value': ['a', 'b', 'c'],
})

TestSchema.validate(df)

With traceback (replaced some directory names for privacy):

Traceback (most recent call last):
  File "test.py", line 27, in <module>
    TestSchema.validate(df)
  File ".../.venv/lib/python3.8/site-packages/pandera/model.py", line 184, in validate
    return cls.to_schema().validate(
  File ".../.venv/lib/python3.8/site-packages/pandera/model.py", line 150, in to_schema
    cls.__config__ = cls._collect_config()
  File ".../.venv/lib/python3.8/site-packages/pandera/model.py", line 331, in _collect_config
    base_options = _extract_config_options(config)
  File ".../.venv/lib/python3.8/site-packages/pandera/model.py", line 97, in _extract_config_options
    for name, value in vars(config).items()
TypeError: vars() argument must have __dict__ attribute

Expected behavior

The above should run to completion

Desktop (please complete the following information):

Additional context

If you add a single class-level variable with an annotation of its own it succeeds, e.g.:

from typing import List, Tuple

import pandas as pd
import pandera as pa

def primary_key_mixin(columns: List[str]):
    class _PrimaryKeyMixin(pa.SchemaModel):
        __primary_key__: Tuple[str] = tuple(columns)

        @pa.dataframe_check
        @classmethod
        def check_primary_key(cls, df: pd.DataFrame) -> bool:
            return (df.groupby(columns).size() <= 1).all()

    return _PrimaryKeyMixin

class TestSchema(primary_key_mixin(['primary_key'])):
    primary_key: pa.typing.Series[int]
    other_value: pa.typing.Series[str]

df = pd.DataFrame({
    'primary_key': [1, 2, 3],
    'other_value': ['a', 'b', 'c'],
})

TestSchema.validate(df)

However, if you do NOT have an annotation, it still fails:

from typing import List, Tuple

import pandas as pd
import pandera as pa

def primary_key_mixin(columns: List[str]):
    class _PrimaryKeyMixin(pa.SchemaModel):
        __primary_key__ = tuple(columns)

        @pa.dataframe_check
        @classmethod
        def check_primary_key(cls, df: pd.DataFrame) -> bool:
            return (df.groupby(columns).size() <= 1).all()

    return _PrimaryKeyMixin

class TestSchema(primary_key_mixin(['primary_key'])):
    primary_key: pa.typing.Series[int]
    other_value: pa.typing.Series[str]

df = pd.DataFrame({
    'primary_key': [1, 2, 3],
    'other_value': ['a', 'b', 'c'],
})

TestSchema.validate(df)
#Traceback (most recent call last):
# File "test.py", line 29, in <module>
#    TestSchema.validate(df)
#  File ".../.venv/lib/python3.8/site-packages/pandera/model.py", line 184, in validate
#    return cls.to_schema().validate(
#  File ".../.venv/lib/python3.8/site-packages/pandera/model.py", line 150, in to_schema
#    cls.__config__ = cls._collect_config()
#  File ".../.venv/lib/python3.8/site-packages/pandera/model.py", line 331, in _collect_config
#    base_options = _extract_config_options(config)
#  File ".../.venv/lib/python3.8/site-packages/pandera/model.py", line 97, in _extract_config_options
#    for name, value in vars(config).items()
#TypeError: vars() argument must have __dict__ attribute
khwilson commented 3 years ago

First off, didn't get to say it in the formal report above, but thank you for putting your time into this wonderful library!

To business, it turns out the error turns out to be simpler. You can reproduce the error for any schema with no annotations. For instance, the following code snippet produces a similar error:

import pandera as pa

class ASchema(pa.SchemaModel):
    pass

ASchema.to_schema()
# Traceback (most recent call last):
#   File "test.py", line 10, in <module>
#     ASchema.to_schema()
#   File ".../pandera/model.py", line 150, in to_schema
#     cls.__config__ = cls._collect_config()
#   File ".../pandera/model.py", line 331, in _collect_config
#     base_options = _extract_config_options(config)
#   File ".../pandera/model.py", line 97, in _extract_config_options
#     for name, value in vars(config).items()
# TypeError: vars() argument must have __dict__ attribute

The issue seems to stem from this line: https://github.com/pandera-dev/pandera/blob/master/pandera/model.py#L126

Apparently, at least in Python 3.8.6, if no annotations are specified in the subclass, cls.__annotations__ is (maybe?) the superclass's annotated fields. But if even a single annotation is specified, the __annotations__ dictionary contains only the annotated fields of the subclass.

As an example, if we put a print(cls.__annotations__) right before the for loop there, we get the following:

import pandera as pa

class ASchema(pa.SchemaModel):
    pass
# {'Config': typing.Type[pandera.model.BaseConfig],
#  '__checks__': typing.Dict[str, typing.List[pandera.checks.Check]],
#  '__config__': typing.Union[typing.Type[pandera.model.BaseConfig], NoneType],
#  '__dataframe_checks__': typing.List[pandera.checks.Check],
#  '__fields__': typing.Dict[str, typing.Tuple[pandera.typing.AnnotationInfo, pandera.model_components.FieldInfo]],
#  '__schema__': typing.Union[pandera.schemas.DataFrameSchema, NoneType]}

But if we add an annotation we get:

import pandera as pa

class ASchema(pa.SchemaModel):
    a: pa.typing.Series[int]
# {'a': pandera.typing.Series[int]}

One possible fix would be to use __new__(cls, name, bases, dct) instead of __init_subclass__ and explicitly check if any(cls.__annotations__ is base.__annotations__ for base in bases). But that seems quite brittle.

jeffzi commented 3 years ago

Thanks @khwilson for the bug report and concise examples.

You are right, the problem comes from __annotations__ returning its super's annotations (i.e SchemaModel) if the class doesn't have annotations itself. Config is then misinterpreted as a missing field and overwritten.

I've just pushed a fix. I simply filter out annotations that starts with "_" and "Config", as it is done when collecting regular fields.

Btw, your primary_key_mixin is very nice :star_struck:

khwilson commented 3 years ago

Excellent! One quick comment on your solution, though: it's not common practice but I assume there will be times people have fields that start with _. Perhaps at least updating the documentation to reflect that?

And I'm happy to do a PR with the primary_key_mixin if you wanted to add it to the main library. I've been getting some decent mileage out of it. :-)

jeffzi commented 3 years ago

Actually private annotations have always been ignored since SchemaModel was introduced. I realized now that it was not to specified in the docs. I'll do that !

I think the primary key would be better as a recipe. It seems very specific for pandera itself, but @cosmicBboy might have another opinion. It would be nice if you could post the template in the discussions. I'm sure many people would find it useful.

One thing is that mypy complains: error: Unsupported dynamic base class "primary_key_mixin". We can continue in "discussions" if you don't mind :)

cosmicBboy commented 3 years ago

thanks @khwilson for pointing this out!

Btw, your primary_key_mixin is very nice 🤩

+1, love how flexible this syntax is turning out to be :)

Thanks @jeffzi just ✅ your PR.

khwilson commented 3 years ago

Thanks, y'all! Happy to start a discussion about the mixin!

cosmicBboy commented 3 years ago

fixed by #434