unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License
3.43k stars 310 forks source link

Type checking reports error when using a subclass of the specified DataFrameModel #1170

Open thundercat1 opened 1 year ago

thundercat1 commented 1 year ago

mypy reports type errors when trying to use a subclass of a DataFrameModel in a scenario where the parent class would be accepted. Ordinarily I'd expect that a subclass would satisfy a type hint asking for the parent class.

In the code sample below, type checking fails, but the script executes successfully (pydantic validation succeeds). Is there a better way I should be declaring this kind of type hint, or is this a bug that can/should be fixed?

Environment:

pandera 0.14.5 mypy 1.1.1 Python 3.10.9

import pandera as pa
from pandera.typing import Series, DataFrame
import pandas as pd

class Animal(pa.DataFrameModel):
    name: Series[str]

class Dog(Animal):
    breed: Series[str]

@pa.check_types(with_pydantic=True)
def say_hello(animal: DataFrame[Animal]):
    # Says hello to any animal
    print(f"Hello {animal.name}!")

dog = DataFrame[Dog](pd.DataFrame({"name": ["Fido"], "breed": ["Labrador"]}))

# Since `Dog` is a subclass of `Animal`, this should be valid
say_hello(dog)

# However, mypy reports an error:
"""
mypy pandera_type_test.py --show-traceback

pandera_type_test.py:22: error: Argument 1 to "say_hello" has incompatible type "DataFrame[Dog]"; expected "DataFrame[Animal]"  [arg-type]
Found 1 error in 1 file (checked 1 source file)
"""
kr-hansen commented 1 year ago

@cosmicBboy I just wanted to follow-up on this. This case seems different from the overwriting case outlined in the documented false positives case for mypy. Any thoughts or insights on this? This seems potentially more related to how DataFrameModels inherit from one another rather than a pandas-stubs or mypy issue.

kr-hansen commented 1 year ago

To further confirm this seems to be something specific to how pandera.DataFrameModels handle inheritance, I took the example from @thundercat1 and ported it to use pydantic models which doesn't show this same type of inheritance issue. Example below:

from pydantic import validate_arguments, BaseModel

class Animal(BaseModel):
    name: str

class Dog(Animal):
    breed: str

@validate_arguments(config=dict(arbitrary_types_allowed=True))
def say_hello(animal: Animal):
    # Says hello to any animal
    print(f"Hello {animal.name}!")

dog = Dog(name="Fido", breed="Labrador")

# Since `Dog` is a subclass of `Animal`, this should be valid
say_hello(dog)

# No mypy error in the case of `pydantic` models
"""
mypy pydantic_type_test.py --show-traceback

Success: no issues found in 1 source file
"""

Since this does seem to be pandera specific, do you have any thoughts of where to look to potentially poke around as to why pandera model inheritance may not be behaving as expected here? Perhaps something to do with the usage of the DataFrame generic type?

kr-hansen commented 1 year ago

All right @cosmicBboy I think I figured it out. It has to do with the DataFrame generic type and covariance & contravariance and the fact that mypy assumes generic types are invariant by default.

The following example adopted from @thundercat1's works fine for me:

from typing import Generic, TypeVar

import pandera as pa
from pandera.typing import Series, DataFrame
import pandas as pd

TDataFrame_co = TypeVar("TDataFrame_co", covariant=True)

class DataFrame_co(DataFrame, Generic[TDataFrame_co]):
    pass

class Animal(pa.DataFrameModel):
    name: Series[str]

class Dog(Animal):
    breed: Series[str]

@pa.check_types(with_pydantic=True)
def say_hello(animal: DataFrame[Animal]):
    # Says hello to any animal
    print(f"Hello {animal.name}!")

dog = DataFrame_co[Dog](pd.DataFrame({"name": ["Fido"], "breed": ["Labrador"]}))

# Since `Dog` is a subclass of `Animal`, this should be valid
say_hello(dog)

# No mypy error in the case of `pydantic` models
"""
mypy pandera_type_test_covariant.py --show-traceback

Success: no issues found in 1 source file
"""

I'm guessing the pydantic case works because of this definition in the code and how that probably gets perpetuated.

I briefly poked around in the code to see if I could tweak the DataFrame in the source code and hit a couple issues. Happy to try and implement them in a PR if you have any further suggestions @cosmicBboy:

  1. I added covariant=True to this TypeVar and I hit errors on the .from_records static method because that same TypeVar gets used in that method and I get a misc mypy error Cannot use a covariant type variable as a parameter. I'm not sure how to best resolve that error with how that TypeVar is used within that method without messing things up too much.
  2. I'm also getting 25 mypy [union-attrs] errors on trunk in the pandera/backends/pandas directory so didn't dig into resolving those either.

I'm also not sure if to pickup similar behavior for Series if you'd want something similar in the GenericDType or not. Also not sure if this is something you'd want integrated in the code base or not. It looks like the above workaround may work ok for our use case without needing to modify the base code, but thought I'd pass this back up and see your thoughts.

EDIT: This workaround actually doesn't fully work in our case, so we're still kinda stuck on this.

bartwozniak commented 10 months ago

I am facing a similar issue - I'd like the DataFrame to be covariant on its generic type argument. Is there any plan to move in this direction?

trey-stafford commented 3 months ago

Just ran into this issue w/ pandera v0.20.3 & Python 3.12.5.

erklem commented 1 month ago

I agree that it would be better to have child classes succeed when mypy expects their parent class.

In the meantime, I think one could use either typing.Union or typing.Optional to accomplish the desired behavior.

typing.Union The following code has the expected behavior at both lint-time and runtime - it succeeds on all but the last line.

from typing import Union

import pandera as pa
from pandera.typing import DataFrame

class ParentSchema(pa.DataFrameModel):
    parent_id: int

class ChildSchema(ParentSchema):
    child_id: int

class ThirdSchema(pa.DataFrameModel):
    third_id: int

@pa.check_types(lazy=True)
def get_parent_id(df: Union[DataFrame[ParentSchema], DataFrame[ChildSchema]]) -> str:
    return str(df.parent_id)

parent = DataFrame[ParentSchema]({'parent_id': [100]})
child = DataFrame[ChildSchema]({'parent_id': [101], 'child_id': [200]})
third = DataFrame[ThirdSchema]({'third_id': [300]})

print(f'parent_id of parent: {get_parent_id(parent)}')
print(f'parent_id of child: {get_parent_id(child)}')
print(f'parent_id of third: {get_parent_id(third)}')

I prefer this option because it's explicit about accepted schemas. (Though your type hint could become verbose for a deep inheritance tree.)

typing.Optional Again, this code has the expected behavior at run-time and lint-time: succeed on all but the last line.


from typing import Optional

import pandera as pa
from pandera.typing import DataFrame

class ParentSchema(pa.DataFrameModel):
    parent_id: int
    child_id: Optional[int]

class ThirdSchema(pa.DataFrameModel):
    third_id: int

@pa.check_types(lazy=True)
def get_parent_id(df: DataFrame[ParentSchema]) -> str:
    return str(df.parent_id)

parent = DataFrame[ParentSchema]({'parent_id': [100]})
child = DataFrame[ParentSchema]({'parent_id': [101], 'child_id': [200]})
third = DataFrame[ThirdSchema]({'third_id': [300]})

print(f'parent_id of parent: {get_parent_id(parent)}')
print(f'parent_id of child: {get_parent_id(child)}')
print(f'parent_id of third: {get_parent_id(third)}')

This option is less verbose. It doesn't solve the inheritance issue, though; it just avoids inheritance. Additionally, if you have many optional columns, you'll have to handle many possible combinations of schema.

I'm using pandera 0.20.4 and mypy 1.13.0