Open thundercat1 opened 1 year ago
@cosmicBboy I just wanted to follow-up on this. This case seems different from the overwriting case outlined in the documented false positives case for mypy
. Any thoughts or insights on this? This seems potentially more related to how DataFrameModels
inherit from one another rather than a pandas-stubs
or mypy
issue.
To further confirm this seems to be something specific to how pandera.DataFrameModels
handle inheritance, I took the example from @thundercat1 and ported it to use pydantic
models which doesn't show this same type of inheritance issue. Example below:
from pydantic import validate_arguments, BaseModel
class Animal(BaseModel):
name: str
class Dog(Animal):
breed: str
@validate_arguments(config=dict(arbitrary_types_allowed=True))
def say_hello(animal: Animal):
# Says hello to any animal
print(f"Hello {animal.name}!")
dog = Dog(name="Fido", breed="Labrador")
# Since `Dog` is a subclass of `Animal`, this should be valid
say_hello(dog)
# No mypy error in the case of `pydantic` models
"""
mypy pydantic_type_test.py --show-traceback
Success: no issues found in 1 source file
"""
Since this does seem to be pandera
specific, do you have any thoughts of where to look to potentially poke around as to why pandera
model inheritance may not be behaving as expected here? Perhaps something to do with the usage of the DataFrame
generic type?
All right @cosmicBboy I think I figured it out. It has to do with the DataFrame
generic type and covariance & contravariance and the fact that mypy assumes generic types are invariant by default.
The following example adopted from @thundercat1's works fine for me:
from typing import Generic, TypeVar
import pandera as pa
from pandera.typing import Series, DataFrame
import pandas as pd
TDataFrame_co = TypeVar("TDataFrame_co", covariant=True)
class DataFrame_co(DataFrame, Generic[TDataFrame_co]):
pass
class Animal(pa.DataFrameModel):
name: Series[str]
class Dog(Animal):
breed: Series[str]
@pa.check_types(with_pydantic=True)
def say_hello(animal: DataFrame[Animal]):
# Says hello to any animal
print(f"Hello {animal.name}!")
dog = DataFrame_co[Dog](pd.DataFrame({"name": ["Fido"], "breed": ["Labrador"]}))
# Since `Dog` is a subclass of `Animal`, this should be valid
say_hello(dog)
# No mypy error in the case of `pydantic` models
"""
mypy pandera_type_test_covariant.py --show-traceback
Success: no issues found in 1 source file
"""
I'm guessing the pydantic case works because of this definition in the code and how that probably gets perpetuated.
I briefly poked around in the code to see if I could tweak the DataFrame
in the source code and hit a couple issues. Happy to try and implement them in a PR if you have any further suggestions @cosmicBboy:
covariant=True
to this TypeVar and I hit errors on the .from_records
static method because that same TypeVar gets used in that method and I get a misc mypy error Cannot use a covariant type variable as a parameter
. I'm not sure how to best resolve that error with how that TypeVar is used within that method without messing things up too much.[union-attrs]
errors on trunk in the pandera/backends/pandas
directory so didn't dig into resolving those either.I'm also not sure if to pickup similar behavior for Series
if you'd want something similar in the GenericDType or not. Also not sure if this is something you'd want integrated in the code base or not. It looks like the above workaround may work ok for our use case without needing to modify the base code, but thought I'd pass this back up and see your thoughts.
EDIT: This workaround actually doesn't fully work in our case, so we're still kinda stuck on this.
I am facing a similar issue - I'd like the DataFrame
to be covariant on its generic type argument. Is there any plan to move in this direction?
Just ran into this issue w/ pandera v0.20.3 & Python 3.12.5.
I agree that it would be better to have child classes succeed when mypy expects their parent class.
In the meantime, I think one could use either typing.Union
or typing.Optional
to accomplish the desired behavior.
typing.Union The following code has the expected behavior at both lint-time and runtime - it succeeds on all but the last line.
from typing import Union
import pandera as pa
from pandera.typing import DataFrame
class ParentSchema(pa.DataFrameModel):
parent_id: int
class ChildSchema(ParentSchema):
child_id: int
class ThirdSchema(pa.DataFrameModel):
third_id: int
@pa.check_types(lazy=True)
def get_parent_id(df: Union[DataFrame[ParentSchema], DataFrame[ChildSchema]]) -> str:
return str(df.parent_id)
parent = DataFrame[ParentSchema]({'parent_id': [100]})
child = DataFrame[ChildSchema]({'parent_id': [101], 'child_id': [200]})
third = DataFrame[ThirdSchema]({'third_id': [300]})
print(f'parent_id of parent: {get_parent_id(parent)}')
print(f'parent_id of child: {get_parent_id(child)}')
print(f'parent_id of third: {get_parent_id(third)}')
I prefer this option because it's explicit about accepted schemas. (Though your type hint could become verbose for a deep inheritance tree.)
typing.Optional Again, this code has the expected behavior at run-time and lint-time: succeed on all but the last line.
from typing import Optional
import pandera as pa
from pandera.typing import DataFrame
class ParentSchema(pa.DataFrameModel):
parent_id: int
child_id: Optional[int]
class ThirdSchema(pa.DataFrameModel):
third_id: int
@pa.check_types(lazy=True)
def get_parent_id(df: DataFrame[ParentSchema]) -> str:
return str(df.parent_id)
parent = DataFrame[ParentSchema]({'parent_id': [100]})
child = DataFrame[ParentSchema]({'parent_id': [101], 'child_id': [200]})
third = DataFrame[ThirdSchema]({'third_id': [300]})
print(f'parent_id of parent: {get_parent_id(parent)}')
print(f'parent_id of child: {get_parent_id(child)}')
print(f'parent_id of third: {get_parent_id(third)}')
This option is less verbose. It doesn't solve the inheritance issue, though; it just avoids inheritance. Additionally, if you have many optional columns, you'll have to handle many possible combinations of schema.
I'm using pandera 0.20.4 and mypy 1.13.0
mypy
reports type errors when trying to use a subclass of a DataFrameModel in a scenario where the parent class would be accepted. Ordinarily I'd expect that a subclass would satisfy a type hint asking for the parent class.In the code sample below, type checking fails, but the script executes successfully (pydantic validation succeeds). Is there a better way I should be declaring this kind of type hint, or is this a bug that can/should be fixed?
Environment:
pandera 0.14.5 mypy 1.1.1 Python 3.10.9