unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Mypy - SchemaModel.validate does not return a DataFrame #763

Open adrien-turiot-maxa opened 2 years ago

adrien-turiot-maxa commented 2 years ago

The SchemaModel.validate function returns a DataFrameBase[T], which does not extend pd.DataFrame.

This makes type checking fail wherever a pd.DataFrame is expected. For example:

import pandas as pd
import pandera as pa
from pandera.typing import Series

class Schema(pa.SchemaModel):
    col1: Series[float]
    col2: Series[float]

# Float literals so the data matches Series[float] at runtime without coercion
existing_df = pd.DataFrame({"col1": [1.0, 2.0, 3.0], "col2": [1.0, 2.0, 3.0]})
result = Schema.validate(existing_df)

result.to_csv("test")        # mypy error: "DataFrameBase[Schema]" has no attribute "to_csv"
pd.concat([result, result])  # mypy error: List item has incompatible type "DataFrameBase[Schema]"

Why does Schema.validate return a DataFrameBase[T] instead of a DataFrame[T]?

The same applies to the SchemaModel.example function.

(pandera version 0.9.0)

lorenzo-w commented 1 year ago

Facing the same issue right now. I would like to validate my dataframes right after loading them from CSV and then have the proper type annotation from there on. Currently I am using a small custom function which calls SchemaModel.validate and then casts to DataFrame[T], but I would actually expect pandera to already return that.

cosmicBboy commented 1 year ago

Looking into this... basically need to do the following:

Probably for another PR, but will probably also need to overload the DataFrameSchema.validate method: https://github.com/unionai-oss/pandera/blob/main/pandera/schemas.py#L441-L450

@lorenzo-w would you be open to making a contribution here?

lorenzo-w commented 1 year ago

@cosmicBboy Wow thanks! That was the swiftest response I've ever had to a public issue. How could I say no then? 🙃 So yes, I'll take a shot at it this weekend and make a PR if it works.

cosmicBboy commented 1 year ago

Great @lorenzo-w! The issue's been around for a while, so I didn't want it to fall through the cracks again. Let me know if you need any help, and check out the contribution guide to get your dev environment set up.

adzcai commented 2 months ago

Also running into this issue and I'm happy to help. Just noting that for now you could also call DataFrame[Schema](existing_df) for validation and type-checking.