unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Pandera dataframe in Pydantic model .dict() and .json() compatibility #966

Closed derinwalters closed 1 year ago

derinwalters commented 1 year ago

In reading through the Pandera documentation, it's not clear to me how to intermingle Pandera dataframes within a Pydantic model and still be able to use the .dict() and .json() methods successfully. I followed the steps on https://pandera.readthedocs.io/en/stable/pydantic_integration.html#using-pandera-schemas-in-pydantic-models and love how seamless it is. However, the .dict() method keeps the Pandera type and .json() fails altogether. The solution provided by Pandera's to_format is close, but I want to keep the validated dataframe intact while I perform operations and then convert the format later (not right away). Is there a way to do this?

from devtools import debug
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series
import pydantic

class SimpleSchema(pa.SchemaModel):
    str_col: Series[str] = pa.Field(unique=True)

class PydanticModel(pydantic.BaseModel):
    x: int
    df: DataFrame[SimpleSchema]

valid_df = pd.DataFrame({"str_col": ["hello", "world"]})
myinst = PydanticModel(x=1, df=valid_df)
debug(myinst)
debug(myinst.dict())
debug(myinst.json())

Output:

test.py:26 <module>
    myinst: PydanticModel(
        x=1,
        df=<DataFrame({
            'str_col': <Series({
                0: 'hello',
                1: 'world',
            })>,
        })>,
    ) (PydanticModel)
test.py:27 <module>
    myinst.dict(): {
        'x': 1,
        'df': <DataFrame({
            'str_col': <Series({
                0: 'hello',
                1: 'world',
            })>,
        })>,
    } (dict) len=2
Traceback (most recent call last):
  File "/Users/derinw/x-bitbucket/juso/tests/test.py", line 28, in <module>
    debug(myinst.json())
  File "pydantic/main.py", line 505, in pydantic.main.BaseModel.json
  File "/Users/derinw/miniforge3/envs/juso-dev/lib/python3.9/json/__init__.py", line 234, in dumps
    return cls(
  File "/Users/derinw/miniforge3/envs/juso-dev/lib/python3.9/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/Users/derinw/miniforge3/envs/juso-dev/lib/python3.9/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "pydantic/json.py", line 90, in pydantic.json.pydantic_encoder
TypeError: Object of type 'DataFrame' is not JSON serializable

cosmicBboy commented 1 year ago

Hi @derinwalters, this is currently unexplored territory, so I'd appreciate clarification on the use cases here.

For the .dict() method, is the expectation that the df key is turned into a list of records, or some other format?

I suspect that once .dict() works, the .json() method should as well.

Are you familiar with how to create custom pydantic types? How does one extend a type within a BaseModel so that it can be converted to a JSON-serializable dict?

cosmicBboy commented 1 year ago

So, looking at the pydantic docs, this will work:

from devtools import debug
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series
import pydantic

class SimpleSchema(pa.SchemaModel):
    str_col: Series[str] = pa.Field(unique=True)

class PydanticModel(pydantic.BaseModel):
    x: int
    df: DataFrame[SimpleSchema]

    class Config:
        json_encoders = {
            pd.DataFrame: lambda x: x.to_dict(orient="records")
        }

valid_df = pd.DataFrame({"str_col": ["hello", "world"]})
myinst = PydanticModel(x=1, df=valid_df)
debug(myinst)
debug(myinst.dict())
debug(myinst.json())

Output:

foo.py:21 <module>
    myinst: PydanticModel(
        x=1,
        df=<DataFrame({
            'str_col': <Series({
                0: 'hello',
                1: 'world',
            })>,
        })>,
    ) (PydanticModel)
foo.py:22 <module>
    myinst.dict(): {
        'x': 1,
        'df': <DataFrame({
            'str_col': <Series({
                0: 'hello',
                1: 'world',
            })>,
        })>,
    } (dict) len=2
foo.py:23 <module>
    myinst.json(): '{"x": 1, "df": [{"str_col": "hello"}, {"str_col": "world"}]}' (str) len=60

The .dict() method is not really customizable, but the json_encoders configuration lets you serialize your validated data to JSON by telling pydantic how to handle certain, potentially unknown types.
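As a rough illustration of the mechanism (not pydantic's actual implementation): the traceback earlier in the thread shows the encoder being invoked through json.dumps, so the same idea can be sketched with only the standard library and pandas. The names ENCODERS and encode_unknown are made up for this example. Parsing the resulting string back with json.loads is one possible way to get a plain, fully serializable dict while the original dataframe stays untouched.

```python
import json

import pandas as pd

# Hypothetical registry mirroring the shape of Config.json_encoders:
# a type mapped to a function that returns something JSON can handle.
ENCODERS = {
    pd.DataFrame: lambda frame: frame.to_dict(orient="records"),
}


def encode_unknown(obj):
    """Fallback that json.dumps calls for objects it cannot serialize itself."""
    for known_type, encoder in ENCODERS.items():
        if isinstance(obj, known_type):
            return encoder(obj)
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


valid_df = pd.DataFrame({"str_col": ["hello", "world"]})
payload = {"x": 1, "df": valid_df}

# Serialize with the fallback, then parse back to get a plain dict of records.
text = json.dumps(payload, default=encode_unknown)
plain = json.loads(text)

print(text)
print(plain["df"])
```

Here payload["df"] is still the original dataframe afterwards, so further pandas operations can happen before or after the conversion.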

derinwalters commented 1 year ago

@cosmicBboy thank you so much for your suggestion. Leveraging the Config json_encoders seems like just the thing. I will give this a try and report back.

The use case is a hierarchical data class that I store in MongoDB and process locally. Recently I transitioned from a monolithic Pandas dataframe to lists of Pydantic class dictionaries where I convert to Pandas for manipulation. However, this incurs extra to-from conversion cost that never really seemed ideal. I don't remember exactly how, but last week I stumbled across Pandera and thought to myself "this is exactly what I was looking for!" and so here I am kicking the tires.

cosmicBboy commented 1 year ago

"However, this incurs extra to-from conversion cost that never really seemed ideal"

Yep! This is pretty much the reason I built pandera, though at the time I wasn't aware of pydantic and was doing the same thing with the schema library.

derinwalters commented 1 year ago

I think the proposed solution works well enough for what I was asking. Thanks! I'm having a bit of trouble, though, figuring out how to properly validate columns of list-like and dictionary-like elements, which is rather straightforward in a row-by-row pydantic approach, and I will continue working on that. It looks like you're also already working on providing a default value option in #502, which is great.

cosmicBboy commented 1 year ago

Great!

"I'm having a bit of trouble though with figuring out how to properly validate columns of list-like and dictionary-like elements"

There's this issue, https://github.com/unionai-oss/pandera/issues/260, but for now I'd recommend custom checks:

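class SimpleSchema(pa.SchemaModel):
    # object dtype, since each cell holds a Python list or dict
    list_col: Series[object]
    dict_col: Series[object]

    @pa.check("list_col")
    def check_list(cls, series):
        # element-wise: True where the value is a list
        return series.map(lambda x: isinstance(x, list)) # check any other property about this column

    # each check needs its own method name; reusing check_list would replace the check above
    @pa.check("dict_col")
    def check_dict(cls, series):
        return series.map(lambda x: isinstance(x, dict)) # check any other property about this column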
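
To see what such a check computes, here is a minimal pandas-only sketch, assuming the check returns a boolean Series that marks each element as passing or failing. The helper is_list_of_ints and the sample data are hypothetical and only illustrate checking "any other property" of the elements; the same expression could be returned from a method decorated with @pa.check.

```python
import pandas as pd


def is_list_of_ints(value):
    """Hypothetical element check: the value is a list and every item is an int."""
    return isinstance(value, list) and all(isinstance(item, int) for item in value)


# Sample column with one bad element (a string instead of a list).
list_col = pd.Series([[1, 2], [3], "not a list"], dtype=object)

# Element-wise boolean mask, the kind of result a custom check would return.
mask = list_col.map(is_list_of_ints)

# Rows that would be reported as failures.
failing = list_col[~mask]

print(mask.tolist())
print(failing.index.tolist())
```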