Closed derinwalters closed 1 year ago
hi @derinwalters this is currently unexplored territory, would appreciate clarification on the use cases here.
For the `.dict()` method, is the expectation that the `df` key is turned into a list of records, or some other format? I suspect once `.dict()` works, the `.json()` method should as well.

Are you familiar with how to create custom pydantic types? How does one extend a type within a `BaseModel` so that it can be converted to a JSON-serializable `dict`?
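For context, one way to approach a custom pydantic type is pydantic v1's `__get_validators__` hook. Below is a minimal, hypothetical sketch (the `RecordsFrame` name and coercion logic are just for illustration, not part of pandera or pydantic); the validator is called directly here so the sketch runs with pandas alone:

```python
import pandas as pd


# Hypothetical custom type (pydantic-v1 style): a DataFrame subclass that
# pydantic could coerce from a list of record dicts via its validator hook.
class RecordsFrame(pd.DataFrame):
    @classmethod
    def __get_validators__(cls):
        # pydantic v1 collects validators from this hook;
        # pydantic v2 uses __get_pydantic_core_schema__ instead.
        yield cls.validate

    @classmethod
    def validate(cls, v):
        if isinstance(v, pd.DataFrame):
            return cls(v)
        # Coerce a list of record dicts into a DataFrame.
        return cls(pd.DataFrame.from_records(v))


# Exercising the validator directly, without pydantic:
df = RecordsFrame.validate([{"str_col": "hello"}, {"str_col": "world"}])
print(df.to_dict(orient="records"))
```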
So, looking at the pydantic docs, this will work:
```python
from devtools import debug
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series
import pydantic


class SimpleSchema(pa.SchemaModel):
    str_col: Series[str] = pa.Field(unique=True)


class PydanticModel(pydantic.BaseModel):
    x: int
    df: DataFrame[SimpleSchema]

    class Config:
        json_encoders = {
            pd.DataFrame: lambda x: x.to_dict(orient="records")
        }


valid_df = pd.DataFrame({"str_col": ["hello", "world"]})

myinst = PydanticModel(x=1, df=valid_df)
debug(myinst)
debug(myinst.dict())
debug(myinst.json())
```
Output:

```
foo.py:21 <module>
    myinst: PydanticModel(
        x=1,
        df=<DataFrame({
            'str_col': <Series({
                0: 'hello',
                1: 'world',
            })>,
        })>,
    ) (PydanticModel)
foo.py:22 <module>
    myinst.dict(): {
        'x': 1,
        'df': <DataFrame({
            'str_col': <Series({
                0: 'hello',
                1: 'world',
            })>,
        })>,
    } (dict) len=2
foo.py:23 <module>
    myinst.json(): '{"x": 1, "df": [{"str_col": "hello"}, {"str_col": "world"}]}' (str) len=60
```
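For reference, `to_dict(orient="records")` is the pandas call doing the work in the `json_encoders` lambda above; a quick standalone look at the JSON-friendly shape it produces:

```python
import pandas as pd

df = pd.DataFrame({"str_col": ["hello", "world"]})

# "records" yields one dict per row -- the shape seen in the .json() output.
records = df.to_dict(orient="records")
print(records)  # [{'str_col': 'hello'}, {'str_col': 'world'}]
```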
The `.dict()` method is not really customizable, but the `json_encoders` configuration lets you serialize your validated data to JSON by telling pydantic how to handle certain, potentially unknown types.
@cosmicBboy thank you so much for your suggestion. Leveraging the `Config` `json_encoders` seems like just the thing. I will give this a try and report back.
The use case is a hierarchical data class that I store in MongoDB and process locally. Recently I transitioned from a monolithic Pandas dataframe to lists of Pydantic class dictionaries where I convert to Pandas for manipulation. However, this incurs extra to-from conversion cost that never really seemed ideal. I don't remember exactly how, but last week I stumbled across Pandera and thought to myself "this is exactly what I was looking for!" and so here I am kicking the tires.
> However, this incurs extra to-from conversion cost that never really seemed ideal
Yep! This is pretty much the reason I built pandera, though at the time I wasn't aware of pydantic and was doing the same thing with the `schema` library.
I think the proposed solution works well enough for what I was asking. Thanks! I'm having a bit of trouble, though, with figuring out how to properly validate columns of list-like and dictionary-like elements, which is rather straightforward in a row-by-row pydantic approach, and will continue working on that. Looks like you're also already working on providing a default value option in #502, which is great.
Great!
> I'm having a bit of trouble though with figuring out how to properly validate columns of list-like and dictionary-like elements
There's this issue https://github.com/unionai-oss/pandera/issues/260, but for now I'd recommend custom checks:
```python
import pandera as pa
from pandera.typing import Series


class SimpleSchema(pa.SchemaModel):
    list_col: Series[object]
    dict_col: Series[object]

    @pa.check("list_col")
    def check_list(cls, series):
        # check any other property about this column
        return series.map(lambda x: isinstance(x, list))

    @pa.check("dict_col")
    def check_dict(cls, series):
        # check any other property about this column
        return series.map(lambda x: isinstance(x, dict))
```
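The element-wise predicates those checks rely on can be exercised standalone with plain pandas before wiring them into a schema (a sketch; the column contents here are just for illustration):

```python
import pandas as pd

# Columns of list-like and dict-like elements live in object-dtype Series.
list_col = pd.Series([[1, 2], [3]], dtype=object)
dict_col = pd.Series([{"a": 1}, {"b": 2}], dtype=object)

# Element-wise isinstance checks, mirroring the pandera custom checks above.
all_lists = list_col.map(lambda x: isinstance(x, list)).all()
all_dicts = dict_col.map(lambda x: isinstance(x, dict)).all()
print(all_lists, all_dicts)  # True True
```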
In reading through the Pandera documentation, it's not clear to me how to intermingle Pandera dataframes within a Pydantic model and still be able to use the `.dict()` and `.json()` methods successfully. I followed the steps on https://pandera.readthedocs.io/en/stable/pydantic_integration.html#using-pandera-schemas-in-pydantic-models and love how seamless it is. However, the `.dict()` method keeps the Pandera type, and `.json()` fails altogether.

The solution provided by Pandera's `to_format` is close, but I want to keep the validated dataframe intact while I perform operations, then convert format later (not right away). Is there a way to do this?