Closed ejmolinelli closed 2 years ago
This is an interesting use case @ejmolinelli, and I do want to decompose it into two problems:

1. deriving a `SchemaModel` from a pre-defined `pydantic.BaseModel`
2. supporting attribute access of the underlying `SchemaModel` in the `DataFrame[MyRecordModel]` object

First though, have you seen SQLModel? It seems like it does what you might want (it uses Pydantic under the hood) if you're validating single rows at a time.
That said, I do want to explore interoperability between Pydantic/SQLModel and Pandera. The main benefit of using pandera is that it offers validation speed-ups for larger in-memory dataframes, since it exposes pandas-optimized methods via built-in checks (or custom checks, if one knows what one's doing), in addition to integrating with modin, dask, etc. for out-of-memory dataframes.
Problem (1), the `BaseModel` <> `SchemaModel` translation, presents some challenges:
```python
# user code, as you suggest
class MyRecordModel(BaseModel):
    name: str
    xcoord: int
    ycoord: int


class SchemaModel(pydantic_model=MyRecordModel):
    class Config:  # pandera-specific configuration
        ...


def myfunc(data: DataFrame[SchemaModel]):
    pass
```
This gives pandera control over what types of `BaseModel`s can be translated to pandera, throwing errors in the TBD cases where pandera can't convert the `BaseModel` into a `SchemaModel`.
We'd have to do some research as to how this design affects problem (2), but in theory perhaps we could do this as you originally suggested:
```python
def myfunc(data: DataFrame[MyRecordModel]):
    pass
```
Where, under the hood, `pandera.typing.DataFrame` converts the pydantic model into a `SchemaModel`. It remains to be seen whether this would work with the attribute-completion use case you had, which brings me to...
Problem (2): a `pandera.mypy` plugin to expose `SchemaModel` fields in a `DataFrame[SchemaModel]`-annotated function input. Would need to do more research to see if this is possible; here would be a good place to start looking.
What do you think @ejmolinelli? Pinging @jeffzi in case he has thoughts on this too.
> deriving a `SchemaModel` from a pre-defined `pydantic.BaseModel`
I agree the biggest pain point will be translating validators/checks. It would be significantly easier to use a common format, such as json schema, to inter-operate with other "schema" libraries. Pydantic does support json schema, and Pandera already has an open issue. Of course, this solution will be limited by the constraints and types supported by json schema.
> supporting attribute access of the underlying `SchemaModel` in the `DataFrame[MyRecordModel]` object
The mypy plugin will help with linting but not auto-completion. I'm not sure we can do anything about that without touching the IDE. VSCode supports `TypedDict` auto-completion with the bracket notation: for example, typing `mytypeddict[` triggers completion of key names. See https://github.com/microsoft/pylance-release/issues/654. We could ask the Pylance team for suggestions.
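To illustrate the `TypedDict` behavior mentioned above, here's a minimal stdlib-only sketch (the class and key names are illustrative, not part of any pandera API):

```python
from typing import TypedDict


class RecordDict(TypedDict):
    """Illustrative TypedDict mirroring a pydantic record's fields."""

    name: str
    xcoord: int
    ycoord: int


row: RecordDict = {"name": "foo", "xcoord": 1, "ycoord": 2}

# In VSCode/Pylance, typing `row["` at this point pops up the key names
# ("name", "xcoord", "ycoord") -- the bracket-notation completion
# referenced in the Pylance issue above.
print(row["xcoord"])  # 1
```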
yeah, I think the best path for this would be to go through this process:
`pydantic model -> json schema -> pandera schema`
So when #421 is implemented (1) should be fairly straightforward to address.
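The first arrow of that pipeline already works today; as a hedged sketch (pydantic renamed its schema-export method between v1 and v2, so this probes for both):

```python
from pydantic import BaseModel


class Record(BaseModel):
    name: str
    xcoord: int
    ycoord: int


# pydantic v1 exposes .schema(); v2 renamed it to .model_json_schema()
to_json_schema = getattr(Record, "model_json_schema", None) or Record.schema
json_schema = to_json_schema()

# the json schema -> pandera direction is the part tracked in #421
print(json_schema["properties"]["xcoord"]["type"])  # integer
```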
As for (2), thanks for pointing this out @jeffzi
> The mypy plugin will help with linting but not auto-completion.
I'd love contributions to provide this auto-completion support, but this is out of my wheelhouse to implement... if anyone in the pandera community would be interested in this I'd wholeheartedly support it!
Hi @cosmicBboy and @jeffzi Thanks for your considerations.
I think the API you suggest @cosmicBboy is sufficient for my purposes.
Is there not a way to use pydantic's own validators when constructing the schema? As to @jeffzi's comment, ideally there would be a common format, but without such a standard could pandera not simply execute pydantic's validators on each record? I'm not familiar with the inner workings of pandera, so I'm not sure if this is possible.
> ideally there is a common format, but without such a standard could pandera not simply execute pydantic's validators on each record
This is possible, but the challenge is that pydantic is a parsing + validation library: it coerces types of each element in the record and then applies any user-defined custom validation rules.
Naively, you could do something like:
```python
import pandas as pd
import pandera as pa


class Schema(pa.SchemaModel):

    @pa.dataframe_check
    def check_record(cls, df: pd.DataFrame) -> pa.typing.Series[bool]:
        def _check_row(row) -> bool:
            try:
                # make sure the row passes pydantic parsing/validation
                MyRecordModel(**row)
                return True
            except Exception:
                return False

        return df.apply(_check_row, axis="columns")
```
However, because pandera's validation model is to return a boolean, a boolean Series, or a boolean DataFrame indicating which elements failed (False) or succeeded (True), the actual pydantic-parsed result will not be available on the other end of pandera's validation process.
To achieve that, one idea is to implement a `PydanticModel` data type that can be applied at the dataframe level: https://github.com/pandera-dev/pandera/pull/779 is a working prototype that enables this:
```python
import pandas as pd
import pandera as pa
from pydantic import BaseModel

from pandera.engines.pandas_engine import PydanticModel


class Record(BaseModel):
    name: str
    xcoord: int
    ycoord: int


class Schema(pa.SchemaModel):
    class Config:
        dtype = PydanticModel(Record)
        coerce = True


@pa.check_types
def func(df: pa.typing.DataFrame[Schema]):
    return df


df = pd.DataFrame({
    "name": ["foo", "bar", "baz"],
    "xcoord": [1, 2, "c"],
    "ycoord": [4, 5, "d"],
})

print(func(df))
# pandera.errors.SchemaError: error in check_types decorator of function 'func':
# Error while coercing 'Schema' to type <class '__main__.Record'>: Could not
# coerce <class 'pandas.core.frame.DataFrame'> data_container into type
# <class '__main__.Record'>
#    index                    failure_case
# 0      2  {'xcoord': 'c', 'ycoord': 'd'}
```
@jeffzi @ejmolinelli would you mind reviewing #779? I think it's actually quite nice to use `DataType`s for this, great job on this @jeffzi! Didn't realize how flexible it would be :)
ok. I'm installing dev and looking through tests now. It may take me a day or so to review.
> ideally there is a common format
Json schema is a common format that supports limited validations (see the reference). Pydantic can already output json schema, and there is an external lib, datamodel-code-generator, to transform json schema into a pydantic model. If we write pandera extensions for json schema, we could do `pydantic model <-> json-schema <-> pandera`. We could also have utility functions to hide the json-schema step, i.e. `pandera.DataFrameSchema.to_pydantic()`.
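As a dependency-free sketch of what the `json-schema <-> pandera` converter would have to do at its core, here's the basic type-mapping step (the mapping table and function name are illustrative, not pandera API; a real converter would also translate constraints into pandera Checks):

```python
# Illustrative json-schema type -> pandas dtype mapping; the pandera side
# of a real converter is tracked in #421 and would also handle constraints
# like minimum/maximum/pattern.
JSON_TYPE_TO_PANDAS_DTYPE = {
    "string": "object",
    "integer": "int64",
    "number": "float64",
    "boolean": "bool",
}


def columns_from_json_schema(json_schema: dict) -> dict:
    """Map each json-schema property to a pandas dtype string."""
    return {
        name: JSON_TYPE_TO_PANDAS_DTYPE.get(prop.get("type"), "object")
        for name, prop in json_schema.get("properties", {}).items()
    }


record_schema = {
    "properties": {
        "name": {"type": "string"},
        "xcoord": {"type": "integer"},
        "ycoord": {"type": "integer"},
    }
}
print(columns_from_json_schema(record_schema))
# {'name': 'object', 'xcoord': 'int64', 'ycoord': 'int64'}
```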
> but without such a standard could pandera not simply execute pydantic's validators on each record
As @cosmicBboy demonstrated, that is possible but very inefficient. It may suffice if you have few rows, but translating to pandas vectorized operations (as pandera does) is optimal for larger datasets.
That said, we can warn about the shortcomings in the documentation. #779 is still useful if the pydantic model has complex validations not supported by the json-schema method.
> I think it's actually quite nice to use DataTypes for this, great job on this
Thanks! There is actually a shortcoming to `DataType` that has been bugging me for a while. I'll explain it in #779.
Hey @cosmicBboy and @jeffzi I got a chance to fork pandera and get the tests up and running.
1 - The test works and the user-facing API is sufficient.
2 - Need `TypeAlias` or `# type: ignore` in the code to circumvent mypy/pylance complaints. This is fine, as it is also the case with Pydantic when using things like callables to define a type (e.g. `pydantic.constr`).
3 - mypy/pylance still complains about the `valid_df` argument to the function, even though the test passes.
I'm not sure if pandera already has a way to cast or assert this dataframe so that I don't have to `# type: ignore` that line? I know I could do the following,
...but I'd rather assert, or have the equivalent of pydantic's `BaseModel.construct`, to be able to build the DataFrame without runtime validation. For example, in a production environment I can guarantee that objects coming from a datastore are valid, and I'd like to skip validation to save time.
I suppose this works
Thanks for your comments @ejmolinelli.
> I'm not sure if pandera already has a way to cast or assert this dataframe so that I don't have to `# type: ignore` that line?
There is an experimental mypy plugin included in pandera, and better static linting is being actively explored. You can read the current best practices in the documentation here.
Here are 2 approaches:
```python
import pandas as pd
from pydantic import BaseModel
from typing import cast

import pandera as pa
from pandera.engines.pandas_engine import PydanticModel


class Record(BaseModel):
    """Pydantic record model."""

    name: str


class PydanticSchema(pa.SchemaModel):
    """Pandera schema using the pydantic model."""

    class Config:
        """Config with dataframe-level data type."""

        dtype = PydanticModel(Record)
        coerce = True


# Approach 1: Accept any dataframe and explicitly `cast` to the expected type.
# We know that the dataframe will be coerced by `check_types`.
@pa.check_types
def catchall(df: pd.DataFrame) -> pa.typing.DataFrame[PydanticSchema]:
    return cast(pa.typing.DataFrame[PydanticSchema], df)


vanilla_df = pd.DataFrame({"name": ["foo", "bar", "baz"]})
catchall(vanilla_df)


# Approach 2: Typed dataframe as input; use pa.typing.DataFrame[PydanticSchema]()
# to create the dataframe, or cast(pa.typing.DataFrame[PydanticSchema], df)
# afterwards.
@pa.check_types
def func(
    df: pa.typing.DataFrame[PydanticSchema],
) -> pa.typing.DataFrame[PydanticSchema]:
    return df


typed_df = pa.typing.DataFrame[PydanticSchema]({"name": ["foo", "bar", "baz"]})
func(typed_df)

df = cast(pa.typing.DataFrame[PydanticSchema], vanilla_df)
func(df)
```
The choice of approach depends on the expected input of your function. In my example, `catchall` accepts any DataFrame and returns a coerced dataframe (note the `cast` on the returned dataframe), whereas the second function must receive a valid dataframe.
> ...i'd rather assert or the equivalent of pydantic's BaseModel.construct to be able to build the DataFrame without runtime validation.
I think that makes sense but should probably be addressed in a separate issue. @cosmicBboy probably has an opinion about this.
> ...i'd rather assert or the equivalent of pydantic's BaseModel.construct to be able to build the DataFrame without runtime validation.
I'm open to this as a feature, esp. since it's part of the pydantic API and the use case makes sense. Please feel free to open up another feature request issue @ejmolinelli !
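For reference, the pydantic behavior being asked for as an analogue: `BaseModel.construct()` (renamed `model_construct()` in pydantic v2) builds an instance without running any parsing or validation. A minimal sketch:

```python
from pydantic import BaseModel


class Record(BaseModel):
    name: str
    xcoord: int


# construct()/model_construct() skips all validation, trusting the caller;
# the requested pandera feature would skip schema checks the same way.
make = getattr(Record, "model_construct", None) or Record.construct
record = make(name="foo", xcoord=1)
print(record.xcoord)  # 1
```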
I must have missed this looking through the issues list last time, but I wanted to link some observations/comments I had here on using pandera via Pydantic models. Not much to add but I'm a heavy user of this use-case so I'm happy to help as needed
hi @tbsexton, your use case should be fulfilled by this PR: https://github.com/pandera-dev/pandera/pull/779

It gives you a way of specifying a `PydanticModel(MyRecord)` in the `DataFrameSchema(dtype=...)` constructor. You need to specify `coerce=True` so that pandera will apply the pydantic model in a row-wise fashion. The main caveat here is that those checks may not be as fast as the equivalent pandera model, which uses pandas-vectorized methods for built-in checks (though I haven't benchmarked this yet...)
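To make that caveat concrete, here's a hedged comparison of the two validation styles on a plain pandas DataFrame (not pandera API, just the underlying operations; function names are illustrative):

```python
import pandas as pd
from pydantic import BaseModel, ValidationError


class Record(BaseModel):
    xcoord: int


df = pd.DataFrame({"xcoord": [1, 2, "c"]})


# Row-wise: one pydantic parse per row, roughly what applying a pydantic
# model per record costs.
def rowwise_valid(frame: pd.DataFrame) -> pd.Series:
    def check(row) -> bool:
        try:
            Record(**row)
            return True
        except ValidationError:
            return False

    return frame.apply(check, axis="columns")


# Vectorized: a single pandas operation over the whole column.
def vectorized_valid(frame: pd.DataFrame) -> pd.Series:
    return pd.to_numeric(frame["xcoord"], errors="coerce").notna()


print(rowwise_valid(df).tolist())     # [True, True, False]
print(vectorized_valid(df).tolist())  # [True, True, False]
```

Both flag the same invalid row; the vectorized path avoids per-row Python overhead, which is what pandera's built-in checks exploit on large frames.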
To overcome this potential performance loss, we'll look into a `pydantic <-> jsonschema <-> pandera` solution, where the converters for `jsonschema <-> pandera` need to be implemented here: https://github.com/pandera-dev/pandera/issues/421. Let me know if you'd be open to making a contribution for that issue!
Closing this issue, opened up #802 for the deeper integration
Is your feature request related to a problem? Please describe. I am experiencing heavy code duplication with model definitions in my code base. Currently this means ORM (Sqlalchemy), Schema (Pandera), regular dataclasses for type hints, and pydantic (basic model validation). I'd like to be able to use pydantic to define a dataframe record and have a dataframe use that model as a schema.
Describe the solution you'd like
and use this model when creating the schema for a dataframe. The linter and type hints should be able to read this so that intellisense (in vscode) can know which columns exist, e.g. typing `data.x` will trigger intellisense to identify `xcoord`.
Describe alternatives you've considered I've tried two solutions so far: 1 - using generics with custom class definitions, 2 - manually creating stubs. I couldn't get either approach to work to a satisfactory level. I can share more details if necessary.
Additional context I use models for different reasons in my code base:
1 - ORM -> keeping database schema and code in sync and generating objects from sql queries
2 - dataclasses -> mostly for typehinting and signature annotation
3 - model validation -> validating user input and service output, and sometimes validating dataframes
4 - ORM2 -> interacting with objects in a nosql database (e.g. Neo4j, TypeDB)
Creating different models for each purpose is tedious.