unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Dataframe schema from Pydantic record model #764

Closed ejmolinelli closed 2 years ago

ejmolinelli commented 2 years ago

Is your feature request related to a problem? Please describe. I am experiencing heavy code duplication with model definitions in my code base. Currently these include an ORM (SQLAlchemy), schemas (pandera), regular dataclasses for type hints, and pydantic models (basic model validation). I'd like to be able to use pydantic to define a dataframe record and have a dataframe use that model as a schema.

Describe the solution you'd like

from pydantic import BaseModel

class MyRecordModel(BaseModel):
    name: str
    xcoord: int
    ycoord: int

and use this model when creating schema for a dataframe

from pandera.typing import DataFrame

def myfunc(data: DataFrame[MyRecordModel]):
    pass

The linter and type hints should be able to read this so that IntelliSense (in VS Code) can know which columns exist.

e.g. typing data.x should trigger IntelliSense to suggest xcoord.

Describe alternatives you've considered I've tried two solutions so far:

1 - using generics with custom class definitions
2 - manually creating stubs

I couldn't get either approach to work to a satisfactory level. I can share more details if necessary.

Additional context I use models for different reasons in my code base:

1 - ORM -> keeping the database schema and code in sync, and generating objects from SQL queries
2 - dataclasses -> mostly for type hinting and signature annotation
3 - model validation -> validating user input and service output, and sometimes validating dataframes
4 - ORM2 -> interacting with objects in a NoSQL database (e.g. Neo4j, TypeDB)

Creating different models for each purpose is tedious.

cosmicBboy commented 2 years ago

This is an interesting use case @ejmolinelli, and I do want to decompose it into two problems:

  1. deriving a SchemaModel from a pre-defined pydantic.BaseModel
  2. supporting attribute access of the underlying SchemaModel in the DataFrame[MyRecordModel] object

Firstly though, have you seen SQLModel? It seems like it does what you might want (it uses Pydantic under the hood) if you're validating single rows at a time.

That said, I do want to explore interoperability between Pydantic/SQLModel and Pandera. The main benefit of pandera is that it offers validation speed-ups for larger in-memory dataframes, since it exposes pandas-optimized methods via built-in checks (or custom checks, if one knows what one's doing), in addition to integrating with modin, dask, etc. for out-of-memory dataframes.
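
To illustrate, here's a minimal sketch of a vectorized built-in check using pandera's object-based API (the column names and bounds are just illustrative):

import pandas as pd
import pandera as pa

# built-in checks like Check.ge compile down to vectorized pandas
# operations evaluated over whole columns, not row by row
schema = pa.DataFrameSchema({
    "xcoord": pa.Column(int, pa.Check.ge(0)),
    "ycoord": pa.Column(int, pa.Check.ge(0)),
})

schema.validate(pd.DataFrame({"xcoord": [1, 2], "ycoord": [3, 4]}))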

Problem (1) presents some challenges:

Proposal 1: Create a translation layer between BaseModel <> SchemaModel

# user code, as you suggest
class MyRecordModel(BaseModel):
    name: str
    xcoord: int
    ycoord: int

class SchemaModel(pydantic_model=MyRecordModel):
    class Config:  # pandera-specific configuration
        ...

def myfunc(data: DataFrame[SchemaModel]):
    pass

This gives pandera control over which types of BaseModels can be translated to pandera, raising errors in the (TBD) cases where pandera can't convert the BaseModel into a SchemaModel.
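
As a rough sketch of what such a translation layer might do internally (this helper is hypothetical and only handles plain field types, using pydantic v1's __fields__ introspection; mapping custom validators is the hard part):

from typing import Type

import pandera as pa
from pydantic import BaseModel

def schema_from_pydantic(model: Type[BaseModel]) -> pa.DataFrameSchema:
    # hypothetical: map each pydantic field to a pandera Column;
    # custom validators would need their own mapping, which is
    # where the TBD error cases would be raised
    columns = {
        name: pa.Column(field.outer_type_, nullable=not field.required)
        for name, field in model.__fields__.items()
    }
    return pa.DataFrameSchema(columns)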

We'd have to do some research as to how this design affects problem (2), but in theory perhaps we could do this as you originally suggested:

def myfunc(data: DataFrame[MyRecordModel]):
    pass

Where, under the hood, pandera.typing.DataFrame converts the pydantic model into a SchemaModel. But it remains to be seen whether this would work with the attribute-completion use case you had, which brings me to...

Proposal 2: Add functionality to the pandera.mypy plugin to expose SchemaModel fields in a DataFrame[SchemaModel]-annotated function input

We would need to do more research to see if this is possible; here would be a good place to start looking.

What do you think @ejmolinelli? Pinging @jeffzi in case he has thoughts on this too.

jeffzi commented 2 years ago
  1. deriving a SchemaModel from a pre-defined pydantic.BaseModel

I agree the biggest pain point will be translating validators/checks. It would be significantly easier to use a common format, such as json schema, to inter-operate with other "schema" libraries. Pydantic does support json-schema, and pandera already has an open issue for it. Of course, this solution will be limited by the constraints and types supported by json-schema.
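
For reference, pydantic (v1 at the time of writing) can already emit json-schema directly from a model:

from pydantic import BaseModel

class MyRecordModel(BaseModel):
    name: str
    xcoord: int
    ycoord: int

print(MyRecordModel.schema_json(indent=2))
# {
#   "title": "MyRecordModel",
#   "type": "object",
#   "properties": {
#     "name": {"title": "Name", "type": "string"},
#     ...
#   },
#   "required": ["name", "xcoord", "ycoord"]
# }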

  2. supporting attribute access of the underlying SchemaModel in the DataFrame[MyRecordModel] object

The mypy plugin will help with linting but not auto-completion. I'm not sure we can do anything about that without touching the IDE. VS Code supports TypedDict auto-completion with the bracket notation: typing mytypeddict[ triggers completion of key names. See https://github.com/microsoft/pylance-release/issues/654. We could ask the Pylance team for suggestions.

cosmicBboy commented 2 years ago

yeah, I think the best path for this would be to go through this process:

pydantic model -> json schema -> pandera schema

So once #421 is implemented, (1) should be fairly straightforward to address.
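
To make the pipeline concrete, here's a hypothetical two-liner (the first hop works in pydantic today; from_json_schema is not a real pandera API, it's roughly what #421 would need to provide):

import pandera as pa

json_schema = MyRecordModel.schema()  # pydantic -> json schema (works today)
schema = pa.DataFrameSchema.from_json_schema(json_schema)  # hypothetical, pending #421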

As for (2), thanks for pointing this out @jeffzi

The mypy plugin will help with linting but not auto-completion.

I'd love contributions to provide this auto-completion support, but this is out of my wheelhouse to implement... if anyone in the pandera community would be interested in this I'd wholeheartedly support it!

ejmolinelli commented 2 years ago

Hi @cosmicBboy and @jeffzi, thanks for your consideration.

I think the API you suggest @cosmicBboy is sufficient for my purposes.

Is there not a way to use pydantic's own validators when constructing the schema? As to @jeffzi's comment, ideally there is a common format, but without such a standard could pandera not simply execute pydantic's validators on each record? I'm not familiar with the inner workings of pandera, so I'm not sure if this is possible.

cosmicBboy commented 2 years ago

ideally there is a common format, but without such a standard could pandera not simply execute pydantic's validators on each record

This is possible, but the challenge is that pydantic is a parsing + validation library: it coerces types of each element in the record and then applies any user-defined custom validation rules.

Naively, you could do something like:

import pandas as pd
import pandera as pa

# MyRecordModel is the pydantic model from the original request

class Schema(pa.SchemaModel):
    @pa.dataframe_check
    def check_record(cls, df: pd.DataFrame) -> pa.typing.Series[bool]:

        def _check_row(row: pd.Series) -> bool:
            try:
                # make sure the row passes pydantic parsing/validation
                MyRecordModel(**row)
                return True
            except Exception:
                return False

        # apply the pydantic model row-wise, returning a boolean series
        return df.apply(_check_row, axis="columns")

However, because pandera's validation model is to return a boolean, a boolean Series, or a boolean DataFrame indicating which elements failed (False) or succeeded (True), the actual pydantic-parsed result will not be available on the other end of pandera's validation process.

To achieve that, one idea is to implement a PydanticModel data type that can be applied at the DataFrame level: https://github.com/pandera-dev/pandera/pull/779 is a working prototype that enables this:

import pandas as pd
import pandera as pa
from pydantic import BaseModel

from pandera.engines.pandas_engine import PydanticModel

class Record(BaseModel):
    name: str
    xcoord: int
    ycoord: int

class Schema(pa.SchemaModel):
    class Config:
        dtype = PydanticModel(Record)
        coerce = True

@pa.check_types
def func(df: pa.typing.DataFrame[Schema]):
    return df

df = pd.DataFrame({
    "name": ["foo", "bar", "baz"],
    "xcoord": [1, 2, "c"],
    "ycoord": [4, 5, "d"],
})

print(func(df))
# pandera.errors.SchemaError: error in check_types decorator of function 'func':
# Error while coercing 'Schema' to type <class '__main__.Record'>: Could not
# coerce <class 'pandas.core.frame.DataFrame'> data_container into type
# <class '__main__.Record'>
#    index                    failure_case
# 0      2  {'xcoord': 'c', 'ycoord': 'd'}

@jeffzi @ejmolinelli would you mind reviewing #779? I think it's actually quite nice to use DataTypes for this, great job on this @jeffzi ! Didn't realize how flexible it would be :)

ejmolinelli commented 2 years ago

ok. I'm installing dev and looking through tests now. It may take me a day or so to review.

jeffzi commented 2 years ago

ideally there is a common format

JSON schema is a common format that supports limited validations (see the reference). Pydantic can already output json schema, and there is an external library, datamodel-code-generator, that transforms json schema into pydantic models. If we write pandera extensions for json schema, we could do pydantic model <-> json-schema <-> pandera. We could also add utility functions that hide the json-schema step, e.g. pandera.DataFrameSchema.to_pydantic().

but without such a standard could pandera not simply execute pydantic's validators on each record

As @cosmicBboy demonstrated, that is possible but very inefficient. It may suffice if you have few rows, but translating to pandas vectorized operations (as pandera does) is optimal for larger datasets.

That said, we can warn about the shortcomings in the documentation. #779 is still useful if the pydantic model has complex validations not supported by the json-schema method.

I think it's actually quite nice to use DataTypes for this, great job on this

Thanks! There is actually a shortcoming to DataType that has been bugging me for a while. I'll explain it in #779.

ejmolinelli commented 2 years ago

Hey @cosmicBboy and @jeffzi, I got a chance to fork pandera and get the tests up and running.

1 - The test works and the user-facing API is sufficient.

2 - A TypeAlias or # type: ignore is needed in the code to circumvent mypy/pylance complaints. This is fine, as it is also the case with pydantic when using things like callables to define a type (e.g. pydantic.constr).

3 - mypy/pylance still complains about the valid_df argument to the function, even though the test passes.

(screenshot: mypy/pylance error on the valid_df argument)

I'm not sure if pandera already has a way to cast or assert this dataframe so that I don't have to # type: ignore that line? I know I could do the following,

(screenshot: alternative that constructs the dataframe with runtime validation)

... but I'd rather have an assert, or the equivalent of pydantic's BaseModel.construct, to be able to build the DataFrame without runtime validation. For example, in a production environment I can guarantee that objects coming from a datastore are valid, and I'd like to skip validation to save time.

ejmolinelli commented 2 years ago

I suppose this works

(screenshot: the working approach)

jeffzi commented 2 years ago

Thanks for your comments @ejmolinelli.

I'm not sure if pandera already has a way to cast or assert this dataframe so that I don't have to # type: ignore that line?

There is an experimental mypy plugin included in pandera. Better static linting is actively explored. You can read the current best practices in the documentation here.

Here are 2 approaches:

import pandas as pd
from pydantic import BaseModel
from typing import cast
import pandera as pa
from pandera.engines.pandas_engine import PydanticModel

class Record(BaseModel):
    """Pydantic record model."""

    name: str

class PydanticSchema(pa.SchemaModel):
    """Pandera schema using the pydantic model."""

    class Config:
        """Config with dataframe-level data type."""

        dtype = PydanticModel(Record)
        coerce = True

# Approach 1: Accept any dataframe and explicitly `cast` to the expected type.
# We know that the dataframe will be coerced by `check_types`.
@pa.check_types
def catchall(df: pd.DataFrame) -> pa.typing.DataFrame[PydanticSchema]:
    return cast(pa.typing.DataFrame[PydanticSchema], df)

vanilla_df = pd.DataFrame({"name": ["foo", "bar", "baz"]})

catchall(vanilla_df)

# Approach 2: Typed dataframe as input and use pandera.typing.DataFrame[PydanticSchema]()
# to create the dataframe, can also cast(pa.typing.DataFrame[PydanticSchema], df)
# afterwards.
@pa.check_types
def func(
    df: pa.typing.DataFrame[PydanticSchema],
) -> pa.typing.DataFrame[PydanticSchema]:
    return df

typed_df = pa.typing.DataFrame[PydanticSchema]({"name": ["foo", "bar", "baz"]})
func(typed_df)
df = cast(pa.typing.DataFrame[PydanticSchema], vanilla_df)
func(df)

The choice of approach depends on the expected input of your function. In my example, catchall accepts any DataFrame and returns a coerced dataframe (note the cast on the returned dataframe), whereas the second function must receive an already-valid dataframe.

...i'd rather assert or the equivalent of pydantic's BaseModel.construct to be able to build the DataFrame without runtime validation.

I think that makes sense but should probably be addressed in a separate issue. @cosmicBboy probably has an opinion about this.

cosmicBboy commented 2 years ago

...i'd rather assert or the equivalent of pydantic's BaseModel.construct to be able to build the DataFrame without runtime validation.

I'm open to this as a feature, esp. since it's part of the pydantic API and the use case makes sense. Please feel free to open up another feature request issue @ejmolinelli !

rtbs-dev commented 2 years ago

I must have missed this looking through the issues list last time, but I wanted to link some observations/comments I had here on using pandera via Pydantic models. Not much to add, but I'm a heavy user of this use case, so I'm happy to help as needed.

cosmicBboy commented 2 years ago

hi @tbsexton, your use-case should be fulfilled by this PR: https://github.com/pandera-dev/pandera/pull/779

It gives you a way of specifying a PydanticModel(MyRecord) in the DataFrameSchema(dtype=...) constructor. You need to specify coerce=True so that pandera will apply the pydantic model in a row-wise fashion. The main caveat is that those checks may not be as fast as an equivalent pandera model, which uses pandas-vectorized methods for built-in checks (though I haven't benchmarked this yet...)
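
For anyone who wants to measure it, here's a rough benchmarking sketch (illustrative only: the schemas and row count are made up, and results will vary):

import timeit

import pandas as pd
import pandera as pa
from pydantic import BaseModel
from pandera.engines.pandas_engine import PydanticModel

class Record(BaseModel):
    name: str
    xcoord: int
    ycoord: int

n = 10_000
df = pd.DataFrame({"name": ["a"] * n, "xcoord": range(n), "ycoord": range(n)})

# row-wise: every row is parsed/validated by the pydantic model
pydantic_schema = pa.DataFrameSchema(dtype=PydanticModel(Record), coerce=True)

# vectorized: column dtypes checked with pandas-optimized operations
vectorized_schema = pa.DataFrameSchema({
    "name": pa.Column(str),
    "xcoord": pa.Column(int),
    "ycoord": pa.Column(int),
})

print("pydantic row-wise: ", timeit.timeit(lambda: pydantic_schema.validate(df), number=1))
print("pandera vectorized:", timeit.timeit(lambda: vectorized_schema.validate(df), number=1))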

To overcome this potential performance loss, we'll look into a pydantic <-> jsonschema <-> pandera solution, where the converters from jsonschema <-> pandera need to be implemented here https://github.com/pandera-dev/pandera/issues/421. Let me know if you'd be open to making a contribution for that issue!

cosmicBboy commented 2 years ago

Closing this issue; opened #802 for the deeper integration.