unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Generate SchemaModels from DataFrameSchemas #393

Open ericmjl opened 3 years ago

ericmjl commented 3 years ago

Is your feature request related to a problem? Please describe.

This is a thought that came to mind today while using pandera. Since it is possible to effectively compile a SchemaModel into a DataFrameSchema, is it easy to go in the reverse direction?

I was thinking of this because sometimes I have to generate the dataframe that I want first, before I know what the exact schema ought to be. (Though ideally, I would declare the SchemaModel by hand first, then verify that my data are what they ought to be, and iterate.)

jeffzi commented 3 years ago

I'm not sure I follow. Can you give a use case where you want to go from DataFrameSchema to SchemaModel? Your second paragraph sounds like you want to generate a SchemaModel (or DataFrameSchema) from a DataFrame.

You could also use the new Data Synthesis Strategies to play with the dataframe. That way the schema is defined first.
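
For reference, the data synthesis workflow looks roughly like this; a minimal sketch assuming the strategies extra (i.e. hypothesis) is installed:

    import pandera as pa

    schema = pa.DataFrameSchema(
        {
            "a": pa.Column(int, pa.Check.ge(0)),
            "b": pa.Column(str),
        }
    )

    # sample a dataframe that satisfies the schema, so the schema comes first
    df = schema.example(size=3)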

ericmjl commented 3 years ago

Thanks for getting back, @jeffzi!

I think you are correct, I'm thinking of generating a SchemaModel from a pandas DataFrame. Perhaps the only reason I thought of converting from DataFrameSchema to SchemaModel is that I saw in the docs that going the other way around, from SchemaModel to DataFrameSchema, is possible.

At the moment, pa.infer_schema() gives us a DataFrameSchema object (and .to_script() a script that reconstructs it), but not a SchemaModel. In line with my other comment in #392, I was thinking along the lines of defining SchemaModels as data types so that I can map the data flow graph in my code. The most convenient way for me is to leverage the exploratory data analysis that I do on real data first, before inferring a schema and then cleaning up the schema by hand.
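
For concreteness, a minimal sketch of the current workflow on a toy dataframe:

    import pandas as pd
    import pandera as pa

    df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

    schema = pa.infer_schema(df)  # returns a DataFrameSchema
    print(schema.to_script())     # emits python source that defines the DataFrameSchema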

Is SchemaModel generation from an existing dataframe on the radar at the moment?

cosmicBboy commented 3 years ago

thanks for elaborating @ericmjl!

I think it would be a useful feature to offer to_script functionality for SchemaModel objects.

Currently, the way this is done for DataFrameSchema objects is that the schema metadata is formatted according to a string template, which is then written out to a file and formatted by black.
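
Illustratively, the approach looks something like the sketch below. The template and helper here are hypothetical stand-ins, not pandera's actual implementation; black.format_str is the real formatting entry point:

    import black

    # hypothetical string template for generated schema source
    TEMPLATE = """
    from pandera import Column, DataFrameSchema

    schema = DataFrameSchema(columns={{{columns}}})
    """

    def schema_to_script(columns: dict) -> str:
        # render column metadata into the string template
        rendered = ", ".join(
            f'"{name}": Column({dtype})' for name, dtype in columns.items()
        )
        code = TEMPLATE.format(columns=rendered)
        # format the generated source with black before writing it to a file
        return black.format_str(code, mode=black.FileMode())

    print(schema_to_script({"a": "int", "b": "str"}))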

To achieve this with SchemaModels we could:

  1. add a DataFrameSchema.to_model method and a SchemaModel.to_script method. This way we can do `pa.infer_schema().to_model().to_script()`.
  2. implement an infer_schema_model function that returns a SchemaModel: https://github.com/pandera-dev/pandera/blob/master/pandera/schema_inference.py#L54-L75. With this, it'd be `pa.infer_schema_model().to_script()`.

As usual, these aren't mutually exclusive, but I rather like (1) since we already have a SchemaModel.to_schema method, so going the other way 'round seems logical to me.

jeffzi commented 3 years ago

I also prefer (1) because you cannot easily tweak a SchemaModel. As highlighted by @ericmjl, you usually want to inspect and modify the inferred schema.

In summary, if we have an existing DataFrame available, the workflow would be:

  1. pa.infer_dataframe_schema(DataFrame) -> pa.DataFrameSchema
  2. Inspect the DataFrameSchema and tweak it. There are already methods for altering a DataFrameSchema (see the sketch after this list).
  3. Export the result: a. call DataFrameSchema.to_script() for a DataFrameSchema script, or b. call DataFrameSchema.to_model().to_script() for a SchemaModel script.
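
A minimal sketch of steps 1 and 2 with today's API:

    import pandas as pd
    import pandera as pa

    df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

    # 1. infer a schema from an existing dataframe
    schema = pa.infer_schema(df)

    # 2. inspect and tweak it; these methods return modified copies
    schema = schema.update_column("a", checks=pa.Check.ge(0))
    schema = schema.remove_columns(["b"])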

Option (2) proposed by @cosmicBboy would be a wrapper for 1. + 3.b.

Regarding 1, @cosmicBboy mentioned he would like to infer a more comprehensive set of data types and checks from "good" data. My question would be: can we offload the data "inference" to one of the many data exploration libraries and export their results to a pandera schema? Most of those libraries target HTML export and do not offer a nice API for inspecting the generated report (e.g. pandas-profiling).

Judging from the GH issues, SchemaModel seems rather popular. Should DataFrameSchema.to_model() have priority over better inference?

cosmicBboy commented 3 years ago

> My question would be: can we offload the data "inference" to one of the many data exploration libraries and export their results to a pandera schema?

There are trade-offs associated with outsourcing work to other packages. For example, if you look far enough back in pandera's commit history, the validation logic was offloaded to the schema package, but its abstractions quickly became unsuitable for pandera.

In a similar way, my main concerns are:

  1. supporting non-mature profiling packages. I'm only aware of pandas-profiling, which I'd be okay with supporting. ProfileReport has a to_json method that at least gets us a machine-readable format for profile reports, though JSON isn't great since it doesn't really enforce a reliable interface (unless there's a JSON schema associated with it).
  2. bottlenecks: I think the direction pandera is going in is expanding its scope from pure data validation to "making data testing easier for DS/ML practitioners", so it would make sense to me to have schema inference be a core part of the package. There are also risks/maintenance costs associated with breaking changes in the dependency, and we'd be able to move much faster implementing not-yet-supported schema inference statistics natively in pandera. It's also important to have control over the performance/runtime of schema inference for UX reasons.
  3. There's a large gap between the statistics gathered by pandas-profiling and the statistics supported by pandera (by design, inferred schemas currently only support built-in checks), so I'd rather pandera implement exactly what it needs without relying on external packages.

That said, I think we should strongly consider supporting imports from e.g. pandas-profiling. There's a clear synergy between these two packages and it would certainly help pandera in terms of adoption as well.
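
To make the import idea concrete, here's a sketch of what consuming a profile report could look like. Only ProfileReport and to_json are real pandas-profiling APIs here; the report's exact JSON layout is an assumption:

    import json

    import pandas as pd
    from pandas_profiling import ProfileReport

    df = pd.DataFrame({"a": [1, 2, 3]})
    report = json.loads(ProfileReport(df).to_json())

    # NOTE: the key names below are assumptions about the report layout,
    # not a documented pandas-profiling contract
    for name, stats in report.get("variables", {}).items():
        print(name, stats.get("type"))  # map these onto pandera dtypes/checks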

> Should DataFrameSchema.to_model() have priority over better inference?

In terms of actionable next steps:

m-richards commented 3 years ago

Just wanted to say I think having to_model would be really nice. I've been using the SchemaModel side of pandera quite a bit and really like the type-annotation style of validation. Recently I've been using the transformation features of DataFrameSchema and would love to be able to define SchemaModels from these, to keep using type annotations as I have been (check_io is fine, it's just not as elegant). [I know you can inherit from SchemaModels, but that gets confusing beyond adding a few columns.]

rtbs-dev commented 2 years ago

Hi @cosmicBboy, just wanted to chime in here. Coming from the data science side, I've been using pydantic a lot lately to model the observations themselves, especially when there are hierarchies and/or validation involved, and I get a lot of I/O and web compatibility for free (e.g. spaCy uses pydantic/FastAPI now).

However, the vast majority of the time I want to actually analyze or process data as pandas dataframes and/or arrays. I can actually do this pretty efficiently using DataFrame.from_records() on an iterable of BaseModel-like objects... this automatically flattens everything out beautifully and I know my data is nice and clean.
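
Concretely, the pattern described here is something like this sketch (Record is a made-up model):

    import pandas as pd
    from pydantic import BaseModel

    class Record(BaseModel):  # hypothetical row model
        name: str
        value: float

    records = [Record(name="a", value=1.0), Record(name="b", value=2.0)]

    # row-centric pydantic objects -> a column-centric dataframe
    df = pd.DataFrame.from_records([r.dict() for r in records])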

Pandera has been on my team's list of things to test out and use for a while, and I have liked using it for testing a lot (still figuring out where pandera ends and datatest begins, tbh). But the pydantic interop seems to assume that a pandera model will always be "inside" a pydantic one, while in the vast majority of cases I end up using pydantic to model individuals and pandera to manage collections (which is a heck of a lot nicer than using __root__=CustomIterable inside a BaseModel).

Summary, or, what I think would be great for pandera:

- workflow to treat pydantic basemodel-like objects as row-centric view of data, and pandera schemas as column-centric
- generate a pandera model by "flattening" a pydantic one?
- allow groups of columns to be typed as pydantic objects?
- roundtrip between iterables of identical pydantic models and a validated pandera-approved dataframe, and back

Would be happy to contribute or just show some examples, but I didn't want to forget to mention this again before I get derailed working on the next thing :sweat_smile:

cosmicBboy commented 2 years ago

hey @tbsexton, as discussed in https://github.com/pandera-dev/pandera/issues/764#issuecomment-1073069161, the first use-case is fulfilled with the pandas_engine.PydanticModel type:

> Workflow to treat pydantic basemodel-like objects as row-centric view of data, and pandera schemas as column-centric
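
The usage from that discussion looks roughly like this sketch, following the documented PydanticModel dtype (Record is a made-up model):

    import pandera as pa
    from pandera.engines.pandas_engine import PydanticModel
    from pydantic import BaseModel

    class Record(BaseModel):  # hypothetical row model
        name: str
        value: float

    class Schema(pa.SchemaModel):
        """Validate each row of a dataframe against the pydantic model."""

        class Config:
            dtype = PydanticModel(Record)
            coerce = True  # the PydanticModel dtype requires coercion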

For the first interop use case:

> generate a pandera model by "flattening" a pydantic one?

That would effectively be handled by a pydantic <-> jsonschema <-> pandera flow; the main blocker here is the ability to import/export a jsonschema schema to/from a pandera schema.
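
The pydantic-to-jsonschema leg already exists; the jsonschema-to-pandera leg is the open piece. A sketch with pydantic's v1 API and a made-up Record model:

    from pydantic import BaseModel

    class Record(BaseModel):  # hypothetical model
        name: str
        value: float

    # pydantic -> jsonschema: built into pydantic
    json_schema = Record.schema()  # dict; Record.schema_json() gives a string

    # jsonschema -> pandera: the missing import/export piece discussed above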

> allow groups of columns to be typed as pydantic objects?

What's the problem statement for this use case? i.e. when is it the case that I want to validate only a subset of dataframe columns with a pydantic model?

> roundtrip between iterables of identical pydantic models and a validated pandera-approved dataframe, and back

Also need to understand the use case here... it seems like an iterable of pydantic models has presumably been validated by pydantic already? Then it would simply be `DataFrame.from_records([x.dict() for x in iterable_of_pydantic_models])`, or am I missing something?
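
i.e. something like this roundtrip sketch, where Record and RecordSchema are made-up examples:

    import pandas as pd
    import pandera as pa
    from pandera.typing import Series
    from pydantic import BaseModel

    class Record(BaseModel):
        name: str
        value: float

    class RecordSchema(pa.SchemaModel):
        name: Series[str]
        value: Series[float]

    records = [Record(name="a", value=1.0), Record(name="b", value=2.0)]

    # pydantic models -> validated, column-centric dataframe
    df = RecordSchema.validate(pd.DataFrame.from_records([r.dict() for r in records]))

    # validated dataframe -> row-centric pydantic models again
    roundtrip = [Record(**row) for row in df.to_dict(orient="records")]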

cosmicBboy commented 1 year ago

@ghilesmeddour let me know if you have the capacity to make a contribution here!

sebwills commented 1 year ago

My use case for DataFrameSchema.to_model() is that I like working with SchemaModel, but sometimes I want to build the schema in code. For example, I might have some pre-existing column names defined in code (that I might not control):

KEY_COLUMNS = ['A', 'B', 'C']
METADATA_COLUMNS = ['D', 'E']
VALUE_COLUMNS = ['F', 'G']

I can easily construct a DataFrameSchema by building a dictionary from these, but I'm not aware of a nice way to directly generate a SchemaModel "dynamically". Building a DataFrameSchema and then calling to_model() would be one way to achieve this; a clean API on SchemaModel for adding columns dynamically would be another.
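
For example, a sketch of the dictionary-building approach, with the requested to_model() shown as a hypothetical last step:

    import pandera as pa

    KEY_COLUMNS = ["A", "B", "C"]
    METADATA_COLUMNS = ["D", "E"]
    VALUE_COLUMNS = ["F", "G"]

    schema = pa.DataFrameSchema(
        {
            **{c: pa.Column(str) for c in KEY_COLUMNS},
            **{c: pa.Column(str, nullable=True) for c in METADATA_COLUMNS},
            **{c: pa.Column(float) for c in VALUE_COLUMNS},
        }
    )

    # Model = schema.to_model()  # hypothetical: the API this issue asks for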

evetion commented 1 year ago

+1

My use case is reading a schema from json/yaml, possibly generated externally, and using it in a Pydantic model with the field typed as DataFrame[SchemaModel].
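
For reference, the reading side already works today; only the last step is missing. A sketch assuming a schema.yml produced by DataFrameSchema.to_yaml():

    import pandera as pa

    # load an externally generated schema definition
    with open("schema.yml") as f:
        schema = pa.DataFrameSchema.from_yaml(f.read())

    # the missing piece: a model class usable as DataFrame[Model] in a pydantic field
    # Model = schema.to_model()  # hypothetical, per this issue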