unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License
3.39k stars 310 forks

Import frictionless data table schemas and json-schemas #420

Closed e-lo closed 3 years ago

e-lo commented 3 years ago

Is your feature request related to a problem? Please describe. Many common data standards are codified in JSON files. Using the same data schema file directly, without having to translate it into YAML or into classes, reduces inconsistency and errors and greatly speeds up the ability to validate that a dataframe is, for example, a valid GTFS Trips Table.

Describe the solution you'd like.

  1. Add functions in io.py to deserialize frictionless table schemas and json-schemas
  2. Overridable default mappings between checks in pandera and json schema/frictionless

Describe alternatives you've considered

cosmicBboy commented 3 years ago

thanks for this feature request @e-lo! I've been keeping an eye on the frictionless data ecosystem and was waiting for someone to have this use case :)

Out of curiosity, how do you use frictionless tables schema, json schema, and pandera in your workflow? It might help in fleshing out the solution to this issue.

Just to riff off of your described solution, here are some initial thoughts:

MVP implementation of frictionless and json schema parser

There are abstractions in both systems that are currently not supported by pandera, for example field titles and descriptions. There's an issue for this #331, but I don't think completion of that issue should block this one.

Another abstraction is the concept of primary and foreign keys. I'm not sure yet whether pandera should support this, so we can kick this can down the road, at least for the initial MVP.

Looking at the constraints, I do believe there's a 1-1 mapping from frictionless data to pandera, so that should be fairly straightforward.
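To illustrate what that 1-1 mapping might look like, here's a minimal, dependency-free sketch. The frictionless keywords follow the Table Schema constraints vocabulary; the target names mirror pandera's built-in `Check` methods, but `map_constraints` and the mapping table itself are hypothetical names, not pandera API:

```python
# Hypothetical sketch: translate frictionless field constraints into
# pandera-style check names and arguments. Not actual pandera API.
FRICTIONLESS_TO_PANDERA = {
    "minimum": "greater_than_or_equal_to",
    "maximum": "less_than_or_equal_to",
    "enum": "isin",
    "pattern": "str_matches",
}


def map_constraints(constraints: dict) -> dict:
    """Translate a frictionless constraints dict into check-name -> arg pairs."""
    checks = {}
    for keyword, value in constraints.items():
        if keyword in ("minLength", "maxLength"):
            # a str_length-style check takes both bounds, so merge the
            # two frictionless keywords into one entry
            bounds = checks.setdefault("str_length", {})
            key = "min_value" if keyword == "minLength" else "max_value"
            bounds[key] = value
        elif keyword in FRICTIONLESS_TO_PANDERA:
            checks[FRICTIONLESS_TO_PANDERA[keyword]] = value
    return checks
```

Keywords with no counterpart (e.g. `required`, which maps to nullability rather than a check) are simply skipped here; a real implementation would handle them separately.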

So as an initial approach, we should identify the intersection of features between (i) frictionless and pandera and (ii) json-schema and pandera and implement mappings from one system to the other.

e-lo commented 3 years ago

Out of curiosity, how do you use frictionless tables schema, json schema, and pandera in your workflow?

Desired use case: when working with data whose expected structure is codified in a schema file that defines a data standard, be able to validate the data as it is read in, using a pandera decorator that directly references that [potentially external] schema file rather than hard-coding the schema.
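That decorator pattern can be sketched in plain Python, independent of pandera. Everything here is hypothetical (`check_input_from_schema` and the injected `validate` callable are stand-ins), but it shows the shape of "load an external schema once, validate on every call":

```python
import functools
import json


def check_input_from_schema(schema_path, validate):
    """Hypothetical sketch of the desired use case: load an external
    schema file once, then validate each call's input against it
    before the wrapped function runs."""
    with open(schema_path) as f:
        schema = json.load(f)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(data, *args, **kwargs):
            validate(schema, data)  # expected to raise on failure
            return func(data, *args, **kwargs)
        return wrapper

    return decorator
```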

Right now I'm not using pandera, but have been watching it to see if/when it would solve my use case because it comes packed with a bunch of features that I think would alleviate the need for a multitude of external validators that have sprung up, including the older but very sluggish ones from Frictionless themselves (they don't even use pandas!).

I'm guilty of having written one of those validators myself for validating if data is compatible with the "General Modeling Network Specification" for travel demand modeling:

The other big use case I'm thinking of right now is for GTFS (General Transit Feed Specification) as mentioned in the issue above. Ideally you shouldn't need to run a large validator (see the canonical one) to do some basic validation based on the official spec file, and useful tools for processing GTFS, like partridge, don't have any real validation other than field names and/or have the spec hard-coded in them rather than pointing to the "official" specification.

e-lo commented 3 years ago

There are abstractions in either system that are currently not supported by pandera, for example field titles and descriptions. There's an issue for this #331, but I don't think completion of that issue should block this one.

Agree. The titles and descriptions are great for documentation but not necessary for usage.

e-lo commented 3 years ago

Another abstraction is the concept of primary and foreign keys. I'm not sure yet whether pandera should support this, so we can kick this can down the road, at least for the initial MVP.

Agree that this is tricky, because it relies on the structure across multiple tables, not just a single df. The original frictionless validator didn't do this either... but if this is the only thing I have to implement myself then that's fine ;-)

MVP implementation could just be a uniqueness check if it isn't already explicitly specified as a constraint?
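The MVP idea above (primary key as a uniqueness constraint, ignoring foreign keys) can be sketched without any dataframe library. `primary_key_is_unique` is a hypothetical helper operating on rows-as-dicts, just to show that composite keys fall out naturally:

```python
def primary_key_is_unique(rows, key_fields):
    """MVP stand-in for primary-key support: treat the primary key as a
    uniqueness constraint over the named fields. A composite key is just
    a tuple of more than one field."""
    seen = set()
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key in seen:
            return False
        seen.add(key)
    return True
```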

e-lo commented 3 years ago

So as an initial approach, we should identify the intersection of features between (i) frictionless and pandera and (ii) json-schema and pandera and implement mappings from one system to the other.

I can take a hack at this if it is helpful, starting with frictionless (because I think it maps easier to dfs) - if you point me to the best list for pandera.

jeffzi commented 3 years ago

One particular case related to json schema is mapping OpenAPI data models to a pandera schema. The pandera schema that validates data received from the REST API could then be kept in sync with the API definition itself.

OpenAPI data models are based on an extended subset of JSON schema.
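Because OpenAPI object schemas reuse JSON Schema keywords (`type`, `properties`, `required`), a single parser could plausibly feed both use cases. A minimal sketch with a hypothetical helper name:

```python
def fields_from_openapi_object(schema: dict) -> dict:
    """Hypothetical sketch: pull per-field type and nullability info out
    of an OpenAPI-style object schema. Uses only the JSON Schema keyword
    subset that OpenAPI shares: type, properties, required."""
    required = set(schema.get("required", []))
    return {
        name: {
            "type": spec.get("type"),
            # fields absent from "required" are treated as nullable here;
            # OpenAPI's separate "nullable" flag would need extra handling
            "nullable": name not in required,
        }
        for name, spec in schema.get("properties", {}).items()
    }
```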

cosmicBboy commented 3 years ago

I can take a hack at this if it is helpful, starting with frictionless (because I think it maps easier to dfs) - if you point me to the best list for pandera.

Thanks @e-lo! Things are getting busy with pandera and I need to turn my attention to some other aspects of the project over the next few weeks, so your contribution would be much appreciated 🎉

I just added this issue to the 0.8.0 release milestone, I think we can tackle the json schema and OpenAPI specifications in separate issues.

A good place to start would be the contributing page to get your dev environment all setup, let me know if you hit any snags in the process.

Re: supporting frictionless, I think a nice UX would be something like:

import pandera as pa

schema = pa.from_frictionless_schema("path/to/schema.json")

@pa.check_input(schema)
def function(dataframe):
    ...
    # do stuff

For implementation, there are three modules to be aware of:

  1. schema_statistics.py: this extracts schema statistics (fields, their data types, and checks with their sufficient statistics, e.g. min and max values) from a dataframe. It also defines functions for extracting the schema specification from a pandera schema (e.g. get_dataframe_statistics). This is probably where the heavy lifting of extracting the statistics from a frictionless data schema should occur, in a function like get_frictionless_schema_statistics.
    • note that infer_dataframe_statistics, infer_series_statistics, and infer_index_statistics in this module are misnomers... it should probably be parse_* instead of infer_*.
  2. schema_inference.py: this basically wraps the functions in schema_statistics and exposes the function infer_schema to the end user.
  3. io.py: logic where serialization/deserialization of yaml and serialization to python script lives. This would define from_frictionless_schema and call schema_statistics.get_frictionless_schema_statistics to generate a DataFrameSchema.
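To make the shape of that heavy-lifting step concrete, here's a dependency-free sketch. Only the name `get_frictionless_schema_statistics` comes from the proposal above; the statistics dict layout and the type map are assumptions for illustration:

```python
# Hypothetical sketch of the proposed get_frictionless_schema_statistics:
# turn a frictionless table schema (as a dict) into per-column statistics
# that io.py could hand off when building a DataFrameSchema.
TYPE_MAP = {
    "integer": "int64",
    "number": "float64",
    "string": "str",
    "boolean": "bool",
}


def get_frictionless_schema_statistics(frictionless_schema: dict) -> dict:
    statistics = {}
    for field in frictionless_schema.get("fields", []):
        constraints = field.get("constraints", {})
        statistics[field["name"]] = {
            "dtype": TYPE_MAP.get(field.get("type"), "object"),
            # frictionless "required" maps to nullability, not a check
            "nullable": not constraints.get("required", False),
            "checks": {
                k: v for k, v in constraints.items() if k != "required"
            },
        }
    return statistics
```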

We might want a to_frictionless_schema in the future, but we can save that for later :)

Let me know if you have any questions!

TColl commented 3 years ago

@e-lo and @cosmicBboy - I've had a quick go at building out frictionless compatibility in the PR above - I'd appreciate any feedback if/when you have a minute!

cosmicBboy commented 3 years ago

thanks @TColl! let's go ahead and merge this into the release/0.7.0 branch so we can make it available to users sooner

cosmicBboy commented 3 years ago

fixed by #454