thanks for this feature request @e-lo! Been keeping an eye on the frictionless data ecosystem and was waiting for someone to have this use case :)
Out of curiosity, how do you use frictionless tables schema, json schema, and pandera in your workflow? It might help in fleshing out the solution to this issue.
Just to riff off of your described solution, here are some initial thoughts:
There are abstractions in either system that are currently not supported by pandera, for example field titles and descriptions. There's an issue for this #331, but I don't think completion of that issue should block this one.
Another abstraction is the concept of primary and foreign keys. I'm not sure yet whether pandera should support this, so we can kick this can down the road, at least for the initial MVP.
Looking at the constraints, I do believe there's a 1-1 mapping from frictionless data to pandera, so that should be fairly straightforward.
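For example, here is a rough sketch of that mapping for a single made-up field (the field name, dtype, and constraint values are invented for illustration; this is a hand-written translation, not necessarily how the eventual implementation would look):

```python
import pandera as pa

# Hypothetical frictionless field definition:
#   {"name": "age", "type": "integer",
#    "constraints": {"required": True, "minimum": 0, "maximum": 120}}
# One possible hand-written pandera equivalent:
schema = pa.DataFrameSchema(
    {
        "age": pa.Column(
            int,
            checks=[
                pa.Check.greater_than_or_equal_to(0),  # frictionless "minimum"
                pa.Check.less_than_or_equal_to(120),   # frictionless "maximum"
            ],
            nullable=False,  # frictionless "required": True
        )
    }
)
```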
So as an initial approach, we should identify the intersection of features between (i) frictionless and pandera and (ii) json-schema and pandera and implement mappings from one system to the other.
> Out of curiosity, how do you use frictionless tables schema, json schema, and pandera in your workflow?
Desired use case: when working with data whose expected structure is codified in a schema file that defines a data standard, be able to validate the data that is read in using a pandera decorator that directly references that [potentially external] schema file rather than hard-coding the schema.
Right now I'm not using pandera, but have been watching it to see if/when it would solve my use case because it comes packed with a bunch of features that I think would alleviate the need for a multitude of external validators that have sprung up, including the older but very sluggish ones from Frictionless themselves (they don't even use pandas!).
I'm guilty of having written one of those validators myself, for validating whether data is compatible with the "General Modeling Network Specification" for travel demand modeling:
The other big use case I'm thinking of right now is for GTFS (General Transit Feed Specification) as mentioned in the issue above. Ideally you shouldn't need to run a large validator (see the canonical one) to do some basic validation based on the official spec file, and useful tools for processing GTFS, like partridge, don't have any real validation other than field names and/or have the spec hard-coded in them rather than pointing to the "official" specification.
> There are abstractions in either system that are currently not supported by pandera, for example field titles and descriptions. There's an issue for this #331, but I don't think completion of that issue should block this one.
Agree. The titles and descriptions are great for documentation but not necessary for usage.
> Another abstraction is the concept of primary and foreign keys. I'm not sure yet whether pandera should support this, so we can kick this can down the road, at least for the initial MVP.
Agree that this is tricky, because it relies on the structure across multiple tables - not just a single df. The original frictionless validator didn't do this either... but if this is the only thing I have to implement myself, then that's fine ;-)
MVP implementation could just be a uniqueness check if it isn't already explicitly specified as a constraint?
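For example, a rough sketch of that MVP idea (the column name is hypothetical, and this emulates a frictionless primaryKey with a plain vectorized check rather than any dedicated pandera feature):

```python
import pandera as pa

# Treat a frictionless "primaryKey" field as a non-null + uniqueness
# requirement on a single column; "trip_id" is a hypothetical column name.
trips_schema = pa.DataFrameSchema(
    {
        "trip_id": pa.Column(
            str,
            nullable=False,
            checks=pa.Check(
                lambda s: ~s.duplicated(),  # vectorized check over the whole column
                error="trip_id must be unique",
            ),
        )
    }
)
```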
> So as an initial approach, we should identify the intersection of features between (i) frictionless and pandera and (ii) json-schema and pandera and implement mappings from one system to the other.
I can take a hack at this if it is helpful, starting with frictionless (because I think it maps more easily to dfs) - if you point me to the best list for pandera.
One particular case related to json schema is mapping OpenAPI data models to a pandera schema. The pandera schema that validates data received from the REST API could be synced with the API definition itself.
OpenAPI data models are based on an extended subset of JSON schema.
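As a rough sketch of what that sync could look like (the property names and constraints are invented, and the translation is done by hand since no converter exists yet):

```python
import pandera as pa

# Hypothetical JSON-schema / OpenAPI-style property definitions:
#   "user_id": {"type": "integer", "minimum": 1}
#   "email":   {"type": "string", "pattern": ".+@.+\\..+"}
# Hand-written pandera equivalent of the same constraints:
user_schema = pa.DataFrameSchema(
    {
        "user_id": pa.Column(int, checks=pa.Check.ge(1)),
        "email": pa.Column(str, checks=pa.Check.str_matches(r".+@.+\..+")),
    }
)
```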
> I can take a hack at this if it is helpful, starting with frictionless (because I think it maps more easily to dfs) - if you point me to the best list for pandera.
Thanks @e-lo! Things are getting busy with pandera and I need to turn my attention to some other aspects of the project over the next few weeks, so your contribution would be much appreciated 🎉
I just added this issue to the 0.8.0 release milestone; I think we can tackle the json schema and OpenAPI specifications in separate issues.
A good place to start would be the contributing page to get your dev environment all set up. Let me know if you hit any snags in the process.
Re: supporting frictionless, I think a nice UX would be something like:
```python
import pandera as pa

schema = pa.from_frictionless_schema("path/to/schema.json")

@pa.check_input(schema)
def function(dataframe):
    # do stuff
    ...
```
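For reference, a sketch of what path/to/schema.json might contain (a minimal frictionless Table Schema; the field names and constraints are invented for illustration), written here as a Python dict:

```python
# Hypothetical contents of "path/to/schema.json" (frictionless Table Schema):
frictionless_schema = {
    "fields": [
        {"name": "trip_id", "type": "string", "constraints": {"required": True, "unique": True}},
        {"name": "route_id", "type": "string", "constraints": {"required": True}},
        {"name": "direction_id", "type": "integer", "constraints": {"enum": [0, 1]}},
    ],
    "primaryKey": "trip_id",
}
```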
For implementation, there are three modules to be aware of:
- schema_statistics: this is where a parsing function like get_frictionless_schema_statistics would live. Note that infer_dataframe_statistics, infer_series_statistics, and infer_index_statistics in this module are misnomers... it should probably be parse_* instead of infer_*.
- schema_inference: depends on schema_statistics and exposes the function infer_schema to the end user.
- io: this would implement from_frictionless_schema and call schema_statistics.get_frictionless_schema_statistics to generate a DataFrameSchema.

We might want a to_frictionless_schema in the future, but we can save that for later :)
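To make the plan above concrete, here is a very rough sketch of how those pieces could fit together (the parsing logic and the intermediate "statistics" format are hypothetical, not pandera's actual internals):

```python
# Very rough sketch (not the eventual pandera implementation) of how the
# proposed pieces could fit together. Function names mirror the proposal
# above; the parsing logic and the "statistics" format are hypothetical.
import json

import pandera as pa

_FRICTIONLESS_TO_DTYPE = {"integer": int, "number": float, "string": str, "boolean": bool}


def get_frictionless_schema_statistics(frictionless_schema: dict) -> dict:
    """Parse a frictionless table schema dict into per-column statistics."""
    statistics = {}
    for field in frictionless_schema.get("fields", []):
        constraints = field.get("constraints", {})
        checks = []
        if "minimum" in constraints:
            checks.append(pa.Check.ge(constraints["minimum"]))
        if "maximum" in constraints:
            checks.append(pa.Check.le(constraints["maximum"]))
        if "enum" in constraints:
            checks.append(pa.Check.isin(constraints["enum"]))
        statistics[field["name"]] = {
            "dtype": _FRICTIONLESS_TO_DTYPE.get(field.get("type"), object),
            "nullable": not constraints.get("required", False),
            "checks": checks,
        }
    return statistics


def from_frictionless_schema(path: str) -> pa.DataFrameSchema:
    """Build a DataFrameSchema from a frictionless table schema json file."""
    with open(path) as f:
        frictionless_schema = json.load(f)
    statistics = get_frictionless_schema_statistics(frictionless_schema)
    return pa.DataFrameSchema(
        {
            name: pa.Column(stats["dtype"], checks=stats["checks"], nullable=stats["nullable"])
            for name, stats in statistics.items()
        }
    )
```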
Let me know if you have any questions!
@e-lo and @cosmicBboy - I've had a quick go at building out frictionless compatibility in PR above - I'd appreciate any feedback if/when you have a minute!
thanks @TColl! let's go ahead and merge this into the release/0.7.0 branch so we can make it available to users sooner
fixed by #454
Is your feature request related to a problem? Please describe.
Many common data standards are codified in json files. Using the same data schema file without having to translate it to yaml or into classes itself reduces inconsistency and errors, and greatly speeds up the ability to validate that a dataframe is, for example, a valid GTFS Trips Table.

Describe the solution you'd like
Extend io.py to deserialize:

Describe alternatives you've considered