unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Make declarative and imperative schema declaration result in the same type #839

Closed: NickCrews closed this 2 years ago

NickCrews commented 2 years ago

Thanks for your work on this!

Currently, using the imperative DataFrameSchema API results in a different type of "schema" object than creating that same schema with the declarative, class-based SchemaModel API. This is a bit confusing once you have constructed your schema, because the two types have slightly different APIs. You also construct them slightly differently, using pa.Field for the declarative model and pa.Column for the imperative model.
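For concreteness, the two styles currently look roughly like this (a sketch based on pandera's documented API; the specific columns and checks are just for illustration):

```python
import pandera as pa
from pandera.typing import Series

# Imperative: build a DataFrameSchema out of pa.Column objects.
schema = pa.DataFrameSchema({
    "user_id": pa.Column(int, pa.Check.ge(0)),
    "name": pa.Column(str, nullable=True),
})

# Declarative: subclass SchemaModel and declare pa.Field attributes.
class UserSchema(pa.SchemaModel):
    user_id: Series[int] = pa.Field(ge=0)
    name: Series[str] = pa.Field(nullable=True)

# Both describe the same validation rules, but `schema` is a DataFrameSchema
# instance while `UserSchema` is a SchemaModel subclass, so the resulting
# objects expose different APIs.
```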

Describe the solution you'd like

Ideally these two construction APIs would result in the same type of schema object. Also, ideally you would construct them more consistently (e.g. using pa.Column in both).

It looks to me (though I'm not super familiar with it) that sqlalchemy has solved this problem already. See the Declarative and the Imperative style sections in this documentation. Perhaps we can use their implementation as inspiration?
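Roughly, the SQLAlchemy pattern looks like the sketch below (my reading of the declarative vs. imperative mapping styles in their ORM docs; in both cases you use Column and end up with the same kind of mapped class):

```python
from sqlalchemy import Column, Integer, MetaData, String, Table
from sqlalchemy.orm import declarative_base, registry

# Declarative style: the table definition lives on the mapped class itself.
Base = declarative_base()

class User(Base):
    __tablename__ = "user"
    id = Column(Integer, primary_key=True)
    name = Column(String)

# Imperative style: define a Table separately, then map a plain class to it.
mapper_registry = registry()
metadata = MetaData()

user_table = Table(
    "user_account",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String),
)

class UserAccount:
    pass

mapper_registry.map_imperatively(UserAccount, user_table)
```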

Additional context

I realize this is very high-level and not very actionable at this point, but I just wanted to get this idea on your radar. As pandera matures and more features are added, this will get harder to change; it might already be at the stage where this sort of change would be too disruptive.

EDIT: Fixed link to sqlalchemy docs.

cosmicBboy commented 2 years ago

hi @NickCrews, thanks for validating my own thoughts on this!

See this discussion for more context.

> See the Declarative and the Imperative style sections in this documentation. Perhaps we can use their implementation as inspiration?

Looks like the link is dead.

At a high level I totally agree, although the direction I'm thinking about is a more abstract way of representing data containers and their components. This is because pandera is slated to support not just dataframes but other objects, which for now I'll call "scientific and analytics data containers", i.e. data structures for data scientists/engineers/analysts and ML engineers (e.g. dataframes, xarray.Dataset, tensors, ndarrays, etc.) that have some statistical capabilities associated with them.

Currently I feel the most appropriate abstractions are "Container" and "Field". For example:
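Something along these lines, purely to illustrate the idea (the Container/Field classes below are hypothetical stand-ins, not pandera API):

```python
from dataclasses import dataclass, field

# Hypothetical sketch, not pandera API: a generic Container made up of
# Fields, independent of whether the underlying data is a dataframe,
# an xarray.Dataset, an ndarray, etc.
@dataclass
class Field:
    dtype: type
    nullable: bool = False
    checks: list = field(default_factory=list)

@dataclass
class Container:
    fields: dict  # component name -> Field

user_data = Container(fields={
    "user_id": Field(int, checks=["ge(0)"]),
    "name": Field(str, nullable=True),
})
```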

> Currently, using the imperative DataFrameSchema API results in a different type of "schema" object than creating that same schema with the declarative, class-based SchemaModel API

Just so I understand, why is having different types for each kind of API a problem? I get the point of a similar construction syntax being good for UX, but for this point:

> This is a bit confusing once you have constructed your schema, because the two types have slightly different APIs

Are you referring to how you construct the schema, or how you modify it? It seems like it would be hard to get away from this, as the two APIs serve two different purposes/programming preferences... the DataFrameSchema is better for inline validation, while SchemaModel is better for people who care about python type annotations.
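To make the modification point concrete, here's a small sketch using pandera's existing API (DataFrameSchema.add_columns, SchemaModel inheritance, and SchemaModel.to_schema are all available today):

```python
import pandera as pa
from pandera.typing import Series

# Imperative: modify by calling methods that return a new DataFrameSchema.
base_schema = pa.DataFrameSchema({"a": pa.Column(int)})
extended_schema = base_schema.add_columns({"b": pa.Column(str)})

# Declarative: modify through class inheritance.
class Base(pa.SchemaModel):
    a: Series[int]

class Extended(Base):
    b: Series[str]

# The declarative model can already be converted to the imperative type,
# which is one existing bridge between the two worlds.
equivalent_schema = Extended.to_schema()  # returns a DataFrameSchema
```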

NickCrews commented 2 years ago

Thanks @cosmicBboy! Fixed my link, and replied in the discussion that you linked.

NickCrews commented 2 years ago

Also, the abstraction of container and field makes sense.

I would add (spitballing totally from the armchair, feel free to totally ignore) that I might split up the concerns a bit: