Closed NickCrews closed 2 years ago
hi @NickCrews, thanks for validating my own thoughts on this!
See this discussion for more context.
See the Declarative and the Imperative style sections in this documentation. Perhaps we can use their implementation as inspiration?
Looks like the link is dead.
At a high level I totally agree, although the direction I'm thinking about is going for a more abstract way of representing data containers and their components. This is because pandera is slated to support not just dataframes, but other objects, which for now I'll call "scientific and analytics data containers", i.e. data structures for data scientists/engineers/analysts, ML engineers (e.g. dataframes, xarray.Dataset, tensors, ndarrays, etc.) that have some statistical capabilities associated with them.
Currently I feel the most appropriate abstraction is "Container" and "Fields", so for example,
Currently, using the imperative DataFrameSchema API results in a different type of "schema" object as compared to if you create that same schema using the declarative class-based Schema Model API
Just so I understand, why is having different types for each kind of API a problem? I get the point of a similar construction syntax being good for UX, but for this point:
This is a bit confusing once you have constructed your schema, because the two types have slightly different APIs
Are you referring to how you construct the schema, or how you modify that schema? Seems like it would be hard to get away from this, as the two APIs serve two different purposes/programming preferences... the DataFrameSchema is better for inline validation while SchemaModel is better for people who care about python type annotations.
Thanks @cosmicBboy! Fixed my link, and replied in the discussion that you linked.
Also, the abstraction of container and field makes sense.
I would add (spitballing totally from the armchair, feel free to totally ignore) that I might split up the concerns a bit:
Thanks for your work on this!
Currently, using the imperative DataFrameSchema API results in a different type of "schema" object as compared to if you create that same schema using the declarative class-based Schema Model API. This is a bit confusing once you have constructed your schema, because the two types have slightly different APIs. Also, you construct them slightly differently, using pa.Field for the declarative model and pa.Column for the imperative model.
Describe the solution you'd like
Ideally these two construction APIs would result in the same type of schema. Also, ideally you would construct them more consistently (e.g. using pa.Column in both).
It looks to me (though I'm not super familiar with it) that sqlalchemy has solved this problem already. See the Declarative and the Imperative style sections in this documentation. Perhaps we can use their implementation as inspiration?
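To make the idea concrete, here is a minimal sketch in plain Python (not pandera's or sqlalchemy's actual implementation) of how a declarative class could compile down to the same schema type that the imperative constructor produces. Every name in it (Column, Schema, DeclarativeSchema, non_negative) is hypothetical and only there for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Column:
    """Hypothetical column spec, shared by both construction styles."""
    dtype: type
    checks: Tuple[Callable[[Any], bool], ...] = ()


@dataclass
class Schema:
    """Hypothetical single schema type that both styles end up producing."""
    columns: Dict[str, Column]

    def validate(self, rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        # Rows are plain dicts here to keep the sketch dependency-free.
        for i, row in enumerate(rows):
            for name, col in self.columns.items():
                if name not in row:
                    raise ValueError(f"row {i}: missing column {name!r}")
                value = row[name]
                if not isinstance(value, col.dtype):
                    raise TypeError(f"row {i}: {name!r} is not {col.dtype.__name__}")
                for check in col.checks:
                    if not check(value):
                        raise ValueError(f"row {i}: {name!r} failed {check.__name__}")
        return rows


class DeclarativeSchema:
    """Hypothetical base class: Column attributes are collected into a Schema."""
    schema: Schema

    def __init_subclass__(cls, **kwargs: Any) -> None:
        super().__init_subclass__(**kwargs)
        # Gather the Column attributes declared directly on the subclass body.
        columns = {
            name: value
            for name, value in vars(cls).items()
            if isinstance(value, Column)
        }
        cls.schema = Schema(columns)


def non_negative(value: float) -> bool:
    return value >= 0


# Imperative style: build the Schema object directly.
imperative = Schema({
    "price": Column(float, (non_negative,)),
    "name": Column(str),
})


# Declarative style: same Column objects, written as class attributes.
class Product(DeclarativeSchema):
    price = Column(float, (non_negative,))
    name = Column(str)


declarative = Product.schema
```

In this sketch both routes use the same Column constructor and both end in a Schema instance, so anything you can do with one (validate, inspect columns) you can do with the other. Whether that maps cleanly onto type-annotation-driven models is a separate question that the sketch does not try to answer.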
Additional context
I realize this is very high-level and not very actionable at this point, but I just wanted to get this idea on your radar. As pandera matures and more features are added, this will get harder to change. It might already be at the stage where this sort of change would be too disruptive.
EDIT: Fixed link to sqlalchemy docs.