DavidSlayback opened 4 months ago
I've thought about this a lot, and I think we're getting closer to this world. However, my main concern is that this generic dataframe schema will have to include a superset of all the options for all of the dataframes. I think eventually we'll nail down a "common dataframe schema API to rule them all," in which case this concern is less of an issue.
We recently introduced a generic dataframe API: https://github.com/unionai-oss/pandera/tree/main/pandera/api/dataframe, which is where this dispatching might happen. Currently the pandas and polars schemas inherit from these classes (pyspark still needs to be done).
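One way the dispatching could live in those generic base classes is a registry mapping dataframe types to backend schema subclasses. This is only a sketch under assumed names (`BaseDataFrameSchema`, `register_backend`, the `Fake*DF` stand-ins) — none of this is pandera's actual internals:

```python
# Hypothetical sketch, not pandera's real internals: a shared base class
# keeps a registry mapping dataframe types to backend schema subclasses.
class BaseDataFrameSchema:
    _backends: dict = {}  # dataframe type -> schema subclass

    @classmethod
    def register_backend(cls, df_type):
        """Decorator: associate a schema subclass with a dataframe type."""
        def decorator(schema_cls):
            BaseDataFrameSchema._backends[df_type] = schema_cls
            return schema_cls
        return decorator

    @classmethod
    def dispatch(cls, df):
        """Pick the schema subclass matching the dataframe's type."""
        for df_type, schema_cls in cls._backends.items():
            if isinstance(df, df_type):
                return schema_cls
        raise TypeError(f"no pandera backend registered for {type(df).__name__}")


# Stand-ins for pandas.DataFrame and polars.DataFrame.
class FakePandasDF: ...
class FakePolarsDF: ...


@BaseDataFrameSchema.register_backend(FakePandasDF)
class PandasSchema(BaseDataFrameSchema): ...


@BaseDataFrameSchema.register_backend(FakePolarsDF)
class PolarsSchema(BaseDataFrameSchema): ...
```

In the real library, the pandas and polars schema classes that already inherit from the generic API would be the registered subclasses, keyed on the actual `pandas.DataFrame` and `polars.DataFrame` types.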
If folks engage with this issue (👍 or comment/discuss) we can prioritize this effort, but in the meantime, @DavidSlayback, if you can write down a spec for how this would all work, perhaps with a code snippet sketch of how dispatching would work, that would get the ball rolling.
Sure, I'll try to sketch something up later this week when I'm free!
**Is your feature request related to a problem? Please describe.**
It's a small issue, but in a repo that is attempting to transition from Pandas to Polars over time, there is a mix of possible Pandas and Polars dataframes with the same basic schema. Currently, it seems like I need to define two schemas for each: one for Pandas using `pa.DataFrameModel`, and one for Polars using `pa.polars.DataFrameModel`.

**Describe the solution you'd like**
Ideally, the top-level `pa.DataFrameModel` and `pa.DataFrameSchema` classes would use something like `@singledispatch` to delegate to the appropriate backend version based on the input dataframe. This is similar to an Ibis Table, where it's rare that you actually need to go into a specific backend to request a specific function.

**Describe alternatives you've considered**
What I'm currently doing is just being more verbose and defining multiple schemas. It works fine! It just seems a bit strange as a workflow. Obviously, if we were always in Polars it wouldn't be an issue, but that'll take a while.