Closed MarcoGorelli closed 9 months ago
Hi @MarcoGorelli, as mentioned privately, this is a very exciting shift for the data ecosystem, and thank you for showcasing how to approach it.
My take on scikit-lego is that the adjustments are fairly tiny and quite easy to implement. The general concern is regarding what should be the process if some functionality is not implemented as api-standard.
@koaning and @MBrouns what do you think?
I'm also in favor of exploring this some more. There are some bumps to discuss with other features (logging on a dataframe in Polars is different if the DF is lazy, and I'd also need to double-check our fairness tools), but this initial change seems like a fair place to start!
Thanks!
Some parts may require a slight refactor, like `TimeGapSplit`, because the dataframe API intentionally does not have an index. But it shouldn't be too bad, something like

    if isinstance(df, pd.DataFrame):
        # todo: check that `'__index__'` isn't already a column name
        # name the index, then move it into a regular column
        df = df.rename_axis('__index__').reset_index()

and then do a join based on the column `__index__`, rather than relying on pandas' default auto-aligning on the index.
I'll try taking this further then, let's see how far it can go!
> The general concern is regarding what should be the process if some functionality is not implemented as api-standard.
I'd suggest a 3-phase approach:
    if isinstance(df, pd.DataFrame):
        ...  # pandas logic
    elif isinstance(df, (pl.DataFrame, pl.LazyFrame)):
        ...  # Polars logic
The imports could be done lazily, so you'd still not need either pandas or polars as a required runtime dependency.
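One way the lazy-import idea could look is sketched below. It only consults `sys.modules`, on the assumption that a caller who passes a pandas or Polars object must already have imported that library. The helper names (`is_pandas_dataframe`, `is_polars_frame`, `backend_name`) are made up for illustration and are not part of scikit-lego.

```python
import sys


def is_pandas_dataframe(obj):
    """True if pandas is already imported and `obj` is a pandas DataFrame."""
    pd = sys.modules.get("pandas")  # never triggers an import of pandas
    return pd is not None and isinstance(obj, pd.DataFrame)


def is_polars_frame(obj):
    """True if polars is already imported and `obj` is a DataFrame or LazyFrame."""
    pl = sys.modules.get("polars")  # never triggers an import of polars
    return pl is not None and isinstance(obj, (pl.DataFrame, pl.LazyFrame))


def backend_name(df):
    """Hypothetical dispatcher: report which backend-specific branch applies."""
    if is_pandas_dataframe(df):
        return "pandas"  # pandas logic would go here
    if is_polars_frame(df):
        return "polars"  # Polars logic would go here
    raise TypeError(f"Unsupported dataframe type: {type(df).__name__}")
```

Because neither library is imported at module level, the package itself can be imported with neither installed; the branches only fire for objects whose library the user has already loaded.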
this has gone quite out-of-sync, will update soon-ish. trying to get some updates into the api design itself first
Description
Discussed with @FBruzzesi a bit on Discord, so I thought I'd try opening a PR to show what this would look like. Just starting off with the `pandas_utils` function, curious to get some initial feedback and to know if there'd be interest in this before moving on to the rest.
The idea is that, instead of hard-coding support for pandas, you could make use of the DataFrame Consortium API, which is (work in progress!) defined here: https://data-apis.org/dataframe-api/draft/index.html
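As a rough sketch of the opt-in mechanism: a function can ask whatever object it receives for its standard-compliant view through the `__dataframe_consortium_standard__` entry point, without knowing which library produced it. The helper `to_standard` and the `ToyFrame` stand-in below are hypothetical and exist only so the snippet runs without pandas or Polars installed. The `api_version` keyword is assumed from the draft standard and may change.

```python
def to_standard(df, api_version=None):
    """Return the standard-compliant view of `df` (hypothetical helper)."""
    entry_point = getattr(df, "__dataframe_consortium_standard__", None)
    if entry_point is None:
        raise TypeError(
            f"{type(df).__name__} does not expose "
            "__dataframe_consortium_standard__"
        )
    # Assumption: the entry point accepts an optional `api_version` keyword.
    return entry_point(api_version=api_version)


class ToyFrame:
    """Stand-in for a real dataframe, used only to exercise the helper."""

    def __dataframe_consortium_standard__(self, *, api_version=None):
        # A real library would return its standard-compliant wrapper here.
        return ("standard view", api_version)
```

With a real dataframe, the returned object would be the wrapper provided by the compatibility layer, and the rest of the function would only call methods defined by the standard.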
Then, any function which adheres to that API will:
Example
If you run
then it will print
, just like it does now. No change here.
However, you can now also run
and you'll get
(note how I had to add `collect` before printing the result, as the result is a Polars LazyFrame)
Dependencies
Note how in sklego/dataframe_utils.py, it's now possible to remove `import pandas as pd`. If this API were used throughout the package, then you wouldn't even need pandas as a required dependency - scikit-lego would be truly dataframe-agnostic, and would use whichever dataframe package the user passes.
All that would be needed would be the dataframe-api-compat package. Note that it's light as a feather:
`__dataframe_consortium_standard__` on)
Checklist:
If you feel your PR is ready for a review, ping @koaning or @mbrouns.