koaning / scikit-lego

Extra blocks for scikit-learn pipelines.
https://koaning.github.io/scikit-lego/
MIT License

RFC Use dataframe api to also support Polars and other dataframe libraries #597

Closed MarcoGorelli closed 9 months ago

MarcoGorelli commented 1 year ago

Description

Discussed with @FBruzzesi a bit on Discord, so I thought I'd try opening a PR to show what this would look like. Just starting off with the pandas_utils function, curious to get some initial feedback and to know if there'd be interest in this before moving on to the rest.

The idea is that, instead of hard-coding support for pandas, you could make use of the DataFrame Consortium API, which is (work in progress!) defined here: https://data-apis.org/dataframe-api/draft/index.html

Then, any function which adheres to that API will:

Example

If you run

import pandas as pd

from sklego.pandas_utils import add_lags

df = pd.DataFrame({"X1": [0, 1, 2], "X2": [float('nan'), "178", "154"]})
print(add_lags(df, "X1", -1))

then it will print

   X1   X2  X1-1
0   1  178   0.0
1   2  154   1.0

, just like it does now. No change here.

However, you can now also run

import polars as pl

from sklego.pandas_utils import add_lags

df = pl.DataFrame({"X1": [0, 1, 2], "X2": [float('nan'), "178", "154"]})
result = add_lags(df, "X1", -1)
print(result.collect())

and you'll get

shape: (2, 3)
┌─────┬─────┬──────┐
│ X1  ┆ X2  ┆ X1-1 │
│ --- ┆ --- ┆ ---  │
│ i64 ┆ str ┆ i64  │
╞═════╪═════╪══════╡
│ 1   ┆ 178 ┆ 0    │
│ 2   ┆ 154 ┆ 1    │
└─────┴─────┴──────┘

(note how I had to add collect before printing the result, as the result is a Polars LazyFrame)
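
If calling code should not care whether it got an eager or a lazy result, one option is a small duck-typed helper. This is only a sketch: `materialize` is a hypothetical name, not part of scikit-lego or Polars, and it assumes that only lazy frames expose a callable `collect` method.

```python
def materialize(frame):
    """Return an eager frame (hypothetical helper, not a scikit-lego API).

    Assumption: lazy frames expose a callable `.collect()`, eager ones do not.
    """
    collect = getattr(frame, "collect", None)
    # Only call `.collect()` when it really is a method; a pandas column that
    # happens to be named "collect" is not callable and is left alone.
    return collect() if callable(collect) else frame
```

With this, `print(materialize(add_lags(df, "X1", -1)))` would read the same for a pandas input and a Polars input.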

Dependencies

Note how in sklego/dataframe_utils.py, it's now possible to remove import pandas as pd. If this API were used throughout the package, then you wouldn't even need pandas as a required dependency - scikit-lego would be truly dataframe-agnostic, and would use whichever dataframe package the user passes.

All that would be needed would be the dataframe-api-compat package. Note that it's light as a feather:

Checklist:

If you feel your PR is ready for a review, ping @koaning or @mbrouns.

FBruzzesi commented 1 year ago

Hi @MarcoGorelli, as mentioned privately, this is a very exciting shift for the data ecosystem, and thank you for showcasing how to approach it.

My take on scikit-lego is that the adjustments are fairly tiny and quite easy to implement. The general concern is regarding what should be the process if some functionality is not implemented as api-standard.

@koaning and @MBrouns what do you think?

koaning commented 1 year ago

I'm also in favor of exploring this some more. There are some bumps to discuss with other features (logging on a dataframe in Polars is different if the DF is lazy, and I'd also need to double-check our fairness tools), but this initial change seems like a fair place to start!

MarcoGorelli commented 1 year ago

Thanks!

Some parts may require a slight refactor, like TimeGapSplit, because the dataframe api intentionally does not have an index. But it shouldn't be too bad, something like

if isinstance(df, pd.DataFrame):
    # todo: check that `'__index__'` isn't already a column name
    # name the index, then move it into a regular column (and keep the result)
    df = df.rename_axis('__index__').reset_index()

and then do a join based on the column __index__, rather than relying on pandas' default auto-aligning on the index
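
To make the idea concrete, here is a minimal pandas-only sketch of joining on an explicit `__index__` column instead of relying on index alignment. The helper name `index_to_column` and the toy frames are made up for illustration, and the sketch assumes a single-level index.

```python
import pandas as pd


def index_to_column(df: pd.DataFrame, name: str = "__index__") -> pd.DataFrame:
    """Move a single-level index into a regular column called `name`.

    Hypothetical helper for illustration; not part of scikit-lego.
    """
    if name in df.columns:
        raise ValueError(f"column {name!r} already exists")
    # Name the index first so reset_index() creates a column with that name.
    return df.rename_axis(name).reset_index()


left = pd.DataFrame({"a": [1, 2, 3]}, index=[10, 20, 30])
right = pd.DataFrame({"b": [4.0, 5.0]}, index=[20, 30])

# Explicit join on the former index; no implicit alignment involved.
joined = index_to_column(left).merge(
    index_to_column(right), on="__index__", how="inner"
)
print(joined)
```

Only the rows whose former index values appear in both frames (20 and 30 here) survive the inner join.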

I'll try taking this further then, let's see how far it can go!

> The general concern is regarding what should be the process if some functionality is not implemented as api-standard.

I'd suggest a 3-phase approach:

  1. implement parallel logic, like
    if isinstance(df, pd.DataFrame):
        # pandas logic
    elif isinstance(df, (pl.DataFrame, pl.LazyFrame)):
        # Polars logic

    The imports could be done lazily so you'd still not need either pandas or polars as a required runtime dependency

  2. open an issue at https://github.com/data-apis/dataframe-api-compat - we can add it there and make a new release quickly
  3. open an issue at https://github.com/data-apis/dataframe-api and try to make it part of the standard - this would matter if you wanted to eventually support other dataframes than "just" pandas and Polars
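
One way the lazy check in step 1 could look is sketched below. It is not taken from the PR: `backend_name` is a hypothetical helper, and the approach relies on the assumption that a library must already be present in `sys.modules` if the caller managed to construct one of its dataframes.

```python
import sys


def backend_name(df) -> str:
    """Return "pandas" or "polars" for a supported frame, else raise TypeError.

    Hypothetical helper; neither library is imported by this module.
    """
    # If the caller built a pandas object, pandas is already in sys.modules,
    # so there is no need to import it here.
    pd = sys.modules.get("pandas")
    if pd is not None and isinstance(df, pd.DataFrame):
        return "pandas"

    pl = sys.modules.get("polars")
    if pl is not None and isinstance(df, (pl.DataFrame, pl.LazyFrame)):
        return "polars"

    raise TypeError(f"Unsupported dataframe type: {type(df).__name__}")
```

The library-specific branches can then dispatch on the returned name, and a user who only has one of the two libraries installed never triggers an import of the other.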

MarcoGorelli commented 9 months ago

this has gone quite out-of-sync, will update soon-ish. trying to get some updates into the api design itself first