MarcoGorelli commented 1 year ago

Description

Discussed with @FBruzzesi a bit on Discord, so I thought I'd try opening a PR to show what this would look like. Just starting off with the pandas_utils function, curious to get some initial feedback and to know if there'd be interest in this before moving on to the rest

The idea is that, instead of hard-coding support for pandas, you could make use of the DataFrame Consortium API, which is (wip in progress!) defined here: https://data-apis.org/dataframe-api/draft/index.html

Then, any function which adheres to that API will:

keep support for pandas (so, no regression in functionality)
also gain support for Polars, and any other dataframe library which implements the API (devs from cuDF and Modin also want to implement it)

Example

If you run

import pandas as pd
import polars as pl

from sklego.pandas_utils import add_lags

df = pd.DataFrame({"X1": [0, 1, 2], "X2": [float('nan'), "178", "154"]})
print(add_lags(df, "X1", -1))

then it will print

   X1   X2  X1-1
0   1  178   0.0
1   2  154   1.0

, just like it does now. No change here.

However, you can now also run

import polars as pl

from sklego.pandas_utils import add_lags

df = pl.DataFrame({"X1": [0, 1, 2], "X2": [float('nan'), "178", "154"]})
result = add_lags(df, "X1", -1)
print(result.collect())

and you'll get

shape: (2, 3)
┌─────┬─────┬──────┐
│ X1  ┆ X2  ┆ X1-1 │
│ --- ┆ --- ┆ ---  │
│ i64 ┆ str ┆ i64  │
╞═════╪═════╪══════╡
│ 1   ┆ 178 ┆ 0    │
│ 2   ┆ 154 ┆ 1    │
└─────┴─────┴──────┘

(note how I had to add collect before printing the result, as the result is a Polars LazyFrame)

Dependencies

Note how in sklego/dataframe_utils.py, it's now possible to remove import pandas as pd. If this API were used throughout the package, then you wouldn't even need pandas as a required dependency - scikit-lego would be truly dataframe-agnostic, and would use whichever dataframe package the user passes.

All that would be needed would the dataframe-api-compat package. Note that it's light as a feather:

it's pure Python, with no dependencies (not even pandas / Polars, it'll just defer to whatever the user calls __dataframe_consortium_standard__ on)
the wheel is literally only about 20-30 kilobytes https://pypi.org/project/dataframe-api-compat/0.1.28/#files
it would be easy to vendor if you didn't want to take on an extra dependency

Checklist:

[ ] My code follows the style guidelines (flake8)
[ ] I have commented my code, particularly in hard-to-understand areas
[ ] I have made corresponding changes to the documentation (also to the readme.md)
[ ] I have added tests that prove my fix is effective or that my feature works
[ ] I have added tests to check whether the new feature adheres to the sklearn convention
[ ] New and existing unit tests pass locally with my changes

If you feel your PR is ready for a review, ping @koaning or @mbrouns.

FBruzzesi commented 1 year ago

Hi @MarcoGorelli, as mentioned privately, this is a very exciting shift for the data ecosystem, and thank you for showcasing how to approach it.

My take on scikit-lego is that the adjustments are fairly tiny and quite easy to implement. The general concern is regarding what should be the process if some functionality is not implemented as api-standard.

@koaning and @MBrouns what do you think?

koaning commented 1 year ago

I'm also in favor of exploring this some more. While there are some dumps to discuss with other features (logging on a dataframe in Polars is different if the DF is lazy, I'd also need to double-check our fairness tools) ... but this initial change seems like a fair place to start!

MarcoGorelli commented 1 year ago

Thanks!

Some parts may require a slight refactor, like TimeGapSplit, because the dataframe api intentionally does not have an index. But it shouldn't be too bad, something like

if isinstance(df, pd.DataFrame):
    # todo: check that `'__index__'` isn't already a column name
    df.reset_index().rename(columns={df.index: '__index__'})

and then do a join based on the column __index__, rather than relying on pandas' default auto-aligning on the index

I'll try taking this further then, let's see how far it can go!

The general concern is regarding what should be the process if some functionality is not implemented as api-standard.

I'd suggest a 3-phase approach:

implement parallel logic, like
```
if isinstance(df, pd.DataFrame):
    # pandas logic
elif isinstance(df, (pl.DataFrame, pl.LazyFrame)):
    # Polars logic
```
The imports could be done lazily so you'd still not need either pandas or polars as required runtime dependency
open an issue at https://github.com/data-apis/dataframe-api-compat - we can add it there and make a new release quickly
open an issue at https://github.com/data-apis/dataframe-api and try to make it part of the standard - this would matter if you wanted to eventually support other dataframes than "just" pandas and Polars

MarcoGorelli commented 9 months ago

this has gone quite out-of-sync, will update soon-ish. trying to get some updates into the api design itself first

koaning / scikit-lego

RFC Use dataframe api to also support Polars and other dataframe libraries #597

Description

Example

Dependencies

Checklist: