koaning / scikit-lego

Extra blocks for scikit-learn pipelines.
https://koaning.github.io/scikit-lego/
MIT License
1.28k stars 117 forks source link

[FEATURE] Narwhals migration for dataframe-agnostic codebase #658

Closed FBruzzesi closed 5 months ago

FBruzzesi commented 6 months ago

Description

Creating this issue to keep track of which classes/function could benefit from adopting Narwhals.

Class/Function/Module Status Related Notes
preprocessing.ColumnDropper โœ… Solved in #655
preprocessing.ColumnSelector โœ… Solved in #659
preprocessing.PandasTypeSelector โœ… Solved in #670 Consider creating another class TypeSelector and deprecate this one
common.TrainOnlyTransformerMixin ๐Ÿ”ฒ Uses specific pandas hashing functionality. I wonder how crucial is to hash the index as well. If it's not we could just use .to_numpy() and hash the array data?
model_selection.TimeGapSplit โœ… Solved in #668
model_selection.GroupedTimeSeriesSplit ๐Ÿ”ฒ Related to #605
projections.InformationFilter โœ… Solved in commit
meta.RegressionOutlierDetector โœ… Solved in #665
meta.hierarchical_predictor.py โœ… Solved in #667
meta.grouped_transformer.py โœ… Solved in #667
meta.grouped_predictor.py โœ… Solved in #667
linear_models._FairClassifier โœ… Solved in #669
pandas_utils.py ๐Ÿšง Partially #661 (*)
datasets.py ๐Ÿšซ It would require read_csv function

Personally I would wait to have at least preprocessing.pandastransformers.py entire migration before bumping to v0.9.0.

cc @MarcoGorelli @anopsy

(*) Regarding pandas_utils

Changing log_step to narwhals is fairly easy (~4 lines of code), however as this decorator is supposed to work for any function that operates on pandas, doing so would limit its functionality. It could be reasonable to add another one which although restricted to narwhals methods, it can interoperate with all its compatible dataframes.

Legend

โœ… Done ๐Ÿšง WIP ๐Ÿ”ฒ Not Started ๐Ÿšซ Won't do

anopsy commented 6 months ago

I'm on it

MarcoGorelli commented 6 months ago

Thanks for making this tracker!

I think grouping shouldn't be an issue, there's an example of that here: https://narwhals-dev.github.io/narwhals/basics/dataframe/#example-2-group-by-and-mean

If I remember correctly, TimeGapSplit relies on index alignment, so that one may need a bit more discussion

DeaMariaLeon commented 6 months ago

Before this (nice) tracker was opened, I was taking a look at using Narwhals on _FairClassifier per Marco's suggestion. So, I guess I should change that?

Also, I'm trying to run the tests but I'm having issues installing what's needed. Specifically, pystest is complaining becausecvxpyis not installed. That can't be installed because "Could not build wheels for osqp". I believe there is an issue with the macs with M1/M2. Where can I ask about it? (I don't find a better place than here, sorry).. Friendly ping to @FBruzzesi, just in case. :-)

FBruzzesi commented 6 months ago

Hey @DeaMariaLeon, thanks for the ping. I forgot to add _FairClassifier to the list ๐Ÿ™ˆ Let me fix that in a moment.

Regarding cvxpy, I tried to do a quick search, but couldn't find related material. Maybe osqp build from source guide could help. Keep me posted on that ๐Ÿ˜‡

koaning commented 6 months ago

One thing about that FairClassifier ... there may be a solid reason to consider porting that over to fairlearn at some point. The thinking here is that while scikit-lego is a cool place for somewhat experimental and fun components ... fairness might be a more serious topic so it might be a better home for our fairness tools. No formal decision has ever been made regarding this but it was something that was always in the back of people's mind.

DeaMariaLeon commented 6 months ago

I'll leave FairClassifier for later then. What I noticed is that check_X_y must be kept, as it checks that X is not a sparse matrix (IIUC). The implementation gets more complicated..

Regarding cvxpy, thanks for the help @FBruzzesi. Everything is installed now... But when running pytest, there are 14 errors. I wonder if I should ignore that (Running tests removing my changes):

ERROR tests/test_meta/test_subjective_classifier.py::test_posterior_computation[classes0-prior0-cfm0-0.973]
ERROR tests/test_meta/test_subjective_classifier.py::test_posterior_computation[classes1-prior1-cfm1-0.973]
ERROR tests/test_meta/test_subjective_classifier.py::test_posterior_computation[classes2-prior2-cfm2-0.973]
ERROR tests/test_meta/test_subjective_classifier.py::test_posterior_computation[classes3-prior3-cfm3-0.985]
ERROR tests/test_meta/test_subjective_classifier.py::test_posterior_computation[classes4-prior4-cfm4-1.0]
ERROR tests/test_meta/test_subjective_classifier.py::test_posterior_computation[classes5-prior5-cfm5-0.2]
ERROR tests/test_meta/test_subjective_classifier.py::test_posterior_computation[classes6-prior6-cfm6-0.1]
ERROR tests/test_meta/test_subjective_classifier.py::test_posterior_computation[classes7-prior7-cfm7-0.696]
ERROR tests/test_meta/test_subjective_classifier.py::test_predict_proba[predict_proba-expected_probas0]
ERROR tests/test_meta/test_subjective_classifier.py::test_predict_proba[confusion_matrix-expected_probas1]
ERROR tests/test_meta/test_subjective_classifier.py::test_predict_proba[both-expected_probas2]
ERROR tests/test_preprocessing/test_outlier_remover.py::test_no_outliers
ERROR tests/test_preprocessing/test_outlier_remover.py::test_remove_outlier
ERROR tests/test_preprocessing/test_outlier_remover.py::test_do_not_refit
DeaMariaLeon commented 6 months ago

Maybe we should write which function we are working on, so we don't overlap?

I would like to work on PandasTypeSelector if that's ok.

DeaMariaLeon commented 6 months ago

Scikit-learn has a very similar function than PandasTypeSelector: https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_selector.html

Is TypesSelector still needed then? (I imagine that it is, just making sure here).

koaning commented 6 months ago

Some of those tools were added 5-6 years ago back when sklearn did not support those features. Scikit is a lot further now but on the scikit-lego side of things we've kept things around since we didn't really see a reason to remove them. Since we're interested in giving narwhals a spin here I'd argue it'd probably be good to keep them around if only because the implementation exercise will also demonstrate that it can be used in other projects.

DeaMariaLeon commented 6 months ago

Narwhals is not ready to easily work for PandasTypeSelector yet.. per Marco's suggestion I went back to work with linear_models._FairClassifier. @FBruzzesi

MarcoGorelli commented 6 months ago

Been looking into TimeGapSplit:

However, other than the index, X has no knowledge of how it relates to date_serie, so... ๐Ÿค” not sure how we're going to get this one to work for index-less libraries. Maybe for those, require that X is the same length as date_serie, and then align positionally?

FBruzzesi commented 6 months ago

@DeaMariaLeon sure, no problem for me! Let me know if I can help on the Narwhals side to make the missing bits available. @MarcoGorelli TimeGapSplit does a few things in a bit of unconventional way imho.

Maybe for those, require that X is the same length as date_serie, and then align positionally?

This is a good assumption in my opinion (but I am biased as I do the same in timebasedcv)

FBruzzesi commented 6 months ago

@MarcoGorelli I may have made a commit bypassing review by accident ๐Ÿ™ˆ. Could you take a look at it when you have the time?

It should make InformationFilter dataframe agnostic

MarcoGorelli commented 6 months ago

It should make InformationFilter dataframe agnostic

Nice, looks good! looks like it preserves the existing logic

This has been quite a group effort, and much progress has been made - at what point do you think a release might be warranted? I'm presenting Narwhals at PyCon Italia in the 23rd of May (in 12 days), would be pretty cool if I could give a live demo of scikit-lego ๐Ÿ˜Ž I understand if you'd rather wait until there's a bit more though, no pressure!

FBruzzesi commented 6 months ago

@MarcoGorelli that would be grand! Let's try to make a release within this week (or next weekend at latest?!).

Personally I would love to be able to replace PandasTypeSelector as well and have a "Hey we replaced the entire dataframe module to be agnostic"-release ๐Ÿ˜‚

koaning commented 6 months ago

would be pretty cool if I could give a live demo of scikit-lego ๐Ÿ˜Ž I understand if you'd rather wait until there's a bit more though, no pressure!

@MarcoGorelli the stuff that's already been done seems sufficient to me as a "demo". It's mainly to report that folks are indeed working on the integration and that sofar it seems to work.

MarcoGorelli commented 6 months ago

Totally! I just meant, it would be nice to be able to demo this result:

In [15]: %timeit add_lags(df, ['a', 'b', 'c'], [1, 2, 3, 4, 5])
4.35 ms ยฑ 206 ยตs per loop (mean ยฑ std. dev. of 7 runs, 100 loops each)

In [16]: %timeit pl.from_pandas(add_lags(df.to_pandas(), ['a', 'b', 'c'], [1, 2, 3, 4, 5]))
140 ms ยฑ 1.83 ms per loop (mean ยฑ std. dev. of 7 runs, 10 loops each)

In [17]: type(df)
Out[17]: polars.dataframe.frame.DataFrame

and say something like "This is why "just convert to pandas" isn't a satisfying solution. Native Polars support brings you library to another level - you can use this today, go out and pip install -U sklego and pass in Polars dataframes as you please!"

Let's try to make a release within this week (or next weekend at latest?!).

Sounds great! Given the pace at which things are moving, it may be realistic to have the whole thing done by then. If in a couple of places, non-pandas input is converted to pandas under the hood (like the grouped predictor in https://github.com/koaning/scikit-lego/pull/667) then TBH I think that's fine as a temporary solution. And I don't think it would be temporary for too long ๐Ÿš€

MarcoGorelli commented 6 months ago

Consider creating another class TypeSelector and deprecate this one [PandasTypeSelector]

Been thinking about this one. I think what would be good to aim for would be:

One suggestion I have is:

Like this, scikit-lego keeps the same API for pandas and non-pandas users. For pandas, it keeps using select_dtypes (so, perfect backwards compatibility), whereas for non-pandas it uses Narwhals' selectors.by_dtype (which I still need to implement, but it shouldn't be too bad) under the hood but keeps the existing user-facing API


An alternative would be to let users passing in Narwhals dtypes, but I'm not sure that end users should know about Narwhals ๐Ÿค”

FBruzzesi commented 6 months ago

I just merged #669 and #670. I would argue we are in a very good position for a release, I will open a PR to merge the dev branch into main and bump version.

I am sorry that I am a bit stuck on the GroupedPredictor for now, but it will get there soon(ish) ๐Ÿ˜

MarcoGorelli commented 6 months ago

cool, thanks!

granted, I'm slightly biased as I'm quite keen to present this as "you can try this today!" at the conference, but to be honest I don't think GroupedPredictor should be a blocker - we can definitely solve it, I don't see any theoretical reasons we can't do it, it'll just take a little longer

FBruzzesi commented 5 months ago

Closed by #671 Thanks everyone! This was a huge effort, very appreciated and we are quite enthusiastic by the result!

koaning commented 5 months ago

Just so folks know, it's all live now. Might make sense to see if we can collaborate on making a splash with an announcement?

MarcoGorelli commented 5 months ago

Might make sense to see if we can collaborate on making a splash with an announcement?

Yup! The blog post is now live: https://labs.quansight.org/blog/scikit-lego-narwhals

Some posts I made: