Closed FBruzzesi closed 5 months ago
I'm on it
Thanks for making this tracker!
I think grouping shouldn't be an issue, there's an example of that here: https://narwhals-dev.github.io/narwhals/basics/dataframe/#example-2-group-by-and-mean
If I remember correctly, TimeGapSplit
relies on index alignment, so that one may need a bit more discussion
Before this (nice) tracker was opened, I was taking a look at using Narwhals on _FairClassifier
per Marco's suggestion. So, I guess I should change that?
Also, I'm trying to run the tests but I'm having issues installing what's needed. Specifically, pystest
is complaining becausecvxpy
is not installed. That can't be installed because "Could not build wheels for osqp
". I believe there is an issue with the macs with M1/M2. Where can I ask about it? (I don't find a better place than here, sorry)..
Friendly ping to @FBruzzesi, just in case. :-)
Hey @DeaMariaLeon, thanks for the ping. I forgot to add _FairClassifier
to the list ๐ Let me fix that in a moment.
Regarding cvxpy, I tried to do a quick search, but couldn't find related material. Maybe osqp build from source guide could help. Keep me posted on that ๐
One thing about that FairClassifier ... there may be a solid reason to consider porting that over to fairlearn at some point. The thinking here is that while scikit-lego is a cool place for somewhat experimental and fun components ... fairness might be a more serious topic so it might be a better home for our fairness tools. No formal decision has ever been made regarding this but it was something that was always in the back of people's mind.
I'll leave FairClassifier for later then. What I noticed is that check_X_y
must be kept, as it checks that X
is not a sparse matrix (IIUC). The implementation gets more complicated..
Regarding cvxpy, thanks for the help @FBruzzesi. Everything is installed now... But when running pytest
, there are 14 errors. I wonder if I should ignore that (Running tests removing my changes):
ERROR tests/test_meta/test_subjective_classifier.py::test_posterior_computation[classes0-prior0-cfm0-0.973]
ERROR tests/test_meta/test_subjective_classifier.py::test_posterior_computation[classes1-prior1-cfm1-0.973]
ERROR tests/test_meta/test_subjective_classifier.py::test_posterior_computation[classes2-prior2-cfm2-0.973]
ERROR tests/test_meta/test_subjective_classifier.py::test_posterior_computation[classes3-prior3-cfm3-0.985]
ERROR tests/test_meta/test_subjective_classifier.py::test_posterior_computation[classes4-prior4-cfm4-1.0]
ERROR tests/test_meta/test_subjective_classifier.py::test_posterior_computation[classes5-prior5-cfm5-0.2]
ERROR tests/test_meta/test_subjective_classifier.py::test_posterior_computation[classes6-prior6-cfm6-0.1]
ERROR tests/test_meta/test_subjective_classifier.py::test_posterior_computation[classes7-prior7-cfm7-0.696]
ERROR tests/test_meta/test_subjective_classifier.py::test_predict_proba[predict_proba-expected_probas0]
ERROR tests/test_meta/test_subjective_classifier.py::test_predict_proba[confusion_matrix-expected_probas1]
ERROR tests/test_meta/test_subjective_classifier.py::test_predict_proba[both-expected_probas2]
ERROR tests/test_preprocessing/test_outlier_remover.py::test_no_outliers
ERROR tests/test_preprocessing/test_outlier_remover.py::test_remove_outlier
ERROR tests/test_preprocessing/test_outlier_remover.py::test_do_not_refit
Maybe we should write which function we are working on, so we don't overlap?
I would like to work on PandasTypeSelector
if that's ok.
Scikit-learn has a very similar function than PandasTypeSelector
: https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_selector.html
Is TypesSelector
still needed then? (I imagine that it is, just making sure here).
Some of those tools were added 5-6 years ago back when sklearn did not support those features. Scikit is a lot further now but on the scikit-lego side of things we've kept things around since we didn't really see a reason to remove them. Since we're interested in giving narwhals a spin here I'd argue it'd probably be good to keep them around if only because the implementation exercise will also demonstrate that it can be used in other projects.
Narwhals is not ready to easily work for PandasTypeSelector
yet.. per Marco's suggestion I went back to work with linear_models._FairClassifier
.
@FBruzzesi
Been looking into TimeGapSplit:
cv = TimeGapSplit(date_serie)
is instantiated using a pd.Series
which contains dates as values, and some indexcv.split(X, y)
then aligns X
with the index of date_serie
However, other than the index, X
has no knowledge of how it relates to date_serie
, so... ๐ค not sure how we're going to get this one to work for index-less libraries. Maybe for those, require that X
is the same length as date_serie
, and then align positionally?
@DeaMariaLeon sure, no problem for me! Let me know if I can help on the Narwhals side to make the missing bits available.
@MarcoGorelli TimeGapSplit
does a few things in a bit of unconventional way imho.
Maybe for those, require that X is the same length as date_serie, and then align positionally?
This is a good assumption in my opinion (but I am biased as I do the same in timebasedcv
)
@MarcoGorelli I may have made a commit bypassing review by accident ๐. Could you take a look at it when you have the time?
It should make InformationFilter
dataframe agnostic
It should make InformationFilter dataframe agnostic
Nice, looks good! looks like it preserves the existing logic
This has been quite a group effort, and much progress has been made - at what point do you think a release might be warranted? I'm presenting Narwhals at PyCon Italia in the 23rd of May (in 12 days), would be pretty cool if I could give a live demo of scikit-lego ๐ I understand if you'd rather wait until there's a bit more though, no pressure!
@MarcoGorelli that would be grand! Let's try to make a release within this week (or next weekend at latest?!).
Personally I would love to be able to replace PandasTypeSelector
as well and have a "Hey we replaced the entire dataframe module to be agnostic"-release ๐
would be pretty cool if I could give a live demo of scikit-lego ๐ I understand if you'd rather wait until there's a bit more though, no pressure!
@MarcoGorelli the stuff that's already been done seems sufficient to me as a "demo". It's mainly to report that folks are indeed working on the integration and that sofar it seems to work.
Totally! I just meant, it would be nice to be able to demo this result:
In [15]: %timeit add_lags(df, ['a', 'b', 'c'], [1, 2, 3, 4, 5])
4.35 ms ยฑ 206 ยตs per loop (mean ยฑ std. dev. of 7 runs, 100 loops each)
In [16]: %timeit pl.from_pandas(add_lags(df.to_pandas(), ['a', 'b', 'c'], [1, 2, 3, 4, 5]))
140 ms ยฑ 1.83 ms per loop (mean ยฑ std. dev. of 7 runs, 10 loops each)
In [17]: type(df)
Out[17]: polars.dataframe.frame.DataFrame
and say something like "This is why "just convert to pandas" isn't a satisfying solution. Native Polars support brings you library to another level - you can use this today, go out and pip install -U sklego
and pass in Polars dataframes as you please!"
Let's try to make a release within this week (or next weekend at latest?!).
Sounds great! Given the pace at which things are moving, it may be realistic to have the whole thing done by then. If in a couple of places, non-pandas input is converted to pandas under the hood (like the grouped predictor in https://github.com/koaning/scikit-lego/pull/667) then TBH I think that's fine as a temporary solution. And I don't think it would be temporary for too long ๐
Consider creating another class TypeSelector and deprecate this one [PandasTypeSelector]
Been thinking about this one. I think what would be good to aim for would be:
PandasTypeSelector
shouldn't notice the changeOne suggestion I have is:
select_dtypes
(no change from the status quo)Like this, scikit-lego keeps the same API for pandas and non-pandas users. For pandas, it keeps using select_dtypes
(so, perfect backwards compatibility), whereas for non-pandas it uses Narwhals' selectors.by_dtype
(which I still need to implement, but it shouldn't be too bad) under the hood but keeps the existing user-facing API
An alternative would be to let users passing in Narwhals dtypes, but I'm not sure that end users should know about Narwhals ๐ค
I just merged #669 and #670. I would argue we are in a very good position for a release, I will open a PR to merge the dev branch into main and bump version.
I am sorry that I am a bit stuck on the GroupedPredictor for now, but it will get there soon(ish) ๐
cool, thanks!
granted, I'm slightly biased as I'm quite keen to present this as "you can try this today!" at the conference, but to be honest I don't think GroupedPredictor should be a blocker - we can definitely solve it, I don't see any theoretical reasons we can't do it, it'll just take a little longer
Closed by #671 Thanks everyone! This was a huge effort, very appreciated and we are quite enthusiastic by the result!
Just so folks know, it's all live now. Might make sense to see if we can collaborate on making a splash with an announcement?
Might make sense to see if we can collaborate on making a splash with an announcement?
Yup! The blog post is now live: https://labs.quansight.org/blog/scikit-lego-narwhals
Some posts I made:
Description
Creating this issue to keep track of which classes/function could benefit from adopting Narwhals.
preprocessing.ColumnDropper
preprocessing.ColumnSelector
preprocessing.PandasTypeSelector
TypeSelector
and deprecate this onecommon.TrainOnlyTransformerMixin
.to_numpy()
and hash the array data?model_selection.TimeGapSplit
model_selection.GroupedTimeSeriesSplit
projections.InformationFilter
meta.RegressionOutlierDetector
meta.hierarchical_predictor.py
meta.grouped_transformer.py
meta.grouped_predictor.py
linear_models._FairClassifier
pandas_utils.py
datasets.py
read_csv
functionPersonally I would wait to have at least
preprocessing.pandastransformers.py
entire migration before bumping to v0.9.0.cc @MarcoGorelli @anopsy
(*) Regarding
pandas_utils
Changing
log_step
to narwhals is fairly easy (~4 lines of code), however as this decorator is supposed to work for any function that operates on pandas, doing so would limit its functionality. It could be reasonable to add another one which although restricted to narwhals methods, it can interoperate with all its compatible dataframes.Legend
โ Done ๐ง WIP ๐ฒ Not Started ๐ซ Won't do