koaning / scikit-lego

Extra blocks for scikit-learn pipelines.
https://koaning.github.io/scikit-lego/
MIT License
1.28k stars 118 forks source link

feat: Make grouped and hierarchical dataframe-agnostic #667

Closed FBruzzesi closed 6 months ago

FBruzzesi commented 6 months ago

Description

Addresses Grouped* and Hierarchical, following up to the comment I added in the thread. I will paste it for easier visibility:

I am having a hard time to think of another way of implementing meta estimators that use pandas just for it's grouping functionalities. Namely HierarchicalPredictors and Grouped* are not necessarily taking a dataframe input, but use pandas groupby's to iterate over different batch of data.

  • Would this be possible using pure numpy so that we don't need the pandas dependency? yes
  • Would this be significantly more inconvenient to maintain? absolutely yes

My opinion/suggestion is to leave them as they are in their logic.

If anything, we can allow arbitrary dataframe types as input, and use Narwhals to convert them to pandas and let the rest flow as is.

MarcoGorelli commented 6 months ago

πŸ€” are you sure about adding pyarrow as required? I think we should be able to solve this one

but use pandas groupby's to iterate over different batch of data.

Narwhals' DataFrame.GroupBy also lets you iterate over groups - I'll take a closer look at what's required, but this seems feasible

FBruzzesi commented 6 months ago

@MarcoGorelli oh I see what you mean! Regarding hierarchical I am 100% sure it is doable, I can do it tomorrow myself (Edit: ok maybe let's say 99 as I need some missing features)

For Grouped, internals are a bit messy, but I will give it a try.

FBruzzesi commented 6 months ago

I am really pushing the boundaries for GroupedPredictor, I think within another swipe I can finish this.

Current status is:

FBruzzesi commented 6 months ago

Fixed all the issues I had in the conversion and marking this as ready to review. I am quite happy with the results πŸ™ŒπŸΌπŸ™ŒπŸΌ

The only breaking change from previous version/implementation is the unfeasibility of providing negative indices in groups.

cc: @koaning @MarcoGorelli

MarcoGorelli commented 6 months ago

Wow, amazing!

The only breaking change from previous version/implementation is the unfeasibility of providing negative indices in groups.

This is only when starting from non-dataframe input (e.g. numpy) right? If so, then would a patch like

diff --git a/sklego/meta/_grouped_utils.py b/sklego/meta/_grouped_utils.py
index 97180ab..e744f5a 100644
--- a/sklego/meta/_grouped_utils.py
+++ b/sklego/meta/_grouped_utils.py
@@ -20,7 +20,6 @@ def parse_X_y(X, y, groups, check_X=True, **kwargs) -> nw.DataFrame:
     _data_format_checks(X)

     # Convert X to Narwhals frame
-    X = nw.from_native(X, strict=False, eager_only=True)
     if not isinstance(X, nw.DataFrame):
         X = nw.from_native(pd.DataFrame(X))

diff --git a/sklego/meta/grouped_transformer.py b/sklego/meta/grouped_transformer.py
index 9f6b0d9..9c2eac3 100644
--- a/sklego/meta/grouped_transformer.py
+++ b/sklego/meta/grouped_transformer.py
@@ -109,6 +109,9 @@ class GroupedTransformer(BaseEstimator, TransformerMixin):
         self.fallback_ = None
         self.groups_ = as_list(self.groups) if self.groups is not None else None

+        X = nw.from_native(X, strict=False, eager_only=True)
+        if not isinstance(X, nw.DataFrame) and self.groups_ is not None:
+            self.groups_ = [X.shape[1]+group if isinstance(group, int) and group<0 else group for group in self.groups_]
         frame = parse_X_y(X, y, self.groups_, check_X=self.check_X, **self._check_kwargs)

         if self.groups is None:
@@ -181,6 +184,7 @@ class GroupedTransformer(BaseEstimator, TransformerMixin):
         """
         check_is_fitted(self, ["fallback_", "transformers_"])

+        X = nw.from_native(X, strict=False, eager_only=True)
         frame = parse_X_y(X, y=None, groups=self.groups_, check_X=self.check_X, **self._check_kwargs).drop(
             "__sklego_target__"
         )
diff --git a/tests/test_meta/test_grouped_transformer.py b/tests/test_meta/test_grouped_transformer.py
index b9aae2c..18b28d5 100644
--- a/tests/test_meta/test_grouped_transformer.py
+++ b/tests/test_meta/test_grouped_transformer.py
@@ -291,7 +291,7 @@ def test_array_with_multiple_string_cols(penguins):
     X = penguins

     # BROKEN: Failing due to negative indexing... kind of an edge case
-    meta = GroupedTransformer(StandardScaler(), groups=[0, X.shape[1] - 1])
+    meta = GroupedTransformer(StandardScaler(), groups=[0, -1])

     transformed = meta.fit_transform(X)

allow you to preserve current behaviour?

FBruzzesi commented 6 months ago

That's a nice trick! Thanks Marco! Pushing it now

FBruzzesi commented 6 months ago

This should be ready for review. Once we merge into narwhals-development then I would say that we covered large majority of #658 and it would be time for a (quite large) release. @koaning WDYT?