ENH Polars support in Joiner - Githubissues

skrub-data / skrub

Prepping tables for machine learning

https://skrub-data.org/

BSD 3-Clause "New" or "Revised" License

1.04k stars 91 forks source link

ENH Polars support in Joiner #945

Closed TheooJ closed 1 week ago

TheooJ commented 2 weeks ago

Refactoring the Joiner:

Support for polars dataframes, use selectors + the dispatch mechanism in Joiner & fuzzy_join()
Better handling of duplicate column names, and input dataframe checks
Dispatch left_join() to have a single join utils
- Cleanup in agg_joiner to remove _pandas/_polars.join()
Dispatch with_columns

TheooJ commented 1 week ago

Should we rename self._matching into self._matcher, as well as the associated files _matching into _matcher ? Or do we keep this for an upcoming PR ?

jeromedockes commented 1 week ago

Should we rename self._matching into self._matcher, as well as the associated files _matching into _matcher ? Or do we keep this for an upcoming PR ?

I would prefer to leave it for another PR

jeromedockes commented 1 week ago

the CI failure is due to the fact polars support in the column transformer was only added in scikit-learn 1.4, when polars was added to the set_output API: https://github.com/scikit-learn/scikit-learn/pull/27315/files#diff-66316a90c42bbe375eef74f1b9dcbd6a1b5b99f60feb692f676067a633d30f60R226

jeromedockes commented 1 week ago

I guess a workaround would be to provide the columns directly as integer indices to the columntransformer instead of strings

jeromedockes commented 1 week ago

I guess a workaround would be to provide the columns directly as integer indices to the columntransformer instead of strings

if we do this we get another error when scikit-learn tries to access .ndims. but actually here we are just vectorizing the key columns and we want the output as numpy or a scipy sparse matrix anyway, I think we can just convert to pandas before vectorizing.

diff --git a/skrub/_joiner.py b/skrub/_joiner.py
index 0c325a09..795f71c2 100644
--- a/skrub/_joiner.py
+++ b/skrub/_joiner.py
@@ -5,11 +5,13 @@ The Joiner provides fuzzy joining as a scikit-learn transformer.
 from functools import partial

 import numpy as np
+import sklearn
 from sklearn.base import BaseEstimator, TransformerMixin, clone
 from sklearn.compose import make_column_transformer
 from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
 from sklearn.pipeline import make_pipeline
 from sklearn.preprocessing import FunctionTransformer, StandardScaler
+from sklearn.utils.fixes import parse_version
 from sklearn.utils.validation import check_is_fitted

 from . import _dataframe as sbd
@@ -39,6 +41,12 @@ _MATCHERS = {
 DEFAULT_REF_DIST = "random_pairs"

+def _compat_df(df):
+    if parse_version(sklearn.__version__) < parse_version("1.4"):
+        return sbd.to_pandas(df)
+    return df
+
+
 def _make_vectorizer(table, string_encoder, rescale):
     """Construct the transformer used to vectorize joining columns.

@@ -299,7 +307,7 @@ class Joiner(TransformerMixin, BaseEstimator):
             rescale=self.ref_dist != "no_rescaling",
         )
         aux = self.vectorizer_.fit_transform(
-            s.select(self._aux_table, s.cols(*self._aux_key))
+            _compat_df(s.select(self._aux_table, s.cols(*self._aux_key)))
         )
         self._matching.fit(aux)
         return self
@@ -327,7 +335,11 @@ class Joiner(TransformerMixin, BaseEstimator):
             X, self._aux_table, self.suffix, main_table_name="X"
         )
         main = self.vectorizer_.transform(
-            sbd.set_column_names(s.select(X, s.cols(*self._main_key)), self._aux_key)
+            _compat_df(
+                sbd.set_column_names(
+                    s.select(X, s.cols(*self._main_key)), self._aux_key
+                )
+            )
         )
         match_result = self._matching.match(main, self.max_dist_)
         matching_col = match_result["index"].copy()

this fixes the issue

jeromedockes commented 1 week ago

@TheooJ I opened a pr on your branch to fix the issue for old scikit-learn versions: https://github.com/TheooJ/skrub/pull/2