Closed TheooJ closed 1 week ago
Should we rename self._matching
into self._matcher
, as well as the associated files _matching
into _matcher
? Or do we keep this for an upcoming PR ?
Should we rename
self._matching
intoself._matcher
, as well as the associated files_matching
into_matcher
? Or do we keep this for an upcoming PR ?
I would prefer to leave it for another PR
the CI failure is due to the fact polars support in the column transformer was only added in scikit-learn 1.4, when polars was added to the set_output API: https://github.com/scikit-learn/scikit-learn/pull/27315/files#diff-66316a90c42bbe375eef74f1b9dcbd6a1b5b99f60feb692f676067a633d30f60R226
I guess a workaround would be to provide the columns directly as integer indices to the columntransformer instead of strings
I guess a workaround would be to provide the columns directly as integer indices to the columntransformer instead of strings
if we do this we get another error when scikit-learn tries to access .ndims
. but actually here we are just vectorizing the key columns and we want the output as numpy or a scipy sparse matrix anyway, I think we can just convert to pandas before vectorizing.
diff --git a/skrub/_joiner.py b/skrub/_joiner.py
index 0c325a09..795f71c2 100644
--- a/skrub/_joiner.py
+++ b/skrub/_joiner.py
@@ -5,11 +5,13 @@ The Joiner provides fuzzy joining as a scikit-learn transformer.
from functools import partial
import numpy as np
+import sklearn
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
+from sklearn.utils.fixes import parse_version
from sklearn.utils.validation import check_is_fitted
from . import _dataframe as sbd
@@ -39,6 +41,12 @@ _MATCHERS = {
DEFAULT_REF_DIST = "random_pairs"
+def _compat_df(df):
+ if parse_version(sklearn.__version__) < parse_version("1.4"):
+ return sbd.to_pandas(df)
+ return df
+
+
def _make_vectorizer(table, string_encoder, rescale):
"""Construct the transformer used to vectorize joining columns.
@@ -299,7 +307,7 @@ class Joiner(TransformerMixin, BaseEstimator):
rescale=self.ref_dist != "no_rescaling",
)
aux = self.vectorizer_.fit_transform(
- s.select(self._aux_table, s.cols(*self._aux_key))
+ _compat_df(s.select(self._aux_table, s.cols(*self._aux_key)))
)
self._matching.fit(aux)
return self
@@ -327,7 +335,11 @@ class Joiner(TransformerMixin, BaseEstimator):
X, self._aux_table, self.suffix, main_table_name="X"
)
main = self.vectorizer_.transform(
- sbd.set_column_names(s.select(X, s.cols(*self._main_key)), self._aux_key)
+ _compat_df(
+ sbd.set_column_names(
+ s.select(X, s.cols(*self._main_key)), self._aux_key
+ )
+ )
)
match_result = self._matching.match(main, self.max_dist_)
matching_col = match_result["index"].copy()
this fixes the issue
@TheooJ I opened a pr on your branch to fix the issue for old scikit-learn versions: https://github.com/TheooJ/skrub/pull/2
Refactoring the
Joiner
:Joiner
&fuzzy_join()
left_join()
to have a single join utilsagg_joiner
to remove_pandas/_polars.join()
with_columns