scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

Adding a log1p transformer compatible with Pipeline and Grid Search #28452

Open AlexandreGazagnes opened 8 months ago

AlexandreGazagnes commented 8 months ago

Describe the workflow you want to enable

Using a pipeline and a grid search, I want to check whether it is better to apply log1p to some columns, depending on a skewness threshold.

The code would look like this for the pipeline:


from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline(
    [
        ("log1p", LogColumnTransformer()),
        ("scaler", StandardScaler()),
        ("estimator", RandomForestClassifier()),
    ]
)

for the grid search:


param_grid = {
    "log1p__threshold": [0.5, 1, 1.5, 3, 3.5],
    "scaler": [StandardScaler(), "passthrough"],
    "estimator__n_estimators": [100, 200, 300],
}

and of course:

from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    pipeline,
    param_grid=param_grid,
    cv=5,
    refit=True,
    return_train_score=True,
    n_jobs=-1,
    verbose=0,
)

Describe your proposed solution

I have already implemented such a class and it works.

I think this code is not of sufficient quality to be integrated into scikit-learn as-is, but such a feature would be a good idea.

Indicative source code can be found here: file

Just as an option, here's the code:


import numpy as np
import pandas as pd

from sklearn.base import BaseEstimator, TransformerMixin

# Note: SkewThreshold, Bool, manage_input, manage_negatives, manage_columns
# and manage_output are helpers defined in the linked source file.


class LogColumnTransformer(BaseEstimator, TransformerMixin):
    """Logarithm transformer for columns with high skewness"""

    threshold = SkewThreshold()
    ignore_int = Bool()
    force_df_out = Bool()

    def __init__(
        self,
        threshold: int | float = 3,
        ignore_int: bool = False,
        force_df_out: bool = False,
    ) -> None:
        """Init method"""

        if not isinstance(threshold, (float, int)):
            raise TypeError("threshold must be a float or an integer")

        if not isinstance(force_df_out, (int, bool)):
            raise TypeError("force_df_out must be a boolean")

        self.force_df_out = force_df_out
        self.ignore_int = ignore_int
        self.threshold = threshold
        self._log_cols = None
        self._standard_cols = None
        self.fitted_columns = None

    def fit(
        self,
        X: pd.DataFrame | np.ndarray | list,
        y: None = None,
    ):
        """Fit method"""

        _X = manage_input(X)
        self.fitted_columns = _X.columns.tolist()

        manage_negatives(_X)

        _X = _X.select_dtypes(include=["number"])

        if self.ignore_int:
            _X = _X.select_dtypes(exclude=["int"])

        # compute skew
        skew = _X.skew().round(3).to_dict()

        self._log_cols = []
        self._standard_cols = []

        # use threshold
        for col in skew:
            if skew[col] >= self.threshold:
                self._log_cols.append(col)
            else:
                self._standard_cols.append(col)

        return self

    def transform(
        self,
        X: pd.DataFrame | np.ndarray | list,
        y: None = None,
    ) -> pd.DataFrame | np.ndarray:
        """Transform method"""

        _X = manage_input(X)
        _X = manage_columns(_X, self.fitted_columns)

        for col in self._log_cols:
            if col not in _X.columns:
                continue

            _X[col] = np.log1p(_X[col])

        return manage_output(_X, self.force_df_out)
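
Since the class above depends on helpers from the linked file, here is a minimal, self-contained sketch of the same idea, assuming a pandas DataFrame with non-negative numeric columns (the manage_* helpers, negative-value handling, and the ignore_int/force_df_out options are omitted; the class name MinimalLog1pTransformer is illustrative):

```python
import numpy as np
import pandas as pd

from sklearn.base import BaseEstimator, TransformerMixin


class MinimalLog1pTransformer(BaseEstimator, TransformerMixin):
    """Apply np.log1p to numeric columns whose skewness exceeds a threshold.

    Minimal sketch: assumes a pandas DataFrame input with non-negative
    values in its numeric columns.
    """

    def __init__(self, threshold: float = 3.0):
        # store the param untouched so get_params/set_params (and hence
        # GridSearchCV's "log1p__threshold") work out of the box
        self.threshold = threshold

    def fit(self, X: pd.DataFrame, y=None):
        skew = X.select_dtypes(include=["number"]).skew()
        # columns at or above the threshold get log1p-transformed
        self.log_cols_ = skew[skew >= self.threshold].index.tolist()
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        X = X.copy()
        for col in self.log_cols_:
            X[col] = np.log1p(X[col])
        return X


df = pd.DataFrame(
    {
        "a": np.exp(np.linspace(0, 6, 50)),  # strongly right-skewed
        "b": np.linspace(0.0, 1.0, 50),      # symmetric, skew ~ 0
    }
)
tf = MinimalLog1pTransformer(threshold=1.0).fit(df)
out = tf.transform(df)
```

Because the hyperparameter is stored verbatim in __init__, this drops straight into the Pipeline/GridSearchCV example above.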

Describe alternatives you've considered, if relevant

Additional context

An example notebook can be found here: notebook

glemaitre commented 8 months ago

I think that we can achieve the expected behaviour once the following PR is in: https://github.com/scikit-learn/scikit-learn/pull/27722

You can build a pipeline with this SelectThreshold using a skewness function, followed by a FunctionTransformer wrapping np.log1p, since we don't need to hold any state.
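
For reference, the stateless part of this suggestion can already be sketched today with ColumnTransformer plus FunctionTransformer; the column selection is done up front here because SelectThreshold (PR #27722) is not merged, and the column names and threshold are illustrative:

```python
import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

X = pd.DataFrame(
    {
        "skewed": np.exp(np.linspace(0, 6, 100)),  # strongly right-skewed
        "flat": np.linspace(0.0, 1.0, 100),        # symmetric, skew ~ 0
    }
)

# pick the columns to log-transform based on a skewness threshold
threshold = 1.0
log_cols = [c for c in X.columns if X[c].skew() >= threshold]

pre = ColumnTransformer(
    # FunctionTransformer(np.log1p) is stateless: nothing is learned in fit
    [("log1p", FunctionTransformer(np.log1p), log_cols)],
    remainder="passthrough",
)
pipe = Pipeline([("pre", pre), ("scaler", StandardScaler())])
Xt = pipe.fit_transform(X)
```

The limitation, as the issue points out, is that the threshold itself is not a pipeline parameter here, so it cannot be tuned via GridSearchCV without a dedicated transformer.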