Bug Report: `GridSearchCV` does not work with Scalers

FrizzoDavide commented 4 months ago

The GridSearchCV class, used together with CeruleoMetricWrapper, raises an error when the dataset's Transformer includes a Scaler.

Let's consider the following example:

Load the CMAPSSDataset and define the FEATURES as the most relevant sensor measurement features:

# Load the dataset
from ceruleo.dataset.catalog.CMAPSS import CMAPSSDataset, sensor_indices

train_dataset = CMAPSSDataset(train=True, models='FD001')
test_dataset = CMAPSSDataset(train=False, models='FD001')[15:30]

# Define list of sensor measurement features
FEATURES = [train_dataset[0].columns[i] for i in sensor_indices]

Define a simple Transformer:

transformer = Transformer(
    pipelineX=make_pipeline(
        ByNameFeatureSelector(features=FEATURES),
        MinMaxScaler(range=(-1, 1))
    ),
    pipelineY=make_pipeline(
        ByNameFeatureSelector(features=['RUL']),
    )
)

Note that here I am using the MinMaxScaler to scale the data to the range (-1, 1).
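For reference, scaling to a target range usually follows the standard min-max formula. The sketch below is plain Python and is not CeRULEo's implementation; `min_max_scale` is a hypothetical helper that only shows the arithmetic.

```python
def min_max_scale(values, target_range=(-1.0, 1.0)):
    """Linearly map values so that min -> low and max -> high.

    Hypothetical helper for illustration; not part of ceruleo.
    """
    low, high = target_range
    data_min, data_max = min(values), max(values)
    divisor = data_max - data_min  # requires numeric min and max
    if divisor == 0:
        # A constant feature has no spread; map it to the middle of the range.
        return [(low + high) / 2 for _ in values]
    return [(v - data_min) / divisor * (high - low) + low for v in values]


scaled = min_max_scale([10.0, 15.0, 20.0])
print(scaled)  # [-1.0, 0.0, 1.0]
```

The subtraction `data_max - data_min` is only defined when both values are numeric.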

Define an instance of GridSearchCV to compare different regression models:

regressor_gs = CeruleoRegressor(
    TimeSeriesWindowTransformer(
        transformer,
        window_size=32,
        padding=True,
        step=1),   
    Ridge(alpha=15))

grid_search = GridSearchCV(
    estimator=regressor_gs,
    param_grid={
        'ts_window_transformer__window_size': [5, 10],
        'regressor': [Ridge(alpha=15), RandomForestRegressor(max_depth=5)]
    },
    scoring=CeruleoMetricWrapper('neg_mean_absolute_error')
)

grid_search.fit(train_dataset)

The output printed after launching grid_search.fit(train_dataset) is:

There was an error when transforming with MinMaxScaler
There was an error when transforming with MinMaxScaler
There was an error when transforming with MinMaxScaler
...

And then the final error message is:

TypeError: unsupported operand type(s) for -: 'str' and 'str'

So there are probably two operands that should be subtracted from one another, but since both are strings the operation fails.
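The message itself is easy to reproduce in isolation. A minimal sketch, assuming the two operands end up as plain Python strings (the values below are made up):

```python
data_max = "21.61"  # made-up value, stored as text instead of a number
data_min = "21.05"

try:
    divisor = data_max - data_min
except TypeError as exc:
    message = str(exc)
    print(message)  # unsupported operand type(s) for -: 'str' and 'str'

# Converting to float first makes the subtraction well defined.
divisor = float(data_max) - float(data_min)
```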

Looking at the rest of the long error message, we also find:

ValueError: 
All the 20 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.
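The figure of 20 is consistent with the grid defined above: 2 window sizes times 2 regressors gives 4 candidates, and GridSearchCV uses 5-fold cross-validation when cv is not set. A small sketch of that arithmetic, with itertools.product standing in for scikit-learn's candidate enumeration (the regressors are represented by name only):

```python
from itertools import product

param_grid = {
    "ts_window_transformer__window_size": [5, 10],
    "regressor": ["Ridge(alpha=15)", "RandomForestRegressor(max_depth=5)"],
}
n_splits = 5  # GridSearchCV's default when cv is not given

# Every combination of one value per parameter is one candidate.
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
total_fits = len(candidates) * n_splits
print(len(candidates), total_fits)  # 4 20
```

So "All the 20 fits failed" means that every candidate failed on every fold, not just one configuration.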

As suggested in the error message, I added the error_score='raise' argument to GridSearchCV to get a more detailed explanation of the error.
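A minimal, self-contained sketch of what error_score='raise' changes. `BrokenRegressor` is a hypothetical estimator whose fit always fails; it stands in for the CeRULEo pipeline and is not part of any library.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.model_selection import GridSearchCV


class BrokenRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical estimator whose fit always fails (illustration only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X, y):
        # Mimic the failure from this report: subtracting two strings.
        data_max, data_min = "3.0", "1.0"
        self.divisor_ = data_max - data_min
        return self

    def predict(self, X):
        return np.zeros(len(X))


X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)

search = GridSearchCV(
    BrokenRegressor(),
    param_grid={"alpha": [0.1, 1.0]},
    cv=2,
    error_score="raise",  # re-raise the first failure instead of recording a score
)

raised = None
try:
    search.fit(X, y)
except TypeError as exc:
    raised = exc
    print(f"{type(exc).__name__}: {exc}")
```

With this setting the original exception surfaces with its own traceback, which points at the failing line rather than at the summary raised by the search.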

Looking at the new error message I think that the source of the error is in the transform method of the MinMaxScaler class contained in ceruleo.transformation.features.scalers:

def transform(self, X: pd.DataFrame) -> pd.DataFrame:

    try:
        divisor = self.data_max - self.data_min

The subtraction above is where the error occurs: self.data_max and self.data_min both hold strings, which produces the TypeError: unsupported operand type(s) for -: 'str' and 'str' reported above.

I tried running the code again after placing import ipdb; ipdb.set_trace() inside the transform function, but for some reason execution did not stop at the breakpoint as it normally does with ipdb.

I was also able to access the data_max and data_min attributes with transformer.pipelineX.final_step.data_max and transformer.pipelineX.final_step.data_min and I was also able to do:

transformer.pipelineX.final_step.data_max-transformer.pipelineX.final_step.data_min

without any error.

So I do not really have a clue why this bug appears.
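One detail that may matter when inspecting the transformer by hand: GridSearchCV does not fit the object passed to it but clones of it, so fitted attributes seen on the original are not necessarily the ones used inside the search. Whether this is related to the error here is an assumption; the sketch below only demonstrates the cloning behaviour, using scikit-learn's own MinMaxScaler as a stand-in for the CeRULEo one.

```python
import numpy as np
from sklearn.base import clone
from sklearn.preprocessing import MinMaxScaler  # scikit-learn's scaler, not ceruleo's

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

original = MinMaxScaler(feature_range=(-1, 1)).fit(X)
cloned = clone(original)  # same constructor parameters, no fitted state

has_state_original = hasattr(original, "data_max_")
has_state_cloned = hasattr(cloned, "data_max_")
print(has_state_original, has_state_cloned)  # True False
```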

As expected, running the code without MinMaxScaler, that is with:

transformer = Transformer(
    pipelineX=make_pipeline(
        ByNameFeatureSelector(features=FEATURES), 
    ), 
    pipelineY=make_pipeline(
        ByNameFeatureSelector(features=['RUL']),  
    )
)

it works without any errors.

lucianolorenti commented 4 months ago

Hi!

I wasn't able to reproduce the error. Which version of scikit-learn are you using? I am using 1.4.1.post1.

I am using the following snippet to test it:

from ceruleo.dataset.catalog.CMAPSS import CMAPSSDataset, sensor_indices
from ceruleo.transformation import Transformer
from ceruleo.transformation.functional.pipeline.pipeline import make_pipeline
from ceruleo.transformation.features.selection import ByNameFeatureSelector
from ceruleo.transformation.features.scalers import MinMaxScaler
from ceruleo.models.sklearn import (
    CeruleoRegressor,
    EstimatorWrapper,
    TimeSeriesWindowTransformer,
    predict,
    train_model,
    CeruleoMetricWrapper
)
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

train_dataset = CMAPSSDataset(train=True, models='FD001')
test_dataset = CMAPSSDataset(train=False, models='FD001')[15:30]

# Define list of sensor measurement features
FEATURES = [train_dataset[0].columns[i] for i in sensor_indices]

transformer = Transformer(
    pipelineX=make_pipeline(
        ByNameFeatureSelector(features=FEATURES),
        MinMaxScaler(range=(-1, 1))
    ),
    pipelineY=make_pipeline(
        ByNameFeatureSelector(features=['RUL']),
    )
)

regressor_gs = CeruleoRegressor(
    TimeSeriesWindowTransformer(
        transformer,
        window_size=32,
        padding=True,
        step=1),   
    Ridge(alpha=15))

grid_search = GridSearchCV(
    estimator=regressor_gs,
    param_grid={
        'ts_window_transformer__window_size': [5, 10],
        'regressor': [Ridge(alpha=15), RandomForestRegressor(max_depth=5)]
    },
    scoring=CeruleoMetricWrapper('neg_mean_absolute_error'),
    verbose=5
)

grid_search.fit(train_dataset)

FrizzoDavide commented 4 months ago

I checked my scikit-learn version and I had 1.2.2. I upgraded it to 1.4.1.post1 and tried to run the code snippet you provided, but now I am getting a new error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[7], line 20
      1 regressor_gs = CeruleoRegressor(
      2     TimeSeriesWindowTransformer(
      3         transformer,
   (...)
      6         step=1),   
      7     Ridge(alpha=15))
      9 grid_search = GridSearchCV(
     10     estimator=regressor_gs,
     11      param_grid={
   (...)
     16     verbose=5
     17 )
---> 20 grid_search.fit(train_dataset)

File ~/anaconda3/envs/bearing/lib/python3.11/site-packages/sklearn/base.py:1474, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
   1467     estimator._validate_params()
   1469 with config_context(
   1470     skip_parameter_validation=(
   1471         prefer_skip_nested_validation or global_skip_validation
   1472     )
   1473 ):
-> 1474     return fit_method(estimator, *args, **kwargs)

File ~/anaconda3/envs/bearing/lib/python3.11/site-packages/sklearn/model_selection/_search.py:970, in BaseSearchCV.fit(self, X, y, **params)
    964     results = self._format_results(
    965         all_candidate_params, n_splits, all_out, all_more_results
    966     )
    968     return results
--> 970 self._run_search(evaluate_candidates)
    972 # multimetric is determined here because in the case of a callable
    973 # self.scoring the return type is only known after calling
    974 first_test_score = all_out[0]["test_scores"]

File ~/anaconda3/envs/bearing/lib/python3.11/site-packages/sklearn/model_selection/_search.py:1527, in GridSearchCV._run_search(self, evaluate_candidates)
   1525 def _run_search(self, evaluate_candidates):
   1526     """Search all candidates in param_grid"""
-> 1527     evaluate_candidates(ParameterGrid(self.param_grid))

File ~/anaconda3/envs/bearing/lib/python3.11/site-packages/sklearn/model_selection/_search.py:916, in BaseSearchCV.fit.<locals>.evaluate_candidates(candidate_params, cv, more_results)
    908 if self.verbose > 0:
    909     print(
    910         "Fitting {0} folds for each of {1} candidates,"
    911         " totalling {2} fits".format(
    912             n_splits, n_candidates, n_candidates * n_splits
    913         )
    914     )
--> 916 out = parallel(
    917     delayed(_fit_and_score)(
    918         clone(base_estimator),
    919         X,
    920         y,
    921         train=train,
    922         test=test,
    923         parameters=parameters,
    924         split_progress=(split_idx, n_splits),
    925         candidate_progress=(cand_idx, n_candidates),
    926         **fit_and_score_kwargs,
    927     )
    928     for (cand_idx, parameters), (split_idx, (train, test)) in product(
    929         enumerate(candidate_params),
    930         enumerate(cv.split(X, y, **routed_params.splitter.split)),
    931     )
    932 )
    934 if len(out) < 1:
    935     raise ValueError(
    936         "No fits were performed. "
    937         "Was the CV iterator empty? "
    938         "Were there no candidates?"
    939     )

File ~/anaconda3/envs/bearing/lib/python3.11/site-packages/sklearn/utils/parallel.py:67, in Parallel.__call__(self, iterable)
     62 config = get_config()
     63 iterable_with_config = (
     64     (_with_config(delayed_func, config), args, kwargs)
     65     for delayed_func, args, kwargs in iterable
     66 )
---> 67 return super().__call__(iterable_with_config)

File ~/anaconda3/envs/bearing/lib/python3.11/site-packages/joblib/parallel.py:1918, in Parallel.__call__(self, iterable)
   1916     output = self._get_sequential_output(iterable)
   1917     next(output)
-> 1918     return output if self.return_generator else list(output)
   1920 # Let's create an ID that uniquely identifies the current call. If the
   1921 # call is interrupted early and that the same instance is immediately
   1922 # re-used, this id will be used to prevent workers that were
   1923 # concurrently finalizing a task from the previous call to run the
   1924 # callback.
   1925 with self._lock:

File ~/anaconda3/envs/bearing/lib/python3.11/site-packages/joblib/parallel.py:1847, in Parallel._get_sequential_output(self, iterable)
   1845 self.n_dispatched_batches += 1
   1846 self.n_dispatched_tasks += 1
-> 1847 res = func(*args, **kwargs)
   1848 self.n_completed_tasks += 1
   1849 self.print_progress()

File ~/anaconda3/envs/bearing/lib/python3.11/site-packages/sklearn/utils/parallel.py:129, in _FuncWrapper.__call__(self, *args, **kwargs)
    127     config = {}
    128 with config_context(**config):
--> 129     return self.function(*args, **kwargs)

File ~/anaconda3/envs/bearing/lib/python3.11/site-packages/sklearn/model_selection/_validation.py:887, in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, score_params, return_train_score, return_parameters, return_n_test_samples, return_times, return_estimator, split_progress, candidate_progress, error_score)
    883     estimator = estimator.set_params(**clone(parameters, safe=False))
    885 start_time = time.time()
--> 887 X_train, y_train = _safe_split(estimator, X, y, train)
    888 X_test, y_test = _safe_split(estimator, X, y, test, train)
    890 result = {}

File ~/anaconda3/envs/bearing/lib/python3.11/site-packages/sklearn/utils/metaestimators.py:158, in _safe_split(estimator, X, y, indices, train_indices)
    156         X_subset = X[np.ix_(indices, train_indices)]
    157 else:
--> 158     X_subset = _safe_indexing(X, indices)
    160 if y is not None:
    161     y_subset = _safe_indexing(y, indices)

File ~/anaconda3/envs/bearing/lib/python3.11/site-packages/sklearn/utils/__init__.py:411, in _safe_indexing(X, indices, axis)
    409     return _polars_indexing(X, indices, indices_dtype, axis=axis)
    410 elif hasattr(X, "shape"):
--> 411     return _array_indexing(X, indices, indices_dtype, axis=axis)
    412 else:
    413     return _list_indexing(X, indices, indices_dtype)

File ~/anaconda3/envs/bearing/lib/python3.11/site-packages/sklearn/utils/__init__.py:208, in _array_indexing(array, key, key_dtype, axis)
    206 if isinstance(key, tuple):
    207     key = list(key)
--> 208 return array[key, ...] if axis == 0 else array[:, key]

File ~/anaconda3/envs/bearing/lib/python3.11/site-packages/ceruleo/dataset/ts_dataset.py:118, in AbstractTimeSeriesDataset.__getitem__(self, i)
    116 if isinstance(i, Iterable):
    117     if not all(isinstance(item, (int, np.integer)) for item in i):
--> 118         raise ValueError("Invalid iterable index passed")
    120     return FoldedDataset(self, i)
    121 else:

ValueError: Invalid iterable index passed

In a nutshell, the __getitem__ method of the AbstractTimeSeriesDataset class raises a ValueError because the indices passed to it to select a life (or a set of lives) from the dataset are not integers. However, I have no control over this step because, I guess, it is all done inside GridSearchCV.
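Reading the traceback, a plausible explanation is that `_array_indexing` evaluates `array[key, ...]`, which hands `__getitem__` the tuple `(key, Ellipsis)` rather than the fold indices themselves. A tuple is iterable, but its items are not integers, so the check fails. The sketch below reproduces that with a hypothetical `TinyDataset` whose check is a simplified copy of the one shown in the traceback (plain `int` only, no `np.integer`).

```python
from collections.abc import Iterable


class TinyDataset:
    """Hypothetical stand-in for a dataset of lives; not ceruleo code."""

    def __init__(self, lives):
        self.lives = lives

    def __getitem__(self, i):
        if isinstance(i, Iterable):
            # Simplified version of the check shown in the traceback.
            if not all(isinstance(item, int) for item in i):
                raise ValueError("Invalid iterable index passed")
            return [self.lives[item] for item in i]
        return self.lives[i]


ds = TinyDataset(["life0", "life1", "life2"])
fold = [0, 2]

print(ds[fold])  # ['life0', 'life2']

try:
    ds[fold, ...]  # Python passes the tuple (fold, Ellipsis) to __getitem__
except ValueError as exc:
    error = str(exc)
    print(error)  # Invalid iterable index passed
```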

lucianolorenti commented 4 months ago

Are you using the latest version of Ceruleo?

I ask because, right now, the index access is a bit different from the one you have:

File ~/anaconda3/envs/bearing/lib/python3.11/site-packages/ceruleo/dataset/ts_dataset.py:118, in AbstractTimeSeriesDataset.__getitem__(self, i)
    116 if isinstance(i, Iterable):
    117     if not all(isinstance(item, (int, np.integer)) for item in i):
--> 118         raise ValueError("Invalid iterable index passed")
    120     return FoldedDataset(self, i)
    121 else:

https://github.com/lucianolorenti/ceruleo/blob/4cc7edbfb50ca03819613170b22bad179bb07266/ceruleo/dataset/ts_dataset.py#L139

FrizzoDavide commented 4 months ago

OK, I had an old version of ceruleo. I have now installed the current version (3.0.3) and the code is working.
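For anyone hitting the same errors, a quick way to check both versions before digging further. This is a small sketch using only the standard library; it reports whatever is installed in the current environment, and the distribution name "ceruleo" is assumed to match the name used by pip.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of a distribution, or a marker if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return "not installed"


# Distribution names as used by pip; "ceruleo" is an assumed name.
for name in ("scikit-learn", "ceruleo"):
    print(f"{name}: {installed_version(name)}")
```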

Thanks a lot for the help

lucianolorenti / ceruleo

Bug Report: `GridSearchCV` does not work with Scalers #36