Closed FrizzoDavide closed 4 months ago
Hi!
I wasn't able to reproduce the error. Which version of scikit-learn are you using? I am using 1.4.1.post1.
I am using the following snippet to test it:
from ceruleo.dataset.catalog.CMAPSS import CMAPSSDataset, sensor_indices
from ceruleo.transformation import Transformer
from ceruleo.transformation.functional.pipeline.pipeline import make_pipeline
from ceruleo.transformation.features.selection import ByNameFeatureSelector
from ceruleo.transformation.features.scalers import MinMaxScaler
from ceruleo.models.sklearn import (
    CeruleoRegressor,
    EstimatorWrapper,
    TimeSeriesWindowTransformer,
    predict,
    train_model,
    CeruleoMetricWrapper
)
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
train_dataset = CMAPSSDataset(train=True, models='FD001')
test_dataset = CMAPSSDataset(train=False, models='FD001')[15:30]
# Define list of sensor measurement features
FEATURES = [train_dataset[0].columns[i] for i in sensor_indices]
transformer = Transformer(
    pipelineX=make_pipeline(
        ByNameFeatureSelector(features=FEATURES),
        MinMaxScaler(range=(-1, 1))
    ),
    pipelineY=make_pipeline(
        ByNameFeatureSelector(features=['RUL']),
    )
)
regressor_gs = CeruleoRegressor(
    TimeSeriesWindowTransformer(
        transformer,
        window_size=32,
        padding=True,
        step=1),
    Ridge(alpha=15))
grid_search = GridSearchCV(
    estimator=regressor_gs,
    param_grid={
        'ts_window_transformer__window_size': [5, 10],
        'regressor': [Ridge(alpha=15), RandomForestRegressor(max_depth=5)]
    },
    scoring=CeruleoMetricWrapper('neg_mean_absolute_error'),
    verbose=5
)
grid_search.fit(train_dataset)
I checked my scikit-learn version and I have 1.2.2. So I upgraded it to 1.4.1.post1 and tried to run the code snippet you provided, but I am getting a new error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[7], line 20
1 regressor_gs = CeruleoRegressor(
2 TimeSeriesWindowTransformer(
3 transformer,
(...)
6 step=1),
7 Ridge(alpha=15))
9 grid_search = GridSearchCV(
10 estimator=regressor_gs,
11 param_grid={
(...)
16 verbose=5
17 )
---> 20 grid_search.fit(train_dataset)
File ~/anaconda3/envs/bearing/lib/python3.11/site-packages/sklearn/base.py:1474, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
1467 estimator._validate_params()
1469 with config_context(
1470 skip_parameter_validation=(
1471 prefer_skip_nested_validation or global_skip_validation
1472 )
1473 ):
-> 1474 return fit_method(estimator, *args, **kwargs)
File ~/anaconda3/envs/bearing/lib/python3.11/site-packages/sklearn/model_selection/_search.py:970, in BaseSearchCV.fit(self, X, y, **params)
964 results = self._format_results(
965 all_candidate_params, n_splits, all_out, all_more_results
966 )
968 return results
--> 970 self._run_search(evaluate_candidates)
972 # multimetric is determined here because in the case of a callable
973 # self.scoring the return type is only known after calling
974 first_test_score = all_out[0]["test_scores"]
File ~/anaconda3/envs/bearing/lib/python3.11/site-packages/sklearn/model_selection/_search.py:1527, in GridSearchCV._run_search(self, evaluate_candidates)
1525 def _run_search(self, evaluate_candidates):
1526 """Search all candidates in param_grid"""
-> 1527 evaluate_candidates(ParameterGrid(self.param_grid))
File ~/anaconda3/envs/bearing/lib/python3.11/site-packages/sklearn/model_selection/_search.py:916, in BaseSearchCV.fit.<locals>.evaluate_candidates(candidate_params, cv, more_results)
908 if self.verbose > 0:
909 print(
910 "Fitting {0} folds for each of {1} candidates,"
911 " totalling {2} fits".format(
912 n_splits, n_candidates, n_candidates * n_splits
913 )
914 )
--> 916 out = parallel(
917 delayed(_fit_and_score)(
918 clone(base_estimator),
919 X,
920 y,
921 train=train,
922 test=test,
923 parameters=parameters,
924 split_progress=(split_idx, n_splits),
925 candidate_progress=(cand_idx, n_candidates),
926 **fit_and_score_kwargs,
927 )
928 for (cand_idx, parameters), (split_idx, (train, test)) in product(
929 enumerate(candidate_params),
930 enumerate(cv.split(X, y, **routed_params.splitter.split)),
931 )
932 )
934 if len(out) < 1:
935 raise ValueError(
936 "No fits were performed. "
937 "Was the CV iterator empty? "
938 "Were there no candidates?"
939 )
File ~/anaconda3/envs/bearing/lib/python3.11/site-packages/sklearn/utils/parallel.py:67, in Parallel.__call__(self, iterable)
62 config = get_config()
63 iterable_with_config = (
64 (_with_config(delayed_func, config), args, kwargs)
65 for delayed_func, args, kwargs in iterable
66 )
---> 67 return super().__call__(iterable_with_config)
File ~/anaconda3/envs/bearing/lib/python3.11/site-packages/joblib/parallel.py:1918, in Parallel.__call__(self, iterable)
1916 output = self._get_sequential_output(iterable)
1917 next(output)
-> 1918 return output if self.return_generator else list(output)
1920 # Let's create an ID that uniquely identifies the current call. If the
1921 # call is interrupted early and that the same instance is immediately
1922 # re-used, this id will be used to prevent workers that were
1923 # concurrently finalizing a task from the previous call to run the
1924 # callback.
1925 with self._lock:
File ~/anaconda3/envs/bearing/lib/python3.11/site-packages/joblib/parallel.py:1847, in Parallel._get_sequential_output(self, iterable)
1845 self.n_dispatched_batches += 1
1846 self.n_dispatched_tasks += 1
-> 1847 res = func(*args, **kwargs)
1848 self.n_completed_tasks += 1
1849 self.print_progress()
File ~/anaconda3/envs/bearing/lib/python3.11/site-packages/sklearn/utils/parallel.py:129, in _FuncWrapper.__call__(self, *args, **kwargs)
127 config = {}
128 with config_context(**config):
--> 129 return self.function(*args, **kwargs)
File ~/anaconda3/envs/bearing/lib/python3.11/site-packages/sklearn/model_selection/_validation.py:887, in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, score_params, return_train_score, return_parameters, return_n_test_samples, return_times, return_estimator, split_progress, candidate_progress, error_score)
883 estimator = estimator.set_params(**clone(parameters, safe=False))
885 start_time = time.time()
--> 887 X_train, y_train = _safe_split(estimator, X, y, train)
888 X_test, y_test = _safe_split(estimator, X, y, test, train)
890 result = {}
File ~/anaconda3/envs/bearing/lib/python3.11/site-packages/sklearn/utils/metaestimators.py:158, in _safe_split(estimator, X, y, indices, train_indices)
156 X_subset = X[np.ix_(indices, train_indices)]
157 else:
--> 158 X_subset = _safe_indexing(X, indices)
160 if y is not None:
161 y_subset = _safe_indexing(y, indices)
File ~/anaconda3/envs/bearing/lib/python3.11/site-packages/sklearn/utils/__init__.py:411, in _safe_indexing(X, indices, axis)
409 return _polars_indexing(X, indices, indices_dtype, axis=axis)
410 elif hasattr(X, "shape"):
--> 411 return _array_indexing(X, indices, indices_dtype, axis=axis)
412 else:
413 return _list_indexing(X, indices, indices_dtype)
File ~/anaconda3/envs/bearing/lib/python3.11/site-packages/sklearn/utils/__init__.py:208, in _array_indexing(array, key, key_dtype, axis)
206 if isinstance(key, tuple):
207 key = list(key)
--> 208 return array[key, ...] if axis == 0 else array[:, key]
File ~/anaconda3/envs/bearing/lib/python3.11/site-packages/ceruleo/dataset/ts_dataset.py:118, in AbstractTimeSeriesDataset.__getitem__(self, i)
116 if isinstance(i, Iterable):
117 if not all(isinstance(item, (int, np.integer)) for item in i):
--> 118 raise ValueError("Invalid iterable index passed")
120 return FoldedDataset(self, i)
121 else:
ValueError: Invalid iterable index passed
In a nutshell, the __getitem__ method of the AbstractTimeSeriesDataset class raises a ValueError because the indices passed to it to select a life (or a set of lives) from the dataset are not integers. However, I have no control over this step because, I guess, it is all done inside GridSearchCV.
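One plausible reading of the traceback above: the last scikit-learn frame evaluates `array[key, ...]`, and in Python that expression hands `__getitem__` the tuple `(key, Ellipsis)` rather than the index array itself. The sketch below uses a hypothetical `ToyDataset` (not ceruleo's class) that copies only the integer check visible in the traceback, to show why such a tuple fails the check while a plain integer array passes.

```python
from collections.abc import Iterable

import numpy as np


class ToyDataset:
    """Hypothetical stand-in that mimics only the index check from the traceback."""

    def __init__(self, n_lives):
        # A `shape` attribute is what sends scikit-learn down the array-indexing path.
        self.shape = (n_lives,)

    def __getitem__(self, i):
        if isinstance(i, Iterable):
            if not all(isinstance(item, (int, np.integer)) for item in i):
                raise ValueError("Invalid iterable index passed")
            return [int(item) for item in i]
        return i


ds = ToyDataset(5)
indices = np.array([0, 2, 3])

# A plain integer array passes: iterating it yields NumPy integers.
print(ds[indices])

# `ds[indices, ...]` passes the tuple (indices, Ellipsis); its items are an
# ndarray and Ellipsis, neither of which is an integer, so the check fails.
try:
    ds[indices, ...]
except ValueError as exc:
    print("ValueError:", exc)
```

This only illustrates the mechanism; how a given ceruleo release actually handles such keys depends on the installed version.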
Are you using the latest version of Ceruleo? Right now, the index access is a bit different from the one you have:
File ~/anaconda3/envs/bearing/lib/python3.11/site-packages/ceruleo/dataset/ts_dataset.py:118, in AbstractTimeSeriesDataset.__getitem__(self, i)
116 if isinstance(i, Iterable):
117 if not all(isinstance(item, (int, np.integer)) for item in i):
--> 118 raise ValueError("Invalid iterable index passed")
120 return FoldedDataset(self, i)
121 else:
OK, I had an old version of ceruleo. I have now installed the current version (3.0.3) and the code is working.
Thanks a lot for the help!
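Since both problems in this thread came down to installed versions, here is a minimal sketch for printing them before reporting an issue. The helper name `installed_version` is made up for illustration, and the distribution names `scikit-learn` and `ceruleo` are assumed to match what pip installed.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name in ("scikit-learn", "ceruleo"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```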
The GridSearchCV class, used together with CeruleoMetricWrapper, gives an error when the dataset's Transformer includes a scaler. Let's consider the following example:
Load CMAPSSDataset and define FEATURES as the most relevant sensor measurement features:
Define the Transformer. Note that here I am using the MinMaxScaler to scale the data to the range (-1, 1):
Use GridSearchCV to compare different regression models:
The output returned after the grid_search.fit(train_dataset) command is launched is:
And then the final error message is:
So there are probably two operands that should be subtracted from one another, but since both are strings this results in an error.
Looking at the additional details in the long error message, we can find:
As suggested in the error message, I added the error_score='raise' input argument to GridSearchCV to get a more detailed error explanation. Looking at the new error message, I think that the source of the error is in the transform method of the MinMaxScaler class contained in ceruleo.transformation.features.scalers:
The line reported above is the subtraction in which self.data_max and self.data_min are both strings, which produces the TypeError: unsupported operand type(s) for -: 'str' and 'str' reported above.
I tried to run the code again after placing an import ipdb; ipdb.set_trace() inside the transform function, but for some reason the code did not stop there as it normally does when using ipdb.
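The TypeError quoted above can be reproduced in isolation. The sketch below is a generic min-max scaling formula, not ceruleo's actual `transform` code; the function `min_max_transform` and its parameters are assumptions made for illustration. It only shows that subtracting two strings yields exactly that message, while numeric bounds scale as expected.

```python
import numpy as np


def min_max_transform(X, data_min, data_max, feature_range=(-1, 1)):
    """Generic min-max scaling (illustrative, not the library's implementation)."""
    low, high = feature_range
    span = data_max - data_min  # this subtraction fails if both bounds are strings
    return (X - data_min) / span * (high - low) + low


X = np.array([0.0, 5.0, 10.0])

# Numeric bounds: values are mapped linearly onto the range (-1, 1).
scaled = min_max_transform(X, 0.0, 10.0)
print(scaled)

# String bounds: Python cannot subtract str from str.
try:
    min_max_transform(X, "0.0", "10.0")
except TypeError as exc:
    message = str(exc)
    print("TypeError:", message)
```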
I was also able to access the data_max and data_min attributes with transformer.pipelineX.final_step.data_max and transformer.pipelineX.final_step.data_min, and I was also able to do:
without any error.
So I do not really have a clue why this bug appears.
Obviously, running the code without MinMaxScaler, so with:
it works without any errors.
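For readers unfamiliar with the error_score='raise' option used in the report, here is a small self-contained sketch of what it changes. `FailingRegressor` is a made-up estimator whose `fit` always raises; with `error_score="raise"`, GridSearchCV lets the original exception propagate instead of recording a failed fit, which makes the real traceback visible.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.model_selection import GridSearchCV


class FailingRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical estimator whose fit always fails, for illustration only."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X, y):
        raise TypeError("unsupported operand type(s) for -: 'str' and 'str'")

    def predict(self, X):
        return np.zeros(len(X))


X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)

search = GridSearchCV(
    FailingRegressor(),
    param_grid={"alpha": [0.1, 1.0]},
    cv=2,
    error_score="raise",  # re-raise the first failure instead of scoring it as NaN
)

caught = None
try:
    search.fit(X, y)
except TypeError as exc:
    caught = exc
    print("Original exception surfaced:", exc)
```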