scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License
58.85k stars 25.14k forks source link

RFECV race condition on estimator #23533

Open cpiber opened 2 years ago

cpiber commented 2 years ago

In RFECV, at https://github.com/scikit-learn/scikit-learn/blob/fb3ed90fb501a755ce2938fb566bd0f6e2235054/sklearn/feature_selection/_rfe.py#L723-L726 the estimator is passed as-is to the fit function. Since fit modifies the object without copying, this is prone to race conditions (see example below).

Contrast this to BaseSearchCV, where the estimator is properly cloned: https://github.com/scikit-learn/scikit-learn/blob/fb3ed90fb501a755ce2938fb566bd0f6e2235054/sklearn/model_selection/_search.py#L823-L833


On my system, with parameter `n_jobs=-1`, I got the following error: ``` 5 fits failed with the following error: Traceback (most recent call last): File ".../site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score estimator.fit(X_train, y_train, **fit_params) File ".../site-packages/sklearn/feature_selection/_rfe.py", line 723, in fit scores = parallel( File ".../site-packages/joblib/parallel.py", line 1056, in __call__ self.retrieve() File ".../site-packages/joblib/parallel.py", line 935, in retrieve self._output.extend(job.get(timeout=self.timeout)) File "/usr/lib/python3.10/multiprocessing/pool.py", line 771, in get raise self._value File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker result = (True, func(*args, **kwds)) File ".../site-packages/joblib/_parallel_backends.py", line 595, in __call__ return self.func(*args, **kwargs) File ".../site-packages/joblib/parallel.py", line 262, in __call__ return [func(*args, **kwargs) File ".../site-packages/joblib/parallel.py", line 262, in return [func(*args, **kwargs) File ".../site-packages/sklearn/utils/fixes.py", line 117, in __call__ return self.function(*args, **kwargs) File ".../site-packages/sklearn/feature_selection/_rfe.py", line 37, in _rfe_single_fit return rfe._fit( File ".../site-packages/sklearn/feature_selection/_rfe.py", line 327, in _fit self.scores_.append(step_score(self.estimator_, features)) File ".../site-packages/sklearn/feature_selection/_rfe.py", line 40, in lambda estimator, features: _score( File ".../site-packages/sklearn/model_selection/_validation.py", line 767, in _score scores = scorer(estimator, X_test, y_test) File ".../site-packages/sklearn/metrics/_scorer.py", line 219, in __call__ return self._score( File ".../site-packages/sklearn/metrics/_scorer.py", line 261, in _score y_pred = method_caller(estimator, "predict", X) File ".../site-packages/sklearn/metrics/_scorer.py", line 71, in _cached_call return getattr(estimator, method)(*args, **kwargs) File ".../site-packages/sklearn/ensemble/_forest.py", line 835, in predict return self.classes_.take(np.argmax(proba, axis=1), axis=0) AttributeError: 'list' object has no attribute 'take' ```

It is generated from the following snippet:

  rf = RandomForestClassifier()
  rfecv = RFECV(rf, scoring='accuracy', n_jobs=-1)
  rfecv.fit(X_train, y_train)

The error appears to happen because n_outputs_ is not constant between runs. The error does not happen without parallelism.

glemaitre commented 2 years ago

We should be cloning self.estimator in the parallel call I think. We usually always do that. Indeed, we don't want to mutate self.estimator if we follow our own API.