scikit-learn-contrib / DESlib

A Python library for dynamic classifier and ensemble selection
BSD 3-Clause "New" or "Revised" License

Random subspace method #223

Closed jayahm closed 3 years ago

jayahm commented 3 years ago

Hi

I was trying to generate a pool of classifiers based on the random subspace method using BaggingClassifier, but got this error:

I used exactly your example code and only modified the bagging part.

Here is the code: https://www.dropbox.com/s/z4iseijtawb53ey/plot_comparing_dynamic_static.ipynb?dl=0
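The notebook is behind the link above, but a minimal sketch of the likely setup (assumed, not the notebook's exact code) shows how `BaggingClassifier` with `max_features < 1.0` produces the mismatch: the ensemble remaps features internally, but each base estimator alone only accepts its own feature subset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical reproduction: bagging with max_features < 1.0 implements
# the random subspace method, so each tree trains on 10 of the 20 features.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
pool = BaggingClassifier(DecisionTreeClassifier(),
                         n_estimators=10,
                         max_features=0.5,  # each tree sees 10 of 20 features
                         random_state=0).fit(X, y)

# The ensemble maps features internally, so predicting through it works:
assert pool.predict(X).shape == y.shape

# But a base estimator pulled out of the ensemble expects only its own
# 10 features; scoring it on the full X raises the ValueError below:
mismatch = False
try:
    pool.estimators_[0].predict(X)
except ValueError:
    mismatch = True
print(mismatch)
```

DESlib scores each base classifier individually on the full `X`, which is why the traceback ends inside `DecisionTreeClassifier.predict`.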

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-43bece03c684> in <module>
     70 scores = []
     71 for method, name in zip(methods, names):
---> 72     method.fit(X_dsel, y_dsel)
     73     scores.append(method.score(X_test, y_test))
     74     print("Classification accuracy {} = {}"

~\anaconda3\lib\site-packages\deslib\static\single_best.py in fit(self, X, y)
     80             y_encoded = self.enc_.transform(y)
     81 
---> 82         performances = self._estimate_performances(X, y_encoded)
     83         self.best_clf_index_ = np.argmax(performances)
     84         self.best_clf_ = self.pool_classifiers_[self.best_clf_index_]

~\anaconda3\lib\site-packages\deslib\static\single_best.py in _estimate_performances(self, X, y)
     90         for idx, clf in enumerate(self.pool_classifiers_):
     91             scorer = check_scoring(clf, self.scoring)
---> 92             performances[idx] = scorer(clf, X, y)
     93         return performances
     94 

~\anaconda3\lib\site-packages\sklearn\metrics\_scorer.py in _passthrough_scorer(estimator, *args, **kwargs)
    369 def _passthrough_scorer(estimator, *args, **kwargs):
    370     """Function that wraps estimator.score"""
--> 371     return estimator.score(*args, **kwargs)
    372 
    373 

~\anaconda3\lib\site-packages\sklearn\base.py in score(self, X, y, sample_weight)
    367         """
    368         from .metrics import accuracy_score
--> 369         return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
    370 
    371 

~\anaconda3\lib\site-packages\sklearn\tree\_classes.py in predict(self, X, check_input)
    417         """
    418         check_is_fitted(self)
--> 419         X = self._validate_X_predict(X, check_input)
    420         proba = self.tree_.predict(X)
    421         n_samples = X.shape[0]

~\anaconda3\lib\site-packages\sklearn\tree\_classes.py in _validate_X_predict(self, X, check_input)
    389                              "match the input. Model n_features is %s and "
    390                              "input n_features is %s "
--> 391                              % (self.n_features_, n_features))
    392 
    393         return X

ValueError: Number of features of the model must match the input. Model n_features is 10 and input n_features is 20

Menelau commented 3 years ago

Hello,

The library does not support this functionality yet. All models are required to have exactly the same input features.

jayahm commented 3 years ago

I see. I saw some papers that used the random subspace method, so I was thinking of trying it. Any suggestion on how to do it even though the library does not support it yet?

Menelau commented 3 years ago

@jayahm Hello,

For that to work you need to implement the random subspace method outside of sklearn's BaggingClassifier. In this case, each base model should be a sklearn pipeline in which the first step is a transformation that selects a subset of the features, and the second step is the classifier. That way, each base model in the pool_classifiers list receives the same X as input, and each one handles the subspace selection by itself.

The problem with using BaggingClassifier alone is that each base model in the generated ensemble receives a subset of X directly as input, instead of receiving the full feature set and handling the subspace selection internally. Since there is no way to communicate which features were used to train each base model, the DS technique cannot know how to correctly distribute the features among the base models.
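The pipeline-based workaround described above can be sketched as follows (a minimal illustration, not an official DESlib example; the subset sizes and classifier choice are assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
rng = np.random.RandomState(0)

# Each base model is a Pipeline: step 1 selects a random feature subset,
# step 2 is the classifier. Every pipeline therefore accepts the full X.
pool_classifiers = []
for i in range(10):
    subset = rng.choice(X.shape[1], size=10, replace=False)
    model = Pipeline([
        # the transformer remembers its own feature indices via the
        # lambda's default argument
        ("subspace", FunctionTransformer(lambda X, idx=subset: X[:, idx])),
        ("tree", DecisionTreeClassifier(random_state=i)),
    ])
    pool_classifiers.append(model.fit(X, y))

# All pipelines take the same 20-feature X, so the pool can be handed to
# a DS technique, e.g. KNORAU(pool_classifiers=pool_classifiers).
print(all(m.predict(X).shape == y.shape for m in pool_classifiers))
```

Because the subspace selection lives inside each pipeline, DESlib can score every base model on the same full-width X without any feature bookkeeping.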

I will work on an example showing how to properly use the random subspace method with DESlib, considering the current versions of sklearn and DESlib, as well as other feature transformations that could be applied to specific base models. I will let you know when it is ready.

jayahm commented 3 years ago

Yes, an example would be very helpful, as I could not fully follow the explanation above.

jayahm commented 3 years ago

You closed the issue. Does that mean you have created the example?

Menelau commented 3 years ago

Not yet. It was closed because this example is already being tracked in issue #218, so there is no need to keep both open.