scikit-learn-contrib / DESlib

A Python library for dynamic classifier and ensemble selection
BSD 3-Clause "New" or "Revised" License

Can't obtain oracle scores when using Random Subspaces as the pool method #238

Closed: amorimlb closed this 3 years ago

amorimlb commented 3 years ago

Hi, it's more of a question than an issue, actually (or maybe it's an issue too). I was able to use the following lines to calculate the oracle score for a pool of classifiers generated with the Bagging method:

...
model_bagging = BaggingClassifier(base_estimator=Perceptron(random_state=0), 
                                  n_estimators=n, random_state=0, bootstrap=True, max_features=1.0, n_jobs=-1)
model_bagging.fit(X_train,y_train)
oracle = Oracle(model_bagging).fit(X_train, y_train)
oracle_score = oracle.score(X_test, y_test)
print(oracle_score)

It works nicely. But, whenever I use the BaggingClassifier class to create a Random Subspace of the features, I get an error as follows:

...
model_randsubspaces = BaggingClassifier(base_estimator=Perceptron(random_state=0), 
                                  n_estimators=n, random_state=0, bootstrap=False, max_features=0.5, n_jobs=-1)
model_randsubspaces.fit(X_train,y_train)
oracle = Oracle(model_randsubspaces).fit(X_train, y_train)
oracle_score = oracle.score(X_test, y_test)
print(oracle_score)

This produces a long traceback pointing to the "oracle.score" line in my code and ending with the following line:

ValueError: X has 9 features per sample; expecting 4

My data has 9 features in total, but I am doing a 50% feature reduction when I run the Random Subspace method, as can be seen above. In this case, how could I calculate the oracle?

Menelau commented 3 years ago

Hello,

Unfortunately, BaggingClassifier cannot be used as an implementation of the Random Subspace method in this case. The problem is that when a BaggingClassifier generates its list of base models, each model is trained on just a subset of the features in X, instead of receiving the whole set as input and selecting the appropriate features internally. The predict and predict_proba methods of the bagging algorithm can handle that, since the class keeps track of which features were used to train each base model and distributes them correctly during prediction.

However, when you pass a BaggingClassifier instance to DESlib, DESlib only sees the list of trained models and accesses each base model individually, with no information about which features of the input X were used to train it. So it always passes the full array, which raises an error. If you take a single base model from the BaggingClassifier instance and pass X_test as its input, you will observe the same error:

model_randsubspaces[0].predict(X_test)
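The mismatch described above can be reproduced in isolation. The following is only a sketch under stated assumptions: the data set is a synthetic 9-feature stand-in generated with make_classification, and the variable names are illustrative. It shows that a single base model rejects the full matrix, while the column indices recorded in estimators_features_ make the call valid.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import Perceptron

# Synthetic stand-in for the 9-feature data set in the question (an assumption).
X, y = make_classification(n_samples=200, n_features=9, n_informative=5,
                           random_state=0)

# The estimator is passed positionally because its keyword name differs across
# scikit-learn versions (base_estimator in older releases, estimator in newer ones).
pool = BaggingClassifier(Perceptron(random_state=0), n_estimators=10,
                         bootstrap=False, max_features=0.5, random_state=0)
pool.fit(X, y)

first_model = pool.estimators_[0]
first_feats = pool.estimators_features_[0]  # column indices seen by first_model

# The full matrix has 9 columns, but the base model was trained on only 4 of them.
try:
    first_model.predict(X)
    full_input_ok = True
except ValueError as err:
    full_input_ok = False
    print("Full X rejected:", err)

# Selecting the recorded columns first makes the call valid.
subspace_pred = first_model.predict(X[:, first_feats])
print(len(first_feats), subspace_pred.shape)
```

BaggingClassifier.predict performs this column selection internally for every base model, which is why the ensemble itself works on the full X.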

What you can do in this case, instead of using the BaggingClassifier to get a subset of features, is create your own random subspace method using a scikit-learn pipeline. Each base model then consists of a transformation that selects 50% of the features and passes them to the classification model at the end. That way all base classifiers in the pool receive the same X as input, and the subspace is handled inside the pipeline.
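One possible shape for such a pipeline-based pool is sketched below. It is not the code from the library or from the eventual fix: make_subspace_pool is a hypothetical helper, the data is synthetic, and using ColumnTransformer with "passthrough" as the feature-selection step is just one of several ways to do it. The oracle score is computed by hand at the end so the sketch depends only on scikit-learn and NumPy.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline


def make_subspace_pool(X, y, n_estimators=10, max_features=0.5, random_state=0):
    """Hypothetical helper: fit a pool of pipelines, each on a random feature subset."""
    X = np.asarray(X)
    rng = np.random.default_rng(random_state)
    n_features = X.shape[1]
    n_selected = max(1, int(max_features * n_features))
    pool = []
    for _ in range(n_estimators):
        feats = np.sort(rng.choice(n_features, size=n_selected, replace=False))
        model = Pipeline([
            # Keep only the sampled columns; the remaining ones are dropped.
            ("subspace", ColumnTransformer([("keep", "passthrough", feats.tolist())])),
            ("clf", Perceptron(random_state=0)),
        ])
        pool.append(model.fit(X, y))
    return pool


# Synthetic 9-feature data standing in for the data set in the question.
X, y = make_classification(n_samples=300, n_features=9, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pool = make_subspace_pool(X_train, y_train)

# Every pipeline accepts the full 9-column matrix, so no per-model slicing is needed.
preds = np.array([model.predict(X_test) for model in pool])  # (n_estimators, n_samples)
oracle_score = (preds == y_test).any(axis=0).mean()
print("Oracle score =", oracle_score)
```

Because each element of pool takes the full X, a list like this could in principle be handed to Oracle in place of the BaggingClassifier, as suggested above; that step is not exercised in this sketch.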

amorimlb commented 3 years ago

Hi Rafael! Thank you so much for your quick answer. It seems to be a good idea, but since I have no experience with pipelines, I ended up writing my own oracle, like this:

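import numpy as np

# meta_model is assumed to be the fitted BaggingClassifier (e.g. model_randsubspaces).
base_models = meta_model.estimators_
base_models_feats = meta_model.estimators_features_

# Predict with each base model using only the columns it was trained on.
base_models_preds = []
for model, feats in zip(base_models, base_models_feats):
    X_test_subspace = X_test.iloc[:, feats]  # columns used by this base model
    y_pred = model.predict(X_test_subspace)
    base_models_preds.append(y_pred)

# Convert to an array so that indexing is positional even if y_test is a
# pandas Series with a shuffled index (e.g. after train_test_split).
y_true = np.asarray(y_test)

oracle_hits = []
for i in range(len(y_true)):
    oracle_hit = 0
    for j in range(len(base_models_preds)):
        # One correct base model is enough for the oracle to count a hit.
        if base_models_preds[j][i] == y_true[i]:
            oracle_hit = 1
            break
    oracle_hits.append(oracle_hit)

oracle_score = np.sum(oracle_hits) / len(oracle_hits)
print('Oracle score = ', oracle_score)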

It is somewhat naive but it seems to work. 😁
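For reference, the hit-counting step above can also be written without explicit loops. The sketch below uses made-up predictions for 3 base models on 5 samples (purely illustrative values, not results from the thread) to show the idea: a sample counts as an oracle hit when at least one model in the pool predicts it correctly.

```python
import numpy as np

# Made-up predictions of 3 base models on 5 test samples (rows = models).
base_models_preds = np.array([
    [0, 1, 1, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 1],
])
y_true = np.array([0, 1, 0, 1, 1])

# hits[j, i] is True when model j classifies sample i correctly.
hits = base_models_preds == y_true
# The oracle is correct on a sample if any model in the pool is.
oracle_hits = hits.any(axis=0)
oracle_score = oracle_hits.mean()
print("Oracle score =", oracle_score)
```

Here only the fourth sample is missed by every model, so four of the five samples are oracle hits.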

jayahm commented 3 years ago

Hello, everyone,

Sorry for the interruption.

I think this question is exactly like my question in https://github.com/scikit-learn-contrib/DESlib/issues/223

I believe it would be good to have some code available that performs RSM (the Random Subspace Method) to generate the pool.

Menelau commented 3 years ago

fixed in #254