scikit-learn-contrib / DESlib

A Python library for dynamic classifier and ensemble selection
BSD 3-Clause "New" or "Revised" License

Problems with probabilities and outputs #261

Closed atifov closed 2 years ago

atifov commented 2 years ago

I have been using the Deslib library to classify a binary problem and used various DES algorithms. When I applied the model on the test dataset, I ran into some problems.

According to my understanding, for a binary classification problem, 0.5 is the default threshold: if `predict_proba` for the positive class is higher than 0.5, the instance should be classified as 1, otherwise 0.

However, for some instances where the predicted probabilities were higher than 0.5, I got a 0 (instead of 1), and where the predicted probabilities were less than 0.5, I got a 1 (instead of 0).
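For reference, this is the consistency I expected, sketched with a plain scikit-learn classifier on synthetic data (the dataset and estimator here are stand-ins, not my actual code):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the real dataset
X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression().fit(X, y)

labels = clf.predict(X)
proba = clf.predict_proba(X)[:, 1]

# predict() should agree with thresholding P(class 1) at 0.5
mismatches = int(np.sum(labels != (proba > 0.5)))
print(mismatches)  # 0 for scikit-learn estimators
```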

I have never seen this kind of phenomenon whilst using the scikit-learn library, therefore I would like to know if this is normal behaviour in DES?

Please note that I am using the latest version (0.3.5) of DESlib.

Menelau commented 2 years ago

No, this is not normal, and there may be a problem with label encoding in your simulations. However, I cannot help without a simple code example that reproduces it.

atifov commented 2 years ago

Please note that I have used the same code, with adjustments, on several other scikit-learn models, including both voting and stacked classifiers. This is the only time I have run into this kind of situation.

Following is the main code:

```python
# Imports added for completeness (RRC here is DESlib's probabilistic DES method)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from imblearn.ensemble import RUSBoostClassifier
from catboost import CatBoostClassifier
from deslib.des.probabilistic import RRC

X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=0)

classifiers = [LogisticRegression(), RandomForestClassifier(), RUSBoostClassifier(),
               CatBoostClassifier(), DecisionTreeClassifier()]
for c in classifiers:
    c.fit(X_train, Y_train)

model = RRC(pool_classifiers=classifiers, random_state=0)
model.fit(X_train, Y_train)

y_pred = model.predict(X_test)
y_test_prob = model.predict_proba(X_test)[:, 1]
```
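To show which instances are affected, here is a small check with made-up numbers (in practice, `y_pred` and `y_test_prob` would come from the code above):

```python
import numpy as np

# Hypothetical outputs illustrating the reported mismatch
y_pred = np.array([0, 1, 1, 0])
y_test_prob = np.array([0.7, 0.8, 0.3, 0.2])  # P(class 1)

# Indices where the predicted label contradicts the 0.5 threshold
bad = np.where(y_pred != (y_test_prob > 0.5).astype(int))[0]
print(bad)  # -> [0 2]
```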

Menelau commented 2 years ago

Thanks for the code example! I will check it tomorrow (well it is quite late around here).

By the way, in the meantime can you try installing the new version of the library (current master code) and see if this still happens? I suspect this is because the previous version's predict used "hard voting" while predict_proba used "soft voting". Since these are very different types of classifiers and they were not calibrated (a decision tree, for example, will only output 0 or 1 probabilities in this case), the combination rule at the end was not exactly the same, which could be the problem you are experiencing.
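The suspected mechanism can be sketched with a toy pool (the numbers are made up): a single uncalibrated member that outputs hard 0/1 probabilities can pull the soft-vote average the other way, even when the majority of hard label votes disagrees.

```python
import numpy as np

# Per-classifier class probabilities for one test sample
probas = np.array([
    [0.45, 0.55],  # member A leans toward class 1
    [0.40, 0.60],  # member B leans toward class 1
    [1.00, 0.00],  # uncalibrated tree votes hard for class 0
])

hard_vote = np.bincount(np.argmax(probas, axis=1)).argmax()  # majority of labels
soft_vote = np.argmax(probas.mean(axis=0))                   # averaged probabilities

print(hard_vote, soft_vote)  # -> 1 0  (the two rules disagree)
```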

atifov commented 2 years ago

Thanks for the prompt reply.

Please note that I have also used stacked classifiers from the DESlib library with the same pool and did not encounter any problems. Moreover, all five classifiers provide the `predict_proba` method.

As you suggested, in the meantime I will reinstall the latest version of the DESlib library and re-run the code to see if that solves the problem.

Menelau commented 2 years ago

Please try installing the master code using `pip install git+https://github.com/scikit-learn-contrib/DESlib` and see if the problem persists. If that solves your problem, I will try to release a new version (0.3.6) with this fix tomorrow, since this is quite a large change in behavior and needs to be in the PyPI package asap.

atifov commented 2 years ago

Following your advice and installing the master version, the issue has been resolved. Thanks a lot.