Why do you use .predict instead of .predict_proba at .fit?

scikit-learn-contrib / DESlib

A Python library for dynamic classifier and ensemble selection

BSD 3-Clause "New" or "Revised" License


Why do you use .predict instead of .predict_proba at .fit? #133

Closed PhilipMay closed 3 years ago

PhilipMay commented 5 years ago

Hi, I saw that your DES implementations (KNORAU, KNORAE and DESP) call .predict on the pool_classifiers. But for a binary classifier, .predict only returns 0 or 1.

Why don't you call .predict_proba to get probabilities and work with them?

I mean, with "normal" voting ensembles, hard voting is usually not as good as soft voting. Isn't that the case here as well? Would it lead to a better result if you worked with probabilities instead of binary values?

I would be happy to get a short explanation.

PS: Thanks a lot for your nice library and your contribution!

Thanks, Philip

Menelau commented 5 years ago

@PhilipMay Hi,

Yes, 'soft' voting is usually better than 'hard' voting. So far we have only considered hard voting (majority voting) just as a matter of simplicity. However, now that the library is becoming more mature, we have plans to allow other combination methods based on probability estimates in the next releases.
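To make the difference concrete, here is a minimal sketch in plain Python. It is not DESlib code: the probabilities are made-up values standing in for what three base classifiers might return from predict_proba for the positive class, and the 0.5 threshold is an assumption about how .predict maps a probability to a label.

```python
from collections import Counter

# Hypothetical positive-class probabilities from three base classifiers
# for a single sample (illustrative values, not real model output).
probas = [0.45, 0.45, 0.95]

# Hard voting: threshold each classifier's output first (assumed to mirror
# what .predict does for a binary classifier), then take the majority label.
hard_labels = [int(p >= 0.5) for p in probas]
hard_vote = Counter(hard_labels).most_common(1)[0][0]

# Soft voting: average the probabilities first, then threshold once.
mean_proba = sum(probas) / len(probas)
soft_vote = int(mean_proba >= 0.5)

print("hard labels:", hard_labels, "-> hard vote:", hard_vote)
print("mean probability:", round(mean_proba, 3), "-> soft vote:", soft_vote)
```

Here two lukewarm classifiers outvote one very confident classifier under hard voting (label 0), while soft voting lets the confident estimate dominate (label 1). Which outcome is better depends on how well calibrated the probabilities are.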

In fact, we have already implemented, in the utils.aggregation module, some utility routines for standard probability-based combination methods (averaging, max, product, etc.). So one of the plans for future releases is to let the user choose the combination method used to aggregate the outputs of the pool of classifiers, instead of always using the "normal" voting scheme.
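As a rough illustration of the combination rules mentioned above (averaging, max, product), the sketch below implements them in plain Python for a single sample. The function names are hypothetical and are not the names used in the library's aggregation module; the probabilities are invented for the example.

```python
import math

def class_scores(probas, reduce_fn):
    """Reduce each class column of per-classifier probabilities to one score."""
    return [reduce_fn(col) for col in zip(*probas)]

def argmax(scores):
    """Index of the highest score (the first one wins on ties)."""
    return scores.index(max(scores))

def average_rule(probas):
    return argmax(class_scores(probas, lambda col: sum(col) / len(col)))

def max_rule(probas):
    return argmax(class_scores(probas, max))

def product_rule(probas):
    return argmax(class_scores(probas, math.prod))

# One row per base classifier, one column per class (made-up values).
probas = [
    [0.90, 0.10],
    [0.90, 0.10],
    [0.01, 0.99],
]

print("average:", average_rule(probas))
print("max:    ", max_rule(probas))
print("product:", product_rule(probas))
```

On this input the average rule picks class 0, while the max and product rules pick class 1: a single near-zero estimate acts almost like a veto under the product rule, which is one reason the choice of rule can matter.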

PhilipMay commented 4 years ago

Hi @Menelau, have you added any kind of soft-voting behavior since your last answer? Thanks, Philip

Menelau commented 4 years ago

Hello @PhilipMay,

I have it on a development branch but have not pushed it to master yet. I haven't decided on the best way to expose this functionality.

One option would be to mimic sklearn's VotingClassifier, which allows either 'hard' or 'soft' voting. That has the benefit of being straightforward and easy to use (which I like), but it would not allow different aggregation methods at the end (product, median, max, etc.).

Another option is to let the user pass a string naming the aggregation function to use (with hard voting as the default). This is more flexible but adds complexity, and I'm not sure users would really benefit from the feature.

Do you have any preference or suggestion? Once we settle on that, I can send a PR with this functionality tomorrow.
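The two options could look roughly like the sketch below. This is only an illustration of the trade-off in plain Python: predict_option_a, predict_option_b, the parameter names and the rule names are all hypothetical, and none of it reflects what was implemented on the development branch.

```python
import math
import statistics
from collections import Counter

def _majority(probas):
    # Hard voting: each classifier votes for its most probable class.
    votes = [p.index(max(p)) for p in probas]
    return Counter(votes).most_common(1)[0][0]

def _by_score(reduce_fn):
    # Build a rule that reduces each class column to a score, then takes argmax.
    def rule(probas):
        scores = [reduce_fn(col) for col in zip(*probas)]
        return scores.index(max(scores))
    return rule

_RULES = {
    "majority": _majority,
    "average": _by_score(lambda col: sum(col) / len(col)),
    "maximum": _by_score(max),
    "product": _by_score(math.prod),
    "median": _by_score(statistics.median),
}

def predict_option_a(probas, voting="hard"):
    """Option A: two choices only, in the style of VotingClassifier."""
    if voting not in ("hard", "soft"):
        raise ValueError("voting must be 'hard' or 'soft'")
    return _RULES["majority" if voting == "hard" else "average"](probas)

def predict_option_b(probas, aggregation="majority"):
    """Option B: any named rule, with hard (majority) voting as the default."""
    if aggregation not in _RULES:
        raise ValueError(f"unknown aggregation: {aggregation!r}")
    return _RULES[aggregation](probas)

# One row per base classifier, one column per class (made-up values).
sample = [[0.55, 0.45], [0.55, 0.45], [0.05, 0.95]]
print(predict_option_a(sample, voting="hard"), predict_option_a(sample, voting="soft"))
print({name: predict_option_b(sample, aggregation=name) for name in _RULES})
```

Option A keeps the surface small and familiar, at the cost of hiding every rule except majority and average. Option B exposes all rules through one string argument, which means more values to validate and document.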

PhilipMay commented 4 years ago

What about soft voting? Is it implemented? Why is this closed?

Menelau commented 4 years ago

Well, it was closed for lack of activity. My last comment was on May 12, and it has not received any response since.

PhilipMay commented 4 years ago

Ahh ok.

"I have it on a development branch but have not pushed to master yet."

I thought you were developing it.