scikit-learn-contrib / DESlib

A Python library for dynamic classifier and ensemble selection
BSD 3-Clause "New" or "Revised" License

Is the pool of heterogeneous classifiers expected to be prefitted? #182

Closed zoj613 closed 4 years ago

zoj613 commented 4 years ago

Looking at the repo, it seems the checks expect the pool to be prefitted if pool_classifiers is not None. It also seems to require that the data passed to fit not be the data used to train the prefitted pool of heterogeneous classifiers. Neither point seems to be emphasized in the documentation. Am I missing something? https://github.com/scikit-learn-contrib/DESlib/blob/a22defa871144b4e451364e0c2ba23db359d77f0/deslib/base.py#L207-L228

Menelau commented 4 years ago

@zoj613 Hello,

Yes, the current version expects the pool to be prefitted, and that point is not very clear in the documentation. In fact, that is something I want to change in upcoming versions, since we have already implemented routines that fit the pool of classifiers internally when the pool is None. So there is no reason not to also accept an unfitted pool and do everything inside the fit of a DS method.
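To make the "prefitted" requirement concrete, here is a minimal sketch of how one could verify that every member of a heterogeneous pool has been fitted before handing it to a DS method. It uses only scikit-learn; the helper name `pool_is_fitted` and the choice of classifiers are assumptions made for illustration, and this is not the check that DESlib itself performs.

```python
from sklearn.datasets import make_classification
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import Perceptron
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.validation import check_is_fitted


def pool_is_fitted(pool):
    """Hypothetical helper: True only if every estimator in the pool is fitted."""
    for clf in pool:
        try:
            check_is_fitted(clf)
        except NotFittedError:
            return False
    return True


# Synthetic data, used only to have something to fit on.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# A small heterogeneous pool (three different model families).
pool = [
    Perceptron(max_iter=100, random_state=0),
    GaussianNB(),
    DecisionTreeClassifier(max_depth=3, random_state=0),
]

fitted_before = pool_is_fitted(pool)  # nothing has been trained yet

for clf in pool:
    clf.fit(X, y)

fitted_after = pool_is_fitted(pool)  # every member now has learned attributes
```

Running a check like this up front gives a clear error at the call site instead of a failure deep inside the DS method's fit.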

As for requiring a different dataset for fitting the DS method, this is a practice used by many works in the dynamic selection literature, especially when the pool contains strong classifiers (e.g., SVMs), which could overfit certain regions of the feature space.

However, a completely separate partition is not always required when the pool is composed of weak classifiers or when we are dealing with very small datasets. From my experience, in these cases using the same data (or data that partially overlaps with the training data) helps improve results. This point is discussed in the library tutorial: https://deslib.readthedocs.io/en/latest/user_guide/tutorial.html
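The two partitioning options described above can be sketched with scikit-learn alone. The helper `split_for_ds`, its parameters, and the 50/50 split are assumptions made for illustration, not part of DESlib; the sketch only shows how the data for training the pool and the data for fitting the DS method could be kept disjoint or deliberately shared.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split


def split_for_ds(X, y, dsel_size=0.5, reuse_training_data=False, random_state=0):
    """Hypothetical helper returning (X_train, y_train, X_dsel, y_dsel).

    reuse_training_data=False: disjoint partitions, the usual choice when the
    pool contains strong classifiers that may overfit.
    reuse_training_data=True: the same samples serve both purposes, which can
    be reasonable for weak classifiers or very small datasets.
    """
    if reuse_training_data:
        return X, y, X, y
    X_train, X_dsel, y_train, y_dsel = train_test_split(
        X, y, test_size=dsel_size, stratify=y, random_state=random_state
    )
    return X_train, y_train, X_dsel, y_dsel


X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Option 1: train the pool on X_train, fit the DS method on X_dsel.
X_train, y_train, X_dsel, y_dsel = split_for_ds(X, y)

# Option 2: use the same samples for the pool and for the DS method.
X_train_all, y_train_all, X_dsel_all, y_dsel_all = split_for_ds(
    X, y, reuse_training_data=True
)
```

In either case, the pool would be fitted on the first pair and the DS method on the second, for example a DESlib selector constructed with `pool_classifiers=pool` and then fitted on `(X_dsel, y_dsel)`.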