scikit-learn-contrib / DESlib

A Python library for dynamic classifier and ensemble selection
BSD 3-Clause "New" or "Revised" License

How can we create pool of classifiers? #167

Closed sara-eb closed 4 years ago

sara-eb commented 5 years ago

As you mention in your examples, a BaggingClassifier or RandomForestClassifier is itself considered a pool of classifiers.

I am wondering whether I can create a pool of classifiers that combines traditional ensemble methods such as RF and AdaBoost with single classifiers such as SVM and kNN.

Thanks

Menelau commented 5 years ago

@sara-eb Hello,

Sorry for the delayed response. The library accepts any list of classifiers as the pool of classifiers, so it does accept a combination of ensemble methods and single classifier models. There are two ways of doing that:

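    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification()
    rf = RandomForestClassifier(n_estimators=10).fit(X, y)
    adaboost = AdaBoostClassifier(n_estimators=10).fit(X, y)
    svm = SVC().fit(X, y)
    tree = DecisionTreeClassifier().fit(X, y)

    # Option 1: each ensemble counts as a single pool member.
    pool1 = [rf, adaboost, svm, tree]
    # Option 2: the fitted ensemble members (estimators_) are added individually.
    pool2 = list(rf.estimators_) + list(adaboost.estimators_) + [svm, tree]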

In this case, pool1 is a pool of classifiers composed of 4 estimators (although random forest and AdaBoost are each composed of multiple base estimators, the DS method treats each of them as a single model). pool2 treats each member of the random forest and of AdaBoost as a single, independent model instead of using their combination. So the DS model sees it as a pool composed of 22 models (10 from rf, 10 from adaboost, 1 SVM and 1 decision tree).
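A minimal sketch of the two options, assuming scikit-learn is installed. It reaches the members of each ensemble through the fitted attribute `estimators_` (with the trailing underscore). The train/DSEL split, the `random_state` values and names such as `X_dsel` are illustrative choices, not part of the original example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hold out half of the data as the dynamic selection set (DSEL).
X, y = make_classification(n_samples=400, random_state=0)
X_train, X_dsel, y_train, y_dsel = train_test_split(
    X, y, test_size=0.5, random_state=0
)

rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_train, y_train)
adaboost = AdaBoostClassifier(n_estimators=10, random_state=0).fit(X_train, y_train)
svm = SVC().fit(X_train, y_train)
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Option 1: every ensemble is one pool member.
pool1 = [rf, adaboost, svm, tree]

# Option 2: every fitted member of each ensemble is its own pool member.
pool2 = list(rf.estimators_) + list(adaboost.estimators_) + [svm, tree]

print(len(pool1))
# Usually 22 here; AdaBoost can keep fewer members if boosting stops early.
print(len(pool2))
```

Either list can then be passed as `pool_classifiers` to a DS method and fitted on the held-out DSEL data, in the same way as the KNORAE call shown further down this thread.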

You may want to check our heterogeneous example too in which we use classifiers of different types in the pool: https://deslib.readthedocs.io/en/latest/auto_examples/example_heterogeneous.html#sphx-glr-auto-examples-example-heterogeneous-py

sara-eb commented 5 years ago

@Menelau Thank you very much, sir. Your explanation is very clear.

Thanks again

sara-eb commented 5 years ago

@Menelau I created a pool of classifiers for my data that includes a random forest with 200 estimators and an AdaBoost classifier with 600 decision trees, and I am using faiss as the knn_type.

    pool_classifiers = [model_ada, model_rf]
    knorae = KNORAE(pool_classifiers=pool_classifiers,
                    knn_classifier=knn_type)
    print("Fitting KNORAE on X_DSEL dataset")
    knorae.fit(X_DSEL, y_DSEL)

    print("Saving the dynamic selection model in ", ds_model_outdir)
    outfile = ds_model_outdir+'KNORAE_rfE200_adaDT600.joblib'
    print(outfile)
    dump(knorae, outfile)

Since my validation (i.e., DSEL) dataset has a large number of samples, I was trying to fit the DS model on the validation data and save it for later prediction on the test dataset. However, saving the model fails with: TypeError: can't pickle module objects

What could be the reason?

Menelau commented 5 years ago

@sara-eb Hello,

I have a feeling that it happens because of the information stored in the faiss knn but I'm not sure. I will investigate that and get back to you asap.
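One way to narrow down a suspicion like this is to try pickling each attribute of the object separately and see which one fails. The helper below is a hypothetical debugging aid (`find_unpicklable` is not part of DESlib or of `pickle`), shown on a toy object that stores a module reference. It inspects only the top-level attributes returned by `vars()`.

```python
import math  # any module object serves as an unpicklable stand-in
import pickle
from types import SimpleNamespace


def find_unpicklable(obj):
    """Return {attribute_name: error_text} for attributes that fail to pickle."""
    failures = {}
    for name, value in vars(obj).items():
        try:
            pickle.dumps(value)
        except Exception as exc:  # pickle can raise several exception types
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


# Toy stand-in for a fitted model: two ordinary attributes and one module.
toy_model = SimpleNamespace(k=7, weights=[0.1, 0.9], backend=math)
failures = find_unpicklable(toy_model)
print(failures)
```

Called on a real fitted model, the same loop would report only the offending top-level attribute, so an unpicklable object nested deeper would need the helper applied again to that attribute.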

You can try using dill instead of pickle for saving the model: https://pypi.org/project/dill/ I believe that should work for you.

sara-eb commented 5 years ago

@Menelau

Thanks for the recommendation. I installed dill and tried to save with dill:


pickle_filename = ds_model_outdir+'KNORAE_rfE200_adaDT600.pkl'
pickle.dump(knorae, open(pickle_filename,'wb'))

However, I am still getting an error: TypeError: can't pickle SwigPyObject objects

I trained the RandomForest classifier in parallel. Could this be the reason?

Menelau commented 5 years ago

@sara-eb ,

A parallel random forest shouldn’t be a problem at all. I dug deeper into this issue and found a problem with the serialization of the Faiss KNN. In this case, the index computed by the faiss knn needs to be serialized (converted to an array of bytes) before it is written to a file (see https://github.com/facebookresearch/faiss/issues/914).

So I prepared a workaround with functions for saving and loading DS models that should solve this problem (save_ds, load_ds). They check whether faiss is being used for the knn calculation in the DS model and, if so, do the conversions before saving/loading. I added the code in this gist: https://gist.github.com/Menelau/0cde51c3622be6313fd96b4dffb17996 Can you check whether this workaround solves your problem?
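The convert-before-saving, convert-back-after-loading pattern can be sketched with the standard library alone. This is not the code from the gist: `ToyIndex`, `save_model` and `load_model` are hypothetical names, and a `threading.Lock` stands in for the native Faiss object only because it is another object that pickle refuses to serialize.

```python
import pickle
import threading
from types import SimpleNamespace


class ToyIndex:
    """Stand-in for a native index object that pickle cannot serialize."""

    def __init__(self, values):
        self.values = list(values)
        # Locks are unpicklable, much like SWIG-wrapped objects.
        self._native_handle = threading.Lock()

    def search(self, query):
        """Return the position of the stored value closest to `query`."""
        return min(range(len(self.values)),
                   key=lambda i: abs(self.values[i] - query))


def save_model(model):
    """Pickle `model` to bytes, swapping the index for plain data while dumping."""
    index = model.index_
    model.index_ = list(index.values)  # serializable form of the index
    try:
        return pickle.dumps(model)
    finally:
        model.index_ = index  # keep the in-memory model usable


def load_model(blob):
    """Unpickle `blob` and rebuild the index from the plain data."""
    model = pickle.loads(blob)
    model.index_ = ToyIndex(model.index_)
    return model


model = SimpleNamespace(k=7, index_=ToyIndex([0.0, 5.0, 10.0]))
restored = load_model(save_model(model))
print(restored.index_.search(6.0))  # prints 1
```

Restoring the original index in the `finally` clause is a design choice of this sketch. A save function that leaves the converted data in place would make the in-memory model unusable until it is loaded again, because plain data has no `search` method.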

Now I will see how to add to DESlib a saving/loading functionality for the DS methods (that can handle Faiss knn automatically) as soon as possible.

sara-eb commented 5 years ago

@Menelau Thank you very much, sir. It works perfectly. I appreciate it.

sara-eb commented 4 years ago

@Menelau I am now facing a new issue when scoring on the test set. What could be the reason?

 score = knorae.score(X_test, y_test)
  File "/home/esara/deslib-env/lib/python3.6/site-packages/sklearn/base.py", line 357, in score
    return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
  File "/home/esara/deslib-env/lib/python3.6/site-packages/deslib/base.py", line 440, in predict
    distances, neighbors = self._get_region_competence(X_DS)
  File "/home/esara/deslib-env/lib/python3.6/site-packages/deslib/base.py", line 381, in _get_region_competence
    return_distance=True)
  File "/home/esara/deslib-env/lib/python3.6/site-packages/deslib/util/faiss_knn_wrapper.py", line 112, in kneighbors
    dist, idx = self.index_.search(X, n_neighbors)
AttributeError: 'numpy.ndarray' object has no attribute 'search'

Menelau commented 4 years ago

Hello,

How did you load the ds model? Did you use the load_ds function I provided in the gist: https://gist.github.com/Menelau/0cde51c3622be6313fd96b4dffb17996 ?

I believe the error is in the way you are loading the DS model. In order to save the Faiss model, its index is converted to a numpy array so that it can be pickled. The self.index_ variable is the one containing the index, so it is serialized in the save_ds function (by converting it to a numpy array). Then, in order to load the model, the numpy array needs to be converted back to a Faiss index (which the load_ds function in the gist performs).

sara-eb commented 4 years ago

@Menelau Thanks a lot, sir. Sorry, I did not realize that I needed to reload the model, since it was still held in a variable in memory. Thank you very much for pointing that out.