scikit-learn-contrib / DESlib

A Python library for dynamic classifier and ensemble selection
BSD 3-Clause "New" or "Revised" License

Fitting DESClustering on X_dsel raises a ValueError when the pool is a (pre-trained) Random Forest classifier model #212

Closed sara-eb closed 4 years ago

sara-eb commented 4 years ago

I trained the random forest on X_train in advance and load the model to create the pool of classifiers:

model_rf = load(base_model_dir + 'model_rf800.joblib')
pool_classifiers = [model_rf]  

Other DS methods (e.g., OLA, MLA, DESP, etc.) fit successfully on X_dsel; however, DESClustering raises a ValueError:

print("Fitting DES-Clustering on X_DSEL dataset")
kmeans = KMeans(n_clusters=80, random_state = rng)
desclustering = DESClustering(pool_classifiers=pool_classifiers, random_state = rng, clustering = kmeans)

The error is raised at the call to desclustering.fit(X_dsel, y_dsel):

ValueError                                Traceback (most recent call last)
<ipython-input-10-6583b8d75519> in <module>
     11 desclustering = DESClustering(pool_classifiers=pool_classifiers, random_state = rng, clustering = kmeans)
     12 
---> 13 desclustering.fit(X_dsel, y_dsel)
     14 end = time.clock()
     15 print(" DES-Clustering fitting time for 5 patients in DSEL = {}".format(end))

~/deslib-env/lib/python3.6/site-packages/deslib/des/des_clustering.py in fit(self, X, y)
    132         self.J_ = int(np.ceil(self.n_classifiers_ * self.pct_diversity))
    133 
--> 134         self._check_parameters()
    135 
    136         if self.clustering is None:

~/deslib-env/lib/python3.6/site-packages/deslib/des/des_clustering.py in _check_parameters(self)
    374         if self.N_ <= 0 or self.J_ <= 0:
    375             raise ValueError("The values of N_ and J_ should be higher than 0"
--> 376                              "N_ = {}, J_= {} ".format(self.N_, self.J_))
    377         if self.N_ < self.J_:
    378             raise ValueError(

ValueError: The values of N_ and J_ should be higher than 0N_ = 0, J_= 1 
  1. Why is this happening when a Random Forest is given as the pool to DESClustering?

  2. Is it because I have a pre-trained RF model and am loading it as the pool? Is there any difference between loading a pre-trained classifier and training the pool in place, as in:

    pool_classifiers = BaggingClassifier(Perceptron(max_iter=100), random_state=rng)
    pool_classifiers.fit(X_train, y_train)

Your expert opinion is really appreciated. Thanks

Menelau commented 4 years ago

@sara-eb Hello,

Thanks for reporting this issue. I believe the problem is that you are putting the RF classifier inside a list:

pool_classifiers = [model_rf]

So when DESClustering receives the models as input, it sees only one model (the list has a single element) instead of the individual models inside the RF. For that reason it cannot properly set up the variables N_ and J_, which correspond to the number of classifiers selected from the pool based on accuracy and diversity, respectively (both are computed as a fraction of the total pool size).
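As a rough illustration (a sketch mirroring the formula visible in the traceback above; pct_accuracy=0.5 and pct_diversity=0.33 are assumed to be the library defaults), a pool that looks like a single classifier yields N_ = 0:

```python
import numpy as np

# Sketch of how DESClustering derives N_ and J_ from the pool size,
# following the lines shown in the traceback; the percentages are
# assumed defaults, not values confirmed by this thread.
def selection_sizes(n_classifiers, pct_accuracy=0.5, pct_diversity=0.33):
    n_selected = int(n_classifiers * pct_accuracy)            # chosen by accuracy
    j_selected = int(np.ceil(n_classifiers * pct_diversity))  # chosen by diversity
    return n_selected, j_selected

print(selection_sizes(1))    # list-wrapped RF seen as 1 classifier -> (0, 1)
print(selection_sizes(800))  # the 800 trees seen individually -> (400, 264)
```

With a pool size of 1, N_ = int(0.5) = 0, which is exactly the `N_ = 0, J_= 1` reported in the ValueError.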

Try changing that line to just:

pool_classifiers = model_rf

to see if it works.
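One way to see the difference: a fitted scikit-learn ensemble is itself a sequence of its estimators, while a one-element list hides them. A standalone sketch with a small illustrative forest:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, random_state=0)
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# scikit-learn ensembles implement __len__ and __iter__ over their fitted
# trees, so passing the RF directly exposes every individual estimator.
print(len([rf]))  # wrapped in a list -> pool of size 1
print(len(rf))    # passed directly  -> pool of size 10
```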

sara-eb commented 4 years ago

Dear @Menelau, thanks a lot for your prompt response. I had tried this before posting the issue here.

Once I changed it to pool_classifiers = model_rf, fitting DESClustering on X_dsel takes a very long time; even after 24 hours it had not finished, so I had to stop the run, because it is not fitting.

Moreover, since I am using Faiss kNN, I am using this code for saving the model. Once I remove the brackets, the problem arises for the other DS models. For example, in the case of OLA:

Fitting OLA on X_DSEL dataset
 OLA fitting time for 5 patients in DSEL = 455.3
Saving the OLA dynamic selection model

---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
<ipython-input-19-8568481c02e9> in <module>
     12 print("Saving the OLA dynamic selection model")
     13 ola_model_dir = ds_model_outdir+'ola.pkl'
---> 14 save_ds(model_ola, ola_model_dir)
     15 
     16 

<ipython-input-16-0f2979a7cded> in save_ds(dsalgo, path)
     28         dsalgo.roc_algorithm_.index_ = serialize_index(dsalgo.roc_algorithm_.index_)
     29     with open(path, 'wb') as f:
---> 30         dill.dump(dsalgo, f)
     31 
     32 

~/deslib-env/lib/python3.6/site-packages/dill/_dill.py in dump(obj, file, protocol, byref, fmode, recurse, **kwds)
    257     _kwds = kwds.copy()
    258     _kwds.update(dict(byref=byref, fmode=fmode, recurse=recurse))
--> 259     Pickler(file, protocol, **_kwds).dump(obj)
    260     return
    261 

~/deslib-env/lib/python3.6/site-packages/dill/_dill.py in dump(self, obj)
    443             raise PicklingError(msg)
    444         else:
--> 445             StockPickler.dump(self, obj)
    446         stack.clear()  # clear record of 'recursion-sensitive' pickled objects
    447         return

/usr/local/python/3.6.2-static/lib/python3.6/pickle.py in dump(self, obj)
    407         if self.proto >= 4:
    408             self.framer.start_framing()
--> 409         self.save(obj)
    410         self.write(STOP)
    411         self.framer.end_framing()

/usr/local/python/3.6.2-static/lib/python3.6/pickle.py in save(self, obj, save_persistent_id)
    519 
    520         # Save the reduce() output and finally memoize the object
--> 521         self.save_reduce(obj=obj, *rv)
    522 
    523     def persistent_id(self, obj):

/usr/local/python/3.6.2-static/lib/python3.6/pickle.py in save_reduce(self, func, args, state, listitems, dictitems, obj)
    632 
    633         if state is not None:
--> 634             save(state)
    635             write(BUILD)
    636 

/usr/local/python/3.6.2-static/lib/python3.6/pickle.py in save(self, obj, save_persistent_id)
    474         f = self.dispatch.get(t)
    475         if f is not None:
--> 476             f(self, obj) # Call unbound method with explicit self
    477             return
    478 

~/deslib-env/lib/python3.6/site-packages/dill/_dill.py in save_module_dict(pickler, obj)
    910             # we only care about session the first pass thru
    911             pickler._session = False
--> 912         StockPickler.save_dict(pickler, obj)
    913         log.info("# D2")
    914     return

/usr/local/python/3.6.2-static/lib/python3.6/pickle.py in save_dict(self, obj)
    819 
    820         self.memoize(obj)
--> 821         self._batch_setitems(obj.items())
    822 
    823     dispatch[dict] = save_dict

/usr/local/python/3.6.2-static/lib/python3.6/pickle.py in _batch_setitems(self, items)
    845                 for k, v in tmp:
    846                     save(k)
--> 847                     save(v)
    848                 write(SETITEMS)
    849             elif n:

/usr/local/python/3.6.2-static/lib/python3.6/pickle.py in save(self, obj, save_persistent_id)
    519 
    520         # Save the reduce() output and finally memoize the object
--> 521         self.save_reduce(obj=obj, *rv)
    522 
    523     def persistent_id(self, obj):

/usr/local/python/3.6.2-static/lib/python3.6/pickle.py in save_reduce(self, func, args, state, listitems, dictitems, obj)
    632 
    633         if state is not None:
--> 634             save(state)
    635             write(BUILD)
    636 

/usr/local/python/3.6.2-static/lib/python3.6/pickle.py in save(self, obj, save_persistent_id)
    474         f = self.dispatch.get(t)
    475         if f is not None:
--> 476             f(self, obj) # Call unbound method with explicit self
    477             return
    478 

/usr/local/python/3.6.2-static/lib/python3.6/pickle.py in save_tuple(self, obj)
    749         write(MARK)
    750         for element in obj:
--> 751             save(element)
    752 
    753         if id(obj) in memo:

/usr/local/python/3.6.2-static/lib/python3.6/pickle.py in save(self, obj, save_persistent_id)
    474         f = self.dispatch.get(t)
    475         if f is not None:
--> 476             f(self, obj) # Call unbound method with explicit self
    477             return
    478 

/usr/local/python/3.6.2-static/lib/python3.6/pickle.py in save_bytes(self, obj)
    699             self.write(BINBYTES8 + pack("<Q", n) + obj)
    700         else:
--> 701             self.write(BINBYTES + pack("<I", n) + obj)
    702         self.memoize(obj)
    703     dispatch[bytes] = save_bytes

error: 'I' format requires 0 <= number <= 4294967295
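For reference, the final error (`'I' format requires 0 <= number <= 4294967295`) indicates the pickler hit a byte string larger than 4 GiB while using a pickle protocol below 4. A possible workaround (a sketch only, not tested against Faiss-backed DS models; `save_large`/`load_large` are hypothetical helper names) is to force protocol 4:

```python
import pickle

# Sketch: pickle protocol 4 (Python 3.4+) adds 8-byte length framing and so
# supports objects larger than 4 GiB, which the default protocol on
# Python 3.6 does not. dill.dump also accepts a protocol argument, so the
# same idea applies to the save_ds helper shown above.
def save_large(obj, path):
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=4)

def load_large(path):
    with open(path, "rb") as f:
        return pickle.load(f)
```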
sara-eb commented 4 years ago

@Menelau Hi again, any update on this? It seems the model is too big to be saved by dill.dump(dsalgo, f).

I tried to find a solution for saving it. In this link, the author suggests saving the model as an HDF5 file.

I am not sure whether I have written the command correctly, but I was trying to save the model `ola` as HDF5:

    from klepto.archives import *
    file_archive('model_la.pkl',ola,serialized=True)

It raises an error:

```
~/my-env/lib/python3.6/site-packages/klepto/archives.py in __new__(file_archive, name, dict, cached, **kwds)
    118         archive = _file_archive(name, **kwds)
    119         if cached: archive = cache(archive=archive)
--> 120         archive.update(dict)
    121         return archive
    122

TypeError: 'OLA' object is not iterable
```
Do you have any idea how I can save these big models trained on big datasets?
Menelau commented 4 years ago

Hello,

If the model is too big, I believe HDF5 is a better option since it is designed for large and complex data formats. However, I'm not familiar with klepto or the whole HDF5 saving process. I think your best bet would be checking the h5py repository: https://github.com/h5py/h5py