scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

RandomizedSearchCV: All estimators failed to fit hdbscan #557

Open Jalanjii opened 2 years ago

Jalanjii commented 2 years ago

I have done clustering with hdbscan and everything works. Now I want to evaluate/validate the clusters while tuning hyperparameters with the code below. The matrix I pass in is a dissimilarity matrix computed beforehand with a metric that is not available in HDBSCAN, which is why it is precomputed.

from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import make_scorer
import logging

import hdbscan

logging.captureWarnings(True)
model2 = hdbscan.HDBSCAN(metric='precomputed').fit(mp_matrix)

param_dist = {'min_samples': [1, 2, 3],          # ,14,21,28,35,70
              'min_cluster_size': [2, 3, 4],     # ,5,6,7,8,9,10,11,12,13,14
              'cluster_selection_method': ['eom', 'leaf']
             }

# DBCV-style validity index as the scoring function
validity_scorer = make_scorer(hdbscan.validity.validity_index, greater_is_better=True)

SEED = 42
n_iter_search = 20
random_search = RandomizedSearchCV(model2,
                                   param_distributions=param_dist,
                                   n_iter=n_iter_search,
                                   scoring=validity_scorer,
                                   random_state=SEED)
random_search.fit(mp_matrix)

print(f"Best Parameters {random_search.best_params_}")

and I get the following traceback:

---------------------------------------------------------------------------
NotFittedError                            Traceback (most recent call last)
Input In [78], in <cell line: 18>()
     16 n_iter_search = 20
     17 random_search = RandomizedSearchCV(model2, param_distributions=param_dist, n_iter=n_iter_search, scoring=validity_scorer, random_state=SEED)
---> 18 random_search.fit(mp_matrix)

File D:\Other\Apps\Anaconda\Lib\site-packages\sklearn\model_selection\_search.py:891, in BaseSearchCV.fit(self, X, y, groups, **fit_params)
    885     results = self._format_results(
    886         all_candidate_params, n_splits, all_out, all_more_results
    887     )
    889     return results
--> 891 self._run_search(evaluate_candidates)
    893 # multimetric is determined here because in the case of a callable
    894 # self.scoring the return type is only known after calling
    895 first_test_score = all_out[0]["test_scores"]

File D:\Other\Apps\Anaconda\Lib\site-packages\sklearn\model_selection\_search.py:1766, in RandomizedSearchCV._run_search(self, evaluate_candidates)
   1764 def _run_search(self, evaluate_candidates):
   1765     """Search n_iter candidates from param_distributions"""
-> 1766     evaluate_candidates(
   1767         ParameterSampler(
   1768             self.param_distributions, self.n_iter, random_state=self.random_state
   1769         )
   1770     )

File D:\Other\Apps\Anaconda\Lib\site-packages\sklearn\model_selection\_search.py:875, in BaseSearchCV.fit.<locals>.evaluate_candidates(candidate_params, cv, more_results)
    870 # For callable self.scoring, the return type is only know after
    871 # calling. If the return type is a dictionary, the error scores
    872 # can now be inserted with the correct key. The type checking
    873 # of out will be done in `_insert_error_scores`.
    874 if callable(self.scoring):
--> 875     _insert_error_scores(out, self.error_score)
    877 all_candidate_params.extend(candidate_params)
    878 all_out.extend(out)

File D:\Other\Apps\Anaconda\Lib\site-packages\sklearn\model_selection\_validation.py:331, in _insert_error_scores(results, error_score)
    328         successful_score = result["test_scores"]
    330 if successful_score is None:
--> 331     raise NotFittedError("All estimators failed to fit")
    333 if isinstance(successful_score, dict):
    334     formatted_error = {name: error_score for name in successful_score}

NotFittedError: All estimators failed to fit

I have no idea where the problem lies. Can you help, please?
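(A minimal debugging sketch, assuming only scikit-learn's standard error_score option: rerunning the same search with error_score='raise' should surface the actual exception each candidate fit hits, instead of the collapsed "All estimators failed to fit" message.)

# Sketch: same search as above, but the first per-candidate exception
# propagates instead of being swallowed by the search.
random_search = RandomizedSearchCV(model2,
                                   param_distributions=param_dist,
                                   n_iter=n_iter_search,
                                   scoring=validity_scorer,
                                   random_state=SEED,
                                   error_score='raise')
random_search.fit(mp_matrix)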

Note: when I remove metric='precomputed' and instead set gen_min_span_tree=True, using metrics other than the one I computed manually, there is no problem. Why is that, and how can I make the code work with the already-computed (precomputed) matrix?
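For reference, a CV-free sketch of the kind of search I am after. One possible cause of the failures (an assumption, not confirmed) is that RandomizedSearchCV's cross-validation splits the rows of the square precomputed matrix, so each inner fit receives a non-square slice that metric='precomputed' cannot use. The sketch below samples the same parameter grid without any splitting and scores each candidate with HDBSCAN's relative_validity_ attribute (which requires gen_min_span_tree=True); it assumes gen_min_span_tree=True is usable together with metric='precomputed' on mp_matrix.

import hdbscan
from sklearn.model_selection import ParameterSampler

SEED = 42
param_dist = {'min_samples': [1, 2, 3],
              'min_cluster_size': [2, 3, 4],
              'cluster_selection_method': ['eom', 'leaf']}

best_score, best_params = float('-inf'), None
for params in ParameterSampler(param_dist, n_iter=20, random_state=SEED):
    clusterer = hdbscan.HDBSCAN(metric='precomputed',
                                gen_min_span_tree=True,   # required for relative_validity_
                                **params).fit(mp_matrix)  # full square matrix, no CV split
    score = clusterer.relative_validity_  # DBCV-like score built from the spanning tree
    if score > best_score:
        best_score, best_params = score, params

print(f"Best Parameters {best_params} (relative validity {best_score:.3f})")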