scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

RecursionError exception in single_linkage_tree_.plot() #204

Open chryselectrum opened 6 years ago

chryselectrum commented 6 years ago

I'm trying to get more information on the clustering using the single_linkage_tree_ figure. However, I get the stack trace below when calling the single_linkage_tree_.plot() method. I'm trying to cluster around 200000 data points with 100 features.

Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/scipy/cluster/hierarchy.py", line 2749, in _dendrogram_calculate_info
    above_threshold_color=above_threshold_color)
  File "/usr/lib64/python3.6/site-packages/scipy/cluster/hierarchy.py", line 2749, in _dendrogram_calculate_info
    above_threshold_color=above_threshold_color)
  File "/usr/lib64/python3.6/site-packages/scipy/cluster/hierarchy.py", line 2749, in _dendrogram_calculate_info
    above_threshold_color=above_threshold_color)
  [Previous line repeated 416 more times]
  File "/usr/lib64/python3.6/site-packages/scipy/cluster/hierarchy.py", line 2782, in _dendrogram_calculate_info
    above_threshold_color=above_threshold_color)
  File "/usr/lib64/python3.6/site-packages/scipy/cluster/hierarchy.py", line 2749, in _dendrogram_calculate_info
    above_threshold_color=above_threshold_color)
  File "/usr/lib64/python3.6/site-packages/scipy/cluster/hierarchy.py", line 2749, in _dendrogram_calculate_info
    above_threshold_color=above_threshold_color)
  File "/usr/lib64/python3.6/site-packages/scipy/cluster/hierarchy.py", line 2749, in _dendrogram_calculate_info
    above_threshold_color=above_threshold_color)
  [Previous line repeated 444 more times]
  File "/usr/lib64/python3.6/site-packages/scipy/cluster/hierarchy.py", line 2782, in _dendrogram_calculate_info
    above_threshold_color=above_threshold_color)
  File "/usr/lib64/python3.6/site-packages/scipy/cluster/hierarchy.py", line 2749, in _dendrogram_calculate_info
    above_threshold_color=above_threshold_color)
  File "/usr/lib64/python3.6/site-packages/scipy/cluster/hierarchy.py", line 2749, in _dendrogram_calculate_info
    above_threshold_color=above_threshold_color)
  File "/usr/lib64/python3.6/site-packages/scipy/cluster/hierarchy.py", line 2749, in _dendrogram_calculate_info
    above_threshold_color=above_threshold_color)
  [Previous line repeated 125 more times]
  File "/usr/lib64/python3.6/site-packages/scipy/cluster/hierarchy.py", line 2617, in _dendrogram_calculate_info
    if n == 0:
RecursionError: maximum recursion depth exceeded in comparison

Any help available on how to tackle the problem?

lmcinnes commented 6 years ago

I'm afraid that single linkage tree plotting simply won't work with that much data. You'll have to use the condensed tree plots instead. Sorry.
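
For anyone wondering why the traceback above ends in a RecursionError: a recursive walk over a tree needs one stack frame per level, and a single linkage tree can be very unbalanced, so with enough points the depth can exceed Python's recursion limit. The sketch below is a toy illustration only, not scipy's or hdbscan's actual code. The tree layout and the names `build_chain_tree`, `leaves_recursive` and `leaves_iterative` are made up for the example.

```python
import sys


def build_chain_tree(n_leaves):
    """Build a maximally unbalanced binary tree (hypothetical toy layout).

    Leaves are numbered 0..n_leaves-1. Each internal node merges the
    previous subtree with exactly one new leaf, so the tree depth is
    n_leaves - 1. Returns (children, root), where children maps an
    internal node id to its (left, right) pair.
    """
    children = {}
    prev = 0
    for i in range(1, n_leaves):
        node = n_leaves + i - 1
        children[node] = (prev, i)
        prev = node
    return children, prev


def leaves_recursive(children, node):
    """Collect leaves left to right using one Python stack frame per level."""
    if node not in children:
        return [node]
    left, right = children[node]
    return leaves_recursive(children, left) + leaves_recursive(children, right)


def leaves_iterative(children, root):
    """Collect the same leaves with an explicit stack instead of recursion."""
    stack = [root]
    leaves = []
    while stack:
        node = stack.pop()
        if node in children:
            left, right = children[node]
            # Push right first so the left subtree is visited first.
            stack.append(right)
            stack.append(left)
        else:
            leaves.append(node)
    return leaves


if __name__ == "__main__":
    n = 200_000  # same order of magnitude as the data set in this issue
    children, root = build_chain_tree(n)
    print("recursion limit:", sys.getrecursionlimit())
    try:
        leaves_recursive(children, root)
    except RecursionError as exc:
        print("recursive walk failed:", exc)
    print("iterative walk found", len(leaves_iterative(children, root)), "leaves")
```

The iterative version only shows that the depth limit is a property of the recursive traversal. It does not make a dendrogram of 200000 points readable, which is why the condensed tree plot is the suggested alternative.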

chryselectrum commented 6 years ago

Thank you for your quick response. Using condensed_tree_.plot() I also ran into problems. This time I get a stack overflow:

Fatal Python error: Cannot recover from stack overflow.

Current thread 0x00007fada06ec740 (most recent call first):
  File "/usr/lib64/python3.6/site-packages/hdbscan/plots.py", line 36 in _recurse_leaf_dfs
  File "/usr/lib64/python3.6/site-packages/hdbscan/plots.py", line 40 in <listcomp>
  File "/usr/lib64/python3.6/site-packages/hdbscan/plots.py", line 40 in _recurse_leaf_dfs
  File "/usr/lib64/python3.6/site-packages/hdbscan/plots.py", line 40 in <listcomp>
  File "/usr/lib64/python3.6/site-packages/hdbscan/plots.py", line 40 in _recurse_leaf_dfs
  File "/usr/lib64/python3.6/site-packages/hdbscan/plots.py", line 40 in <listcomp>
  File "/usr/lib64/python3.6/site-packages/hdbscan/plots.py", line 40 in _recurse_leaf_dfs
  File "/usr/lib64/python3.6/site-packages/hdbscan/plots.py", line 40 in <listcomp>
  File "/usr/lib64/python3.6/site-packages/hdbscan/plots.py", line 40 in _recurse_leaf_dfs
  File "/usr/lib64/python3.6/site-packages/hdbscan/plots.py", line 40 in <listcomp>
  File "/usr/lib64/python3.6/site-packages/hdbscan/plots.py", line 40 in _recurse_leaf_dfs
  File "/usr/lib64/python3.6/site-packages/hdbscan/plots.py", line 40 in <listcomp>
  File "/usr/lib64/python3.6/site-packages/hdbscan/plots.py", line 40 in _recurse_leaf_dfs
  File "/usr/lib64/python3.6/site-packages/hdbscan/plots.py", line 40 in <listcomp>
  File "/usr/lib64/python3.6/site-packages/hdbscan/plots.py", line 40 in _recurse_leaf_dfs
  File "/usr/lib64/python3.6/site-packages/hdbscan/plots.py", line 40 in <listcomp>
  File "/usr/lib64/python3.6/site-packages/hdbscan/plots.py", line 40 in _recurse_leaf_dfs
...

Anything that can be done about it?

lmcinnes commented 6 years ago

That would mean that your min_cluster_size is too low to get a sensible plot out. You will need to increase the min_cluster_size parameter to something rather larger. In doing so you may want to set the min_samples parameter explicitly (otherwise it defaults to whatever value you provide for min_cluster_size). In this case min_samples can probably be set to whatever value you were originally using for min_cluster_size.
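
A minimal sketch of that parameter change, assuming the hdbscan package and matplotlib are installed. The helper names `plotting_params` and `fit_and_plot` and the values 5 and 500 are hypothetical and only for illustration; sensible sizes depend on your data.

```python
def plotting_params(original_min_cluster_size, larger_min_cluster_size):
    """Build HDBSCAN keyword arguments for a plot-friendly run (hypothetical helper).

    min_samples is pinned to the value previously used for min_cluster_size,
    while min_cluster_size itself is raised so the condensed tree has fewer,
    larger clusters.
    """
    if larger_min_cluster_size <= original_min_cluster_size:
        raise ValueError("larger_min_cluster_size must exceed the original value")
    return {
        "min_cluster_size": larger_min_cluster_size,
        "min_samples": original_min_cluster_size,
    }


def fit_and_plot(data, params):
    """Fit HDBSCAN with the given parameters and draw the condensed tree."""
    # Imported here so plotting_params can be used without hdbscan installed.
    import hdbscan

    clusterer = hdbscan.HDBSCAN(**params).fit(data)
    clusterer.condensed_tree_.plot()
    return clusterer


# Example usage (illustrative values, not a recommendation):
# params = plotting_params(original_min_cluster_size=5, larger_min_cluster_size=500)
# clusterer = fit_and_plot(data, params)
```

Setting min_samples explicitly keeps the density estimate close to the original run, so only the minimum size at which a split counts as a cluster changes.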

chryselectrum commented 6 years ago

Thank you for your help. I managed to get a better understanding of the clustering by increasing min_cluster_size and keeping min_samples small.

lmcinnes commented 6 years ago

Glad I could help you get something that worked out for now.