scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
2.77k stars 496 forks source link

Incorrect single linkage trees when using Boruvka ball trees for data with duplicate entries #404

Open GregDemand opened 4 years ago

GregDemand commented 4 years ago

The borvuka_balltree algorithm produces incorrect single linkage trees for some data sets with duplicate entries. The incorrect single linkage trees result in incorrect clusterings for these data sets.

It appears that the issue is mostly caused by a typo in BallTreeBoruvkaAlgorithm._compute_bounds().

As a note, I have been able to replicate the issue on two linux machines and failed to replicate the issue on a MacOS machine.

The issue can be seen using the following code snippet on the provided sample data (boruvka_testing_data.tar.gz) with duplicate entries:

import pandas as pd
import hdbscan

data = pd.read_csv('boruvka_testing_data.csv', index_col=0)

metric='euclidean'
min_cluster_size=50

clusterer = hdbscan.HDBSCAN(metric=metric, algorithm='generic', min_cluster_size=min_cluster_size, approx_min_span_tree=False)
clusterer_prims = hdbscan.HDBSCAN(metric=metric, algorithm='prims_balltree', min_cluster_size=min_cluster_size, approx_min_span_tree=False)
clusterer_boruvka = hdbscan.HDBSCAN(metric=metric, algorithm='boruvka_balltree', min_cluster_size=min_cluster_size, approx_min_span_tree=False)
clusterer_prims_kdtree = hdbscan.HDBSCAN(metric=metric, algorithm='prims_kdtree', min_cluster_size=min_cluster_size, approx_min_span_tree=False)
clusterer_boruvka_kdtree = hdbscan.HDBSCAN(metric=metric, algorithm='boruvka_kdtree', min_cluster_size=min_cluster_size, approx_min_span_tree=False)

clusterer.fit(data)
clusterer_prims.fit(data)
clusterer_boruvka.fit(data)
clusterer_prims_kdtree.fit(data)
clusterer_boruvka_kdtree.fit(data)

print('Generic:', sum(clusterer.single_linkage_tree_.to_pandas()['distance']))
print('Prims balltree:', sum(clusterer_prims.single_linkage_tree_.to_pandas()['distance']))
print('Boruvka balltree:', sum(clusterer_boruvka.single_linkage_tree_.to_pandas()['distance']))
print('Prims kdtree:', sum(clusterer_prims_kdtree.single_linkage_tree_.to_pandas()['distance']))
print('Borvuka kdtree:', sum(clusterer_boruvka_kdtree.single_linkage_tree_.to_pandas()['distance']))

Using code from master produces the following output:

Generic: 913166.8450209984
Prims balltree: 900935.9500860934
Boruvka balltree: 1675822.769750423
Prims kdtree: 900935.9500860934
Borvuka kdtree: 913166.8450209984 

The sum of distances in the single_linkage tree for the Borvuka ball tree algorithm is almost twice as large as the correct value.

Using the linked pull request #394 (which also incorporates the fix for #393) produces:

Generic: 913166.8450209984
Prims balltree: 913166.8450209984
Boruvka balltree: 913166.8450209984
Prims kdtree: 913166.8450209984
Borvuka kdtree: 913166.8450209984

As a note, while this pull request produces a dramatic improvement in the results, it does not completely fix the problem. Running the same code with min_cluster_size=5 and the linked pull request, produces the output,

Generic: 218923.7797794972
Prims balltree: 218923.7797794972
Boruvka balltree: 219679.3560692969
Prims kdtree: 218923.7797794972
Borvuka kdtree: 219072.49151237102

which contains discrepancies for both the boruvka_balltree and boruvka_kdtree algorithms. These discrepancies are on the order of 0.5% and should have a much smaller effect on the clustering.

GregDemand commented 4 years ago

I've tracked down and fixed the discrepancy for the boruvka_kdtree algorithm and isolated the issue for the boruvka_balltree.

For the boruvka_kdtree algorithm, the discrepancy was caused by adding a reduced distance to the non-reduced distance when calculating the new_lower_bound. I have corrected this in my pull request.

For the boruvka_kdtree algorithm, the issue is also in the new_lower_bound, but I'm unsure of precisely what the issue is. If only the new_upper_bound is used, i.e. replace

new_bound = min(new_upper_bound, new_lower_bound + 2 * node1_info.radius) with new_bound=new_upper_bound,

I get the correct answer, but the algorithm would be less performant.