Incorrect single linkage trees when using Boruvka ball trees for data with duplicate entries

The borvuka_balltree algorithm produces incorrect single linkage trees for some data sets with duplicate entries. The incorrect single linkage trees result in incorrect clusterings for these data sets.

It appears that the issue is mostly caused by a typo in BallTreeBoruvkaAlgorithm._compute_bounds().

As a note, I have been able to replicate the issue on two linux machines and failed to replicate the issue on a MacOS machine.

The issue can be seen using the following code snippet on the provided sample data (boruvka_testing_data.tar.gz) with duplicate entries:

import pandas as pd
import hdbscan

data = pd.read_csv('boruvka_testing_data.csv', index_col=0)

metric='euclidean'
min_cluster_size=50

clusterer = hdbscan.HDBSCAN(metric=metric, algorithm='generic', min_cluster_size=min_cluster_size, approx_min_span_tree=False)
clusterer_prims = hdbscan.HDBSCAN(metric=metric, algorithm='prims_balltree', min_cluster_size=min_cluster_size, approx_min_span_tree=False)
clusterer_boruvka = hdbscan.HDBSCAN(metric=metric, algorithm='boruvka_balltree', min_cluster_size=min_cluster_size, approx_min_span_tree=False)
clusterer_prims_kdtree = hdbscan.HDBSCAN(metric=metric, algorithm='prims_kdtree', min_cluster_size=min_cluster_size, approx_min_span_tree=False)
clusterer_boruvka_kdtree = hdbscan.HDBSCAN(metric=metric, algorithm='boruvka_kdtree', min_cluster_size=min_cluster_size, approx_min_span_tree=False)

clusterer.fit(data)
clusterer_prims.fit(data)
clusterer_boruvka.fit(data)
clusterer_prims_kdtree.fit(data)
clusterer_boruvka_kdtree.fit(data)

print('Generic:', sum(clusterer.single_linkage_tree_.to_pandas()['distance']))
print('Prims balltree:', sum(clusterer_prims.single_linkage_tree_.to_pandas()['distance']))
print('Boruvka balltree:', sum(clusterer_boruvka.single_linkage_tree_.to_pandas()['distance']))
print('Prims kdtree:', sum(clusterer_prims_kdtree.single_linkage_tree_.to_pandas()['distance']))
print('Borvuka kdtree:', sum(clusterer_boruvka_kdtree.single_linkage_tree_.to_pandas()['distance']))

Using code from master produces the following output:

Generic: 913166.8450209984
Prims balltree: 900935.9500860934
Boruvka balltree: 1675822.769750423
Prims kdtree: 900935.9500860934
Borvuka kdtree: 913166.8450209984

The sum of distances in the single_linkage tree for the Borvuka ball tree algorithm is almost twice as large as the correct value.

Using the linked pull request #394 (which also incorporates the fix for #393) produces:

Generic: 913166.8450209984
Prims balltree: 913166.8450209984
Boruvka balltree: 913166.8450209984
Prims kdtree: 913166.8450209984
Borvuka kdtree: 913166.8450209984

As a note, while this pull request produces a dramatic improvement in the results, it does not completely fix the problem. Running the same code with min_cluster_size=5 and the linked pull request, produces the output,

Generic: 218923.7797794972
Prims balltree: 218923.7797794972
Boruvka balltree: 219679.3560692969
Prims kdtree: 218923.7797794972
Borvuka kdtree: 219072.49151237102

which contains discrepancies for both the boruvka_balltree and boruvka_kdtree algorithms. These discrepancies are on the order of 0.5% and should have a much smaller effect on the clustering.

scikit-learn-contrib / hdbscan

Incorrect single linkage trees when using Boruvka ball trees for data with duplicate entries #404