prediction_data flag fails if used with input sparse matrix

First: Kudos to all the HDBSCAN devs for an utterly gorgeous and extremely powerful library.

TL;DR

HDBSCAN runs successfully on sparse and dense matrix input, but fails with sparse input where the prediction_data=True flag is supplied to the clusterer.

Background and use case

I've been using HDBSCAN to cluster related strings of text. My test case matrix sizeX.shape is (8535 x 1820), with a highly sparse X, so this is presently solvable as either a dense or sparse matrix.

The following code, executed in serial on my laptop...

start_time = time.time()
clusterer = HDBSCAN()
clusterer.fit(X)
end_time = time.time()
logging.debug('Computed clustering in %f seconds', end_time - start_time)

...runs in 22 seconds on a sparse matrix and 181 seconds (+time to construct dense) on a dense one... so there's a clear use case for sparse inputs. Additionally, the memory requirement for dense X scales as O(N^2) so dense solution is not scalable to large datasets.

Reproduction


# Construct a large sparse X
# X.__class__.__name__ should be 'sparse.csr_matrix'
X = build_your_X()

logging.debug('Computing HDBSCAN clustering for %d files, with %d features in X', *X.shape)

# Compute HDBSCAN on the input sparse
clusterer = HDBSCAN()
clusterer.fit(X) # Works successfully

# Compute HDBSCAN on the input dense, with prediction data
clusterer_pd_dense = HDBSCAN(prediction_data=True)
X_dense = X.to_dense()
clusterer_pd_dense.fit(X_dense) # Works successfully

# Compute HDBSCAN on the input sparse, with prediction data
clusterer_pd_sparse= HDBSCAN(prediction_data=True)
clusterer_pd_sparse.fit(X) # !!! FAILS !!!

The resulting stack trace is as follows:

DEBUG:root:Computing HDBSCAN clustering for 8535 files, with 1820 features in X Traceback (most recent call last): File "anon_path/scrap.py", line 133, in clusterer_pd_sparse.fit(X) File "anonpath/venv/lib/python3.5/site-packages/hdbscan/hdbscan.py", line 819, in fit self.generate_prediction_data() File "anonpath/venv/lib/python3.5/site-packages/hdbscan/hdbscan.py", line 861, in generate_prediction_data self._metric_kwargs File "anon_path/venv/lib/python3.5/site-packages/hdbscan/prediction.py", line 102, in init metric=metric, kwargs) File "sklearn/neighbors/binary_tree.pxi", line 1054, in sklearn.neighbors.kd_tree.BinaryTree.init File "anon_path/venv/lib/python3.5/site-packages/numpy/core/numeric.py", line 492, in asarray return array(a, dtype, copy=False, order=order) ValueError: setting an array element with a sequence. Process finished with exit code 1

This may be related to #108 which notes a problem with different input data type- I can't tell, as there's no stack trace given there so not sure if they have the same issue.

scikit-learn-contrib / hdbscan