scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
2.81k stars 504 forks source link

prediction_data flag fails if used with input sparse matrix #174

Open thclark opened 6 years ago

thclark commented 6 years ago

First: Kudos to all the HDBSCAN devs for an utterly gorgeous and extremely powerful library.

TL;DR

HDBSCAN runs successfully on sparse and dense matrix input, but fails with sparse input where the prediction_data=True flag is supplied to the clusterer.

Background and use case

I've been using HDBSCAN to cluster related strings of text. My test case matrix sizeX.shape is (8535 x 1820), with a highly sparse X, so this is presently solvable as either a dense or sparse matrix.

The following code, executed in serial on my laptop...

start_time = time.time()
clusterer = HDBSCAN()
clusterer.fit(X)
end_time = time.time()
logging.debug('Computed clustering in %f seconds', end_time - start_time)

...runs in 22 seconds on a sparse matrix and 181 seconds (+time to construct dense) on a dense one... so there's a clear use case for sparse inputs. Additionally, the memory requirement for dense X scales as O(N^2) so dense solution is not scalable to large datasets.

Reproduction


# Construct a large sparse X
# X.__class__.__name__ should be 'sparse.csr_matrix'
X = build_your_X()

logging.debug('Computing HDBSCAN clustering for %d files, with %d features in X', *X.shape)

# Compute HDBSCAN on the input sparse
clusterer = HDBSCAN()
clusterer.fit(X) # Works successfully

# Compute HDBSCAN on the input dense, with prediction data
clusterer_pd_dense = HDBSCAN(prediction_data=True)
X_dense = X.to_dense()
clusterer_pd_dense.fit(X_dense) # Works successfully

# Compute HDBSCAN on the input sparse, with prediction data
clusterer_pd_sparse= HDBSCAN(prediction_data=True)
clusterer_pd_sparse.fit(X) # !!! FAILS !!!

The resulting stack trace is as follows:

DEBUG:root:Computing HDBSCAN clustering for 8535 files, with 1820 features in X Traceback (most recent call last): File "anon_path/scrap.py", line 133, in clusterer_pd_sparse.fit(X) File "anonpath/venv/lib/python3.5/site-packages/hdbscan/hdbscan.py", line 819, in fit self.generate_prediction_data() File "anonpath/venv/lib/python3.5/site-packages/hdbscan/hdbscan.py", line 861, in generate_prediction_data self._metric_kwargs File "anon_path/venv/lib/python3.5/site-packages/hdbscan/prediction.py", line 102, in init metric=metric, kwargs) File "sklearn/neighbors/binary_tree.pxi", line 1054, in sklearn.neighbors.kd_tree.BinaryTree.init File "anon_path/venv/lib/python3.5/site-packages/numpy/core/numeric.py", line 492, in asarray return array(a, dtype, copy=False, order=order) ValueError: setting an array element with a sequence. Process finished with exit code 1

Related

This may be related to #108 which notes a problem with different input data type- I can't tell, as there's no stack trace given there so not sure if they have the same issue.

lmcinnes commented 6 years ago

I believe this is a nearest neighbor computation issue. The prediction data really needs to make use of nearest neighbors, and the easy way to do that is with space trees. These, unfortunately, do not accept sparse input. In practice this could be fixed by allowing computation of the full distance matrix for small enough datasets and getting nearest neighbor data from argsorting that by rows. I'll look into this and see if there is an easy fix possible.