thclark opened this issue 6 years ago (Open)
I believe this is a nearest neighbor computation issue. The prediction data really needs to make use of nearest neighbors, and the easy way to do that is with space trees, which unfortunately do not accept sparse input. In practice this could be fixed by allowing computation of the full distance matrix for small enough datasets and extracting nearest neighbor data by argsorting it row by row. I'll look into this and see whether an easy fix is possible.
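A minimal sketch of the workaround described above (an assumption about how it might be implemented, not code from the library): compute the full pairwise distance matrix from the sparse input, then recover the k nearest neighbors of each point by argsorting each row.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.metrics import pairwise_distances

# Illustrative sparse input (stand-in sizes, not the reporter's data).
X = sparse_random(100, 20, density=0.1, format="csr", random_state=0)

# Full distance matrix: accepts sparse input, but needs O(n^2) memory,
# so this is only viable for small enough datasets.
D = pairwise_distances(X)

k = 5
order = np.argsort(D, axis=1)        # row-wise sort; column 0 is the point itself
nn_indices = order[:, 1:k + 1]       # indices of the k nearest neighbors
nn_dists = np.take_along_axis(D, nn_indices, axis=1)
```

Note the O(n^2) memory cost of `D` is exactly why this is framed as a fix "for small enough datasets" rather than a general replacement for space trees.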
First: Kudos to all the HDBSCAN devs for an utterly gorgeous and extremely powerful library.
TL;DR

`HDBSCAN` runs successfully on both sparse and dense matrix input, but fails with sparse input when the `prediction_data=True` flag is supplied to the clusterer.

Background and use case
I've been using `HDBSCAN` to cluster related strings of text. My test case matrix `X` has shape `(8535, 1820)` and is highly sparse, so it is presently solvable as either a dense or sparse matrix. The following code, executed in serial on my laptop...
...runs in 22 seconds on a sparse matrix and 181 seconds (plus the time to construct the dense matrix) on a dense one, so there's a clear use case for sparse inputs. Additionally, the memory requirement for dense `X` scales as O(N^2), so the dense solution does not scale to large datasets.

Reproduction
The resulting stack trace is as follows:
Related

This may be related to #108, which notes a problem with a different input data type. I can't tell for certain, as no stack trace is given there, so I'm not sure whether it is the same issue.