jenniferjang commented 4 years ago

Describe the workflow you want to enable

Currently, DBSCAN is very slow for large datasets and can use a lot of memory, especially in higher dimensions.

For example, running sklearn.cluster.DBSCAN on a size 745k dataset with 411 features took over 6 days to finish running on a powerful cloud machine, and running on a size ~1M dataset with a reasonable setting of epsilon for that dataset ran out of memory on that machine with over 750GB RAM.

In a very recent work I did with Google Research [1], we found that you can get over 200x speedup and 250x savings in memory based on a minor tweak to the DBSCAN algorithm without sacrificing clustering accuracy.

[1] https://arxiv.org/abs/2006.06743

Describe your proposed solution

The solution is based on a simple modification to the brute force version. Instead of calculating the pairwise distances for all pairs of examples to obtain the nearest-neighbor graph, you can instead only calculate it for a fraction s of the pairs (selected randomly) and clusters based on that subsampled graph. This would make the algorithm feasible for larger datasets, without hurting the clustering quality (in some cases s = 0.001 suffices). The idea is that you don’t actually need the full nearest-neighbor graph to get the desired clustering.

All comparisons in [1] were done against the algorithm=’auto’ setting of sklearn.cluster.DBSCAN.

The only new parameter needed would be 0 < s <= 1, the sampling ratio which would only be applicable for the algorithm=’brute-force’ setting. (Perhaps we might also want a random seed parameter if one wants to be able to have reproducible runs).

Describe alternatives you've considered, if relevant

Additional context

jnothman commented 4 years ago

We have already for several years supported downsampling the data and applying sample_weight, or constricting an approximate neighbourhoods matrix and passing it in precomputed. Your solution would be a more explicit interface and likely much more user friendly. It would not allow us to reuse our paired distances or nearest neighbours code, however.

jenniferjang commented 4 years ago

Thanks for the speedy response!

One way to implement this is to have a function that takes in the feature matrix, eps, and sampling rate and outputs a sparse subsampled eps-neighborhood matrix. This function can then be hooked up to DBSCAN's precomputed codepath.

Even if this function is not integrated into DBSCAN, we could mention in the documentation that such an option is available. It would open up use cases for DBSCAN that were previously unavailable for larger datasets.

Computing a subsampled epsilon-neighborhood matrix is costly in pure Python but quick in Cython (as a built-in sklearn function or integrated in DBSCAN) and would make the experience better for the user. It should be simple to implement and hopefully unlikely to have high maintenance cost and would be useful for anyone looking to use DBSCAN on large datasets.

Let me know if any of this sounds unreasonable.

jnothman commented 4 years ago

It sounds reasonable to me. I'd even consider including the subsampling direct in dbscan, although the user would benefit from be being able to diagnose and reuse it. Really, subsampled pairs is just a naive approximate nearest neighbours approach.

And I'm not sure this needs implementation in cython for efficiency: our paired_distances function goes most of the way, but only for a handful of metrics... and yes might be memory inefficient in this case.

My intuition is that you would need a fairly large s if your clusters are many in number or not nearly Gaussian.

jenniferjang commented 4 years ago

Thanks for your response. Regarding the intuition about s: we found that on large (>1M points) datasets, we could choose s as small as 0.001 to 0.02 and still have comparable or better performance to DBSCAN. As for smaller datasets, we were able to use s between 0.01 and 0.3. One reason why we're both theoretically and empirically able to use a small s is that DBSCAN finds the connected components in the neighborhood graph of samples in high density regions, and in those high density regions, there is high connectivity in the graph. The higher the connectivity, the more edges we can subsample while still keeping the components connected.

I can start working on implementing the subsampling directly in DBSCAN as a hyperparameter. Depending on what you guys think, we can move the subsampling to a separate function that outputs the sparse matrix which can be consumed by DBSCAN via the precomputed option.

Also happy to discuss further.

thomasjpfan commented 4 years ago

Having a separate approximate nearest neighbors transformer may be useful for other estimators such as TSNE. An example of using annoy for approximate nearest neighbors can be seen in exmples/neighbors/approximate_nearest_neighbors.py . Furthermore, It could also benefit hdbscan since they also accept precomputed metrics.

jenniferjang commented 4 years ago

This sounds good. Breaking down into two steps:

Implement subsampled nearest neighbors function as a separate function
Integrate into DBSCAN

I’ll start working on the first step and we can revisit the second step afterwards.

Thanks!

jnothman commented 4 years ago

sklearn.neighbors.SubsampledNeighborsTransformer? Sounds interesting!

Hemantr05 commented 1 year ago

Hi, I face a similar issue - DBSCAN is too slow for large datasets (~700k), even on a high compute machine. Just wondered if the solution referred by @jenniferjang has been integrated with scikit-learn package yet?