kmdouglass / bstore

Lightweight data management and analysis tools for single-molecule fluorescence microscopy.

Slow down caused by DBSCAN #3

Open kmdouglass opened 8 years ago

kmdouglass commented 8 years ago

When the fiducial tracks become very large, DBSCAN starts to consume A LOT of memory (it consumed all 48 GB on my machine earlier). This is with a neighbor radius of 500 and a minimum number of samples of about 35,000. The problem is most prevalent for long, consistent, uninterrupted tracks.

kmdouglass commented 8 years ago

I just reproduced this behavior with five fiducial tracks, each with about 20,000 frames, on the Olympus computer. It filled up all 16 GB of memory and didn't complete.

Three tracks worked, filling only 12 GB and taking 10 to 20 seconds.

Perhaps I can implement ELKI's version of DBSCAN.
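
A rough back-of-the-envelope estimate may help explain the five-track and three-track runs above. The sketch below assumes that the DBSCAN implementation stores a neighbor list for every point up front (the scikit-learn documentation describes its implementation as bulk-computing all neighborhood queries), that every localization in a track falls within the neighbor radius of every other localization in the same track, and that each stored neighbor index takes 8 bytes. The function name and the 8-byte figure are assumptions made for illustration, not measurements.

```python
def neighbor_list_bytes(n_tracks, frames_per_track, bytes_per_index=8):
    """Estimate the memory needed to store every point's neighbor list.

    Assumes each localization in a track is a neighbor of every other
    localization in that same track (itself included) and of nothing else.
    """
    # Each of the frames_per_track points stores frames_per_track neighbors.
    entries = n_tracks * frames_per_track ** 2
    return entries * bytes_per_index


for n_tracks in (3, 5):
    gigabytes = neighbor_list_bytes(n_tracks, 20_000) / 1e9
    print(f"{n_tracks} tracks: ~{gigabytes:.1f} GB")
```

Under these assumptions the estimate is about 9.6 GB for three tracks and 16.0 GB for five, which is in the same range as the memory use reported above. The agreement could be partly coincidental, but it suggests that memory grows with the square of the track length when the radius covers a whole track.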

kmdouglass commented 8 years ago

@nberliner recommended trying HDBSCAN as a high performance implementation.

nberliner commented 8 years ago

I'm not sure about the memory consumption. It is apparently a bit faster than the sklearn DBSCAN implementation (see here). Interestingly, that comparison suggests that the sklearn DBSCAN implementation can cluster 200,000 points on a laptop with 8 GB of memory.

One advantage of HDBSCAN is that it dynamically selects a suitable density for clustering, which can vary for each cluster in the field of view. There is only one main parameter that must be set by the user: the minimum cluster size. I found the description given on the project page very good (see here).

kmdouglass commented 7 years ago

See this discussion: https://github.com/scikit-learn/scikit-learn/issues/5275

Also note discussion on DBSCAN and memory usage here: http://scikit-learn.org/stable/modules/clustering.html#dbscan

It seems that if the neighborhood radius is made too large, the memory consumption blows up. I noticed this when I recently tried to cluster a dataset that was in units of pixels instead of nanometers. Setting the neighborhood radius to 50 put nearly every point inside every other point's neighborhood and ate up all my memory. Resetting it to 0.5 pixels worked without much memory consumption.
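
The effect of the radius can be illustrated with a small, self-contained sketch. It uses hypothetical data (500 points spread uniformly over a 32 x 32 pixel field, not a real localization dataset) and a brute-force neighbor count rather than scikit-learn itself, so it only shows how the number of stored neighbors per point changes with the radius.

```python
import random


def mean_neighbors(points, eps):
    """Average number of points within distance eps of each point (self included)."""
    eps_squared = eps * eps
    total = 0
    for x1, y1 in points:
        for x2, y2 in points:
            if (x1 - x2) ** 2 + (y1 - y2) ** 2 <= eps_squared:
                total += 1
    return total / len(points)


rng = random.Random(0)
# Hypothetical data: 500 localizations in a 32 x 32 pixel field of view.
points = [(rng.uniform(0, 32), rng.uniform(0, 32)) for _ in range(500)]

small = mean_neighbors(points, 0.5)  # radius in pixels
large = mean_neighbors(points, 50)   # radius larger than the whole field

print(f"eps = 0.5 px: {small:.2f} neighbors per point on average")
print(f"eps = 50 px:  {large:.0f} neighbors per point on average")
```

With a radius of 50 pixels, every point is a neighbor of every other point, so the total number of stored neighbors scales as the square of the number of points. With a radius of 0.5 pixels, each neighborhood holds only a handful of points and the total stays roughly linear in the dataset size.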