scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Usage with images #3

Closed: varadgunjal closed this issue 8 years ago

varadgunjal commented 8 years ago

I tried using the implementation as-is for working with images, the way I was doing it in sklearn with KMeans / MiniBatchKMeans / MeanShift clustering, but I consistently run into a MemoryError (even for images as small as 200x200). Here is a code sample:

import cv2
import numpy as np
import hdbscan

# Load the image, convert OpenCV's default BGR ordering to RGB, and
# flatten to an (n_pixels, 3) array of colour vectors.
image = cv2.imread('/home/ubuntu/x.jpg')
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
image = image.reshape((image.shape[0] * image.shape[1], 3))

clusterer = hdbscan.HDBSCAN(min_cluster_size=100)
cluster_labels = clusterer.fit_predict(image)

Error stack trace:

MemoryError                               Traceback (most recent call last)
<ipython-input-12-b887e72bd6d5> in <module>()
----> 1 cluster_labels = clusterer.fit_predict(image)

/usr/local/lib/python2.7/dist-packages/hdbscan-0.1-py2.7-linux-x86_64.egg/hdbscan/hdbscan_.pyc in fit_predict(self, X, y)
    338             cluster labels
    339         """
--> 340         self.fit(X)
    341         return self.labels_
    342 

/usr/local/lib/python2.7/dist-packages/hdbscan-0.1-py2.7-linux-x86_64.egg/hdbscan/hdbscan_.pyc in fit(self, X, y)
    320          self._condensed_tree,
    321          self._single_linkage_tree,
--> 322          self._min_spanning_tree) = hdbscan(X, **self.get_params())
    323         return self
    324 

/usr/local/lib/python2.7/dist-packages/hdbscan-0.1-py2.7-linux-x86_64.egg/hdbscan/hdbscan_.pyc in hdbscan(X, min_cluster_size, min_samples, metric, p, algorithm)
    235     else:
    236         return _hdbscan_large_kdtree(X, min_cluster_size, 
--> 237                                      min_samples, metric, p)
    238 
    239 class HDBSCAN(BaseEstimator, ClusterMixin):

/usr/local/lib/python2.7/dist-packages/hdbscan-0.1-py2.7-linux-x86_64.egg/hdbscan/hdbscan_.pyc in _hdbscan_large_kdtree(X, min_cluster_size, min_samples, metric, p)
    107         p = 2
    108 
--> 109     mutual_reachability_ = kdtree_pdist_mutual_reachability(X, metric, p, min_samples)
    110 
    111     min_spanning_tree = mst_linkage_core_pdist(mutual_reachability_)

/home/vg/.python-eggs/hdbscan-0.1-py2.7-linux-x86_64.egg-tmp/hdbscan/_hdbscan_reachability.so in hdbscan._hdbscan_reachability.kdtree_pdist_mutual_reachability (hdbscan/_hdbscan_reachability.c:2820)()

/home/vg/.python-eggs/hdbscan-0.1-py2.7-linux-x86_64.egg-tmp/hdbscan/_hdbscan_reachability.so in hdbscan._hdbscan_reachability.kdtree_pdist_mutual_reachability (hdbscan/_hdbscan_reachability.c:2432)()

/usr/lib/python2.7/dist-packages/scipy/spatial/distance.pyc in pdist(X, metric, p, w, V, VI)
   1174 
   1175     m, n = s
-> 1176     dm = np.zeros((m * (m - 1)) // 2, dtype=np.double)
   1177 
   1178     wmink_names = ['wminkowski', 'wmi', 'wm', 'wpnorm']

MemoryError:

Any obvious problem with the code? Or is this to be expected?

lmcinnes commented 8 years ago

There's no obvious problem with what you've done. The thing to keep in mind is that even a 200x200 image reshapes to 40000 points, and the algorithm needs to generate an all-pairs distance matrix over them (or at least the upper-triangular portion thereof), which has roughly 800000000 entries. You're simply running out of RAM to store that array. KMeans and DBSCAN implementations use methods (such as kd-trees) to avoid computing and storing all those distances; unfortunately the current implementation of HDBSCAN does not support that, and it requires some different approaches to make that happen. It is on the theoretical roadmap, but I'm not there yet. If you can run on a box with more RAM it will work; I have successfully clustered larger datasets (up to 128000 points in testing), but I was using a large-memory SMP to do it.
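To make the scale concrete, here is a quick back-of-envelope calculation (just arithmetic, not library code) of the allocation that fails in the traceback above:

# scipy's pdist allocates the condensed distance matrix up front:
# m * (m - 1) / 2 float64 entries.
m = 200 * 200                      # a 200x200 image -> 40000 points
entries = m * (m - 1) // 2         # ~8 * 10**8 pairwise distances
print(entries * 8 / 1024.0 ** 3)   # ~6.0 GB just to hold the distances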

varadgunjal commented 8 years ago

Thought as much. I'll try on a machine with more RAM. Out of curiosity, when and how do you think you'll work in the kd-tree implementation?

Another question: is there any way to access cluster centers? A related KMeans example is below, with the relevant line marked:

import cv2
import numpy as np
from sklearn.cluster import KMeans

image = cv2.imread('/home/ubuntu/x.jpg')
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
image = image.reshape((image.shape[0] * image.shape[1], 3))

clt = KMeans(n_clusters=4)
clt.fit(image)

centers = clt.cluster_centers_  # <-- this is what I'd like from HDBSCAN

jc-healy commented 8 years ago

Glad you're interested. The clusters themselves don't have centers. We don't make the Gaussian-ball assumption that k-means does, so the idea of a cluster center is poorly defined: for many cluster shapes, such as long snake-like ones, most notions of a center would be a misleading descriptor, and in higher dimensions the cluster shapes can be difficult enough to distinguish that this becomes even more problematic. What are you hoping to use the centers for? I personally prefer to sample points from the cluster itself to get a good feel for what the cluster represents.

Regarding the kd-trees, we'll probably get to those in the next few months.

Cheers, John


varadgunjal commented 8 years ago

That makes a lot of sense. I was hoping to use the centroids, as you guessed, as descriptors of the clusters. But I agree, sampling points makes more sense and would actually improve what I'm trying to do: I'm using this clustering to figure out the 'prominent' or most representative colors in an image. What I was doing earlier was using sklearn's MeanShift to get a good estimate of the number of representative clusters, then KMeans / MiniBatchKMeans to cluster accordingly, and then using the centroids as the representative colors of the image. However, the averaging was actually not doing a good job of recovering the actual colors and was giving everything a greyish tinge.

Would you have any suggestions on how I should sample the clusters I'd get from, say, HDBSCAN, so as to get good representations?

lmcinnes commented 8 years ago

How to sample clusters really depends on the cluster size. If the clusters are small enough you can just take all the points; otherwise any sort of uniform random sampling from the members of a cluster would be fine (see the sketch below). We are looking to implement some form of soft clustering in the near future, so once that's done you could use it to bias your sample.
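For illustration, a minimal sketch of the uniform sampling (hypothetical names, reusing image and cluster_labels from the earlier code):

import numpy as np

# Uniformly sample up to 50 member points of cluster 0 to get a feel
# for what that cluster represents.
members = image[cluster_labels == 0]
idx = np.random.choice(len(members), min(50, len(members)), replace=False)
sample = members[idx]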

For what you're doing, a reasonable approach may just be to look at the mean and standard deviation of the colours in each cluster. HDBSCAN should (and in our experience does) give relatively pure clusters a lot of the time, so provided the standard deviation is low you can simply use the mean colours. The other thing to note is that you probably want to transform the colour space before taking averages; look into gamma correction and the transfer curves involved. There's a lot of useful colour theory out there on how to 'average' colours properly, and it can help avoid the "washed out gray" result.
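As an illustration of both points, a sketch of per-cluster colour statistics computed in (approximately) linear light; the 2.2 exponent is a common approximation of the sRGB transfer curve, not its exact piecewise form:

import numpy as np

# Per-cluster mean and standard deviation of colours. Assumes `image` is
# the (n_pixels, 3) uint8 array and `cluster_labels` the fitted labels
# from the earlier code; HDBSCAN labels noise points -1.
linear = (image.astype(np.float64) / 255.0) ** 2.2
for label in np.unique(cluster_labels):
    if label == -1:
        continue
    members = linear[cluster_labels == label]
    mean_colour = 255.0 * members.mean(axis=0) ** (1.0 / 2.2)
    std_colour = members.std(axis=0)
    print(label, mean_colour, std_colour)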

varadgunjal commented 8 years ago

I did try DBSCAN, but (correct me if I'm wrong) I found it works better when the density / size of all expected clusters is similar, which isn't the case in my data; or so I believe, because the results were less than spectacular.

I found MeanShift gave a better estimate of the clusters, which already reduced the graying effect significantly, and I'm already using the CIELAB color space to be in the clear. But while I wait on your soft clustering implementation, any chance you can refer me to the gamma correction / transformation routines you alluded to? (papers / code)

lmcinnes commented 8 years ago

If you're doing everything in CIELAB then you're probably fine. Soft clustering was just committed -- hopefully it works as expected. Let me know if you run into any issues.
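For reference, a minimal sketch of consuming the soft clustering, assuming the present-day hdbscan API (prediction_data and all_points_membership_vectors; the interface at the time of this exchange may have differed):

import hdbscan

# prediction_data=True is required before membership vectors can be computed.
clusterer = hdbscan.HDBSCAN(min_cluster_size=100, prediction_data=True).fit(image)
soft = hdbscan.all_points_membership_vectors(clusterer)  # (n_points, n_clusters)
weights = soft[:, 0]  # membership strength in cluster 0, usable to bias sampling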

lmcinnes commented 8 years ago

I've just committed changes that support a low-memory approach to large datasets. It is not the full work that gets the asymptotic complexity down, but it should allow you to run on large datasets without running into memory errors.

varadgunjal commented 8 years ago

Oh cool. Will check it out and report back with feedback.

varadgunjal commented 8 years ago

So I tested with the same code as before, using a 300x300 image, and still run into a MemoryError.

Stack trace (more or less the same as before, but posting it anyway for continuity):

MemoryError                               Traceback (most recent call last)
<ipython-input-10-b887e72bd6d5> in <module>()
----> 1 cluster_labels = clusterer.fit_predict(image)

/usr/local/lib/python2.7/dist-packages/hdbscan/hdbscan_.pyc in fit_predict(self, X, y)
    445             cluster labels
    446         """
--> 447         self.fit(X)
    448         return self.labels_
    449 

/usr/local/lib/python2.7/dist-packages/hdbscan/hdbscan_.pyc in fit(self, X, y)
    427          self._condensed_tree,
    428          self._single_linkage_tree,
--> 429          self._min_spanning_tree) = hdbscan(X, **self.get_params())
    430         return self
    431 

/usr/local/lib/python2.7/dist-packages/hdbscan/hdbscan_.pyc in hdbscan(X, min_cluster_size, min_samples, alpha, metric, p, algorithm, gen_min_span_tree)
    314         return _hdbscan_large_kdtree(X, min_cluster_size,
    315                                      min_samples, alpha, metric,
--> 316                                      p, gen_min_span_tree)
    317 
    318 

/usr/local/lib/python2.7/dist-packages/hdbscan/hdbscan_.pyc in _hdbscan_large_kdtree(X, min_cluster_size, min_samples, alpha, metric, p, gen_min_span_tree)
    136         p = 2
    137 
--> 138     mutual_reachability_ = kdtree_pdist_mutual_reachability(X, metric, p, min_samples, alpha)
    139 
    140     min_spanning_tree = mst_linkage_core(mutual_reachability_)

/usr/local/lib/python2.7/dist-packages/hdbscan/_hdbscan_reachability.so in hdbscan._hdbscan_reachability.kdtree_pdist_mutual_reachability (hdbscan/_hdbscan_reachability.c:2934)()

/usr/local/lib/python2.7/dist-packages/hdbscan/_hdbscan_reachability.so in hdbscan._hdbscan_reachability.kdtree_pdist_mutual_reachability (hdbscan/_hdbscan_reachability.c:2578)()

/usr/lib/python2.7/dist-packages/scipy/spatial/distance.pyc in pdist(X, metric, p, w, V, VI)
   1174 
   1175     m, n = s
-> 1176     dm = np.zeros((m * (m - 1)) // 2, dtype=np.double)
   1177 
   1178     wmink_names = ['wminkowski', 'wmi', 'wm', 'wpnorm']

MemoryError: 

Let me know if I need to change something to work with hdbscan 0.2. I can share the image I used, if you would like it for testing.

lmcinnes commented 8 years ago

Hmm, that's disappointing -- it doesn't seem to be using the appropriate code path that would spare you the problem. Try running

hdbscan.hdbscan(image, algorithm='large_kdtree_low_memory')

to force it to use that approach and let me know what happens.

lmcinnes commented 8 years ago

O(N log N) is looking promising and should be arriving on GitHub in a week or two; hopefully I can push that out to PyPI for pip installs some time not too long after that. I'll keep you posted.

lmcinnes commented 8 years ago

The latest release should scale to very large datasets, particularly for low dimensional data (such as your 3-dimensional colour vectors). I've successfully clustered 1,000,000-point datasets on my laptop in 20 minutes or so, which should handle 1000x1000 images ... if you're willing to wait.

varadgunjal commented 8 years ago

Sorry to bother you again on this. I had shifted to using feature vector representations and your implementation was performing remarkably well there. However, I'm now back to requiring pixel color values.

So I tried the code on a 200x200 image with a min_cluster_size of 100. No memory errors this time, and it ran pretty fast as well. However, when I check the number of clusters I get 40k; basically, for every image I tried, the number of clusters equals the number of pixels in the image. Based on my experience with simple vectors, hdbscan always found structural clusters for me, and I was hoping it would cluster related colors together here.

Am I doing something wrong?

jc-healy commented 8 years ago

Hi,

I'm not sure off the top of my head what might cause that. Want to send me the code you're running and an example image? I can see if I can reproduce the problem and try to track down what might be causing it.

Out of curiosity, what distance function are you using?

Cheers, John

lmcinnes commented 8 years ago

Have you had much luck with this, John? I just got some time and tried playing with images, and at least the dev version seems to be working fine for me ... I'll have to check whether it works on some other systems with a stock version.

jc-healy commented 8 years ago

Not yet; I was hoping to get the image he was working on in order to better track down the problem.

Cheers, John

varadgunjal commented 8 years ago

Here's my code (no different from before):

import cv2
import numpy as np
import hdbscan

image = cv2.imread('/home/ubuntu/nikeair_3.jpg')
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
image = image.reshape((image.shape[0] * image.shape[1], 3))

clusterer = hdbscan.HDBSCAN(min_cluster_size=100)
cluster_labels = clusterer.fit_predict(image)

print len(cluster_labels)

The output I get for the number of clusters is 65536 for the attached 256x256 image. I see a similar issue when using a more appropriate color space like Lab. I hope there is something obviously wrong with my usage here, but a similarly simple usage was enough for clustering vectors, so I didn't change much.

[attached image: nikeair_3.jpg]

jc-healy commented 8 years ago

Hi there,

I do notice one small difference between this code and the code in the first email: it looks like you forgot to include a unique call in your print statement.

You have print(len(cluster_labels)), which prints the length of the label vector - that is, the number of data points you labeled, not the number of clusters. I tend to use cluster_labels.value_counts(), but just calling unique before len would work too (see the sketch below).
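Concretely, a minimal sketch of counting clusters rather than points:

import numpy as np

# len(cluster_labels) is the number of points; the number of clusters is
# the number of distinct labels, excluding HDBSCAN's -1 noise label.
labels = np.unique(cluster_labels)
print(len(labels[labels != -1]))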

Let us know if that clears up the problem.

cheers, John


lmcinnes commented 8 years ago

I assume you mean pd.Series(labels).value_counts() after having imported pandas as pd.
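Spelled out as a runnable sketch:

import pandas as pd

# Points per cluster label; a -1 row, if present, counts noise points.
print(pd.Series(cluster_labels).value_counts())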


jc-healy commented 8 years ago

Hi Leland,

Yep, sorry, I was typing Python without a notebook or tab completion. ;-)

Cheers, John
