Closed: varadgunjal closed this issue 8 years ago
There's no obvious problem with what you've done. The thing to keep in mind is that even with a 200x200 image you've reshaped to get 40000 points, and the algorithm needs to generate an all-pairs distance matrix over those (or, at least, the upper triangular portion thereof), which has roughly 800000000 entries. The result is that you're simply running out of RAM to store that array. KMeans and DBSCAN implementations use methods (such as kd-trees) to avoid having to compute and store all those distances; unfortunately the current implementation of HDBSCAN does not support that (and requires some different approaches to make that happen). It is on the theoretical roadmap, but I'm not there yet. If you can run on a box with more RAM it will work; I have successfully clustered larger datasets (up to 128000 points in testing), but I was using a large-memory SMP machine to do it.
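As a rough back-of-the-envelope check (a sketch of the arithmetic above, not part of the library), you can compute the size of that condensed distance matrix directly:

```python
# Rough estimate of the memory needed for an all-pairs condensed
# distance matrix, as scipy's pdist allocates it (float64 entries).

def pdist_memory_bytes(n_points, bytes_per_entry=8):
    """Bytes needed for the upper-triangular (condensed) distance matrix."""
    n_entries = n_points * (n_points - 1) // 2
    return n_entries * bytes_per_entry

# A 200x200 image reshaped to pixel rows gives 40000 points.
n = 200 * 200
entries = n * (n - 1) // 2               # ~8e8 entries
gigabytes = pdist_memory_bytes(n) / 1e9

print(entries)                # 799980000
print(round(gigabytes, 1))    # ~6.4 GB of RAM just for the distances
```

At roughly 6.4 GB for the distances alone (before any of the clustering data structures), it's easy to see how a modest box runs out of memory.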
Thought as much. Will try on more RAM. Out of curiosity, when and how do you think you will work in the kdtrees implementation?
Another question - any way to access cluster centers? Related KMeans example below with highlights -
import cv2
import numpy as np
from sklearn.cluster import KMeans
image = cv2.imread('/home/ubuntu/x.jpg')
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
image = image.reshape((image.shape[0] * image.shape[1], 3))
clt = KMeans(n_clusters = 4)
clt.fit(image)
centers = clt.cluster_centers_ # this
Glad you're interested. The clusters themselves don't have centers. We don't make the Gaussian ball assumption like KMeans, so the idea of a cluster center would be poorly defined. For many clusters, such as long snake-like ones, most concepts of a center would be a misleading descriptor of the cluster. Further, in higher dimensions the cluster shapes can be quite difficult to characterize, making this more problematic. What are you hoping to use the centers for? I personally prefer to sample points from the cluster itself to get a good feel for what the cluster represents.
Regarding the kd-trees, we'll probably get to those in the next few months.
Cheers John
That makes a lot of sense. I was hoping to use the centroids, as you mentioned, as a descriptor of the cluster. But I agree, sampling points makes more sense and would actually improve what I'm trying to do: I'm trying to use this clustering to figure out the 'prominent' or most representative colors in an image. So what I was doing earlier was using sklearn's MeanShift to get a good prediction of the number of representative clusters, and then KMeans / MiniBatchKMeans to cluster accordingly, using the centroids as the representative colors in the image. However, the averaging was actually not doing a good job of capturing the actual colors and was giving everything a greyish tinge.
Would you have any suggestions on how I should sample the cluster I would get, say using HDBSCAN, so as to get good representations?
How to sample clusters really depends upon the cluster size. If you have small enough clusters you can just take all the points; otherwise any sort of uniform random sampling from members of the clusters would be okay. We are looking to implement some semblance of soft clustering in the near future, so you could use that to bias your sample once we get that done.
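A minimal sketch of that uniform sampling (plain Python; `labels` stands in for the array `fit_predict` returns, and the function name is just illustrative):

```python
import random
from collections import defaultdict

def sample_clusters(labels, k=10, seed=0):
    """Uniformly sample up to k member indices from each cluster.

    Noise points (label -1) are skipped; clusters with at most k members
    are returned whole, as suggested above.
    """
    rng = random.Random(seed)
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        if label != -1:                  # -1 is HDBSCAN's noise label
            members[label].append(idx)
    return {label: idxs if len(idxs) <= k else rng.sample(idxs, k)
            for label, idxs in members.items()}

labels = [0, 0, 0, 1, 1, -1, 1, 0]       # toy labelling
samples = sample_clusters(labels, k=2)
print(sorted(samples))                    # [0, 1] -- one entry per cluster
```

Once soft clustering lands, the uniform `rng.sample` call is the natural place to swap in membership-weighted sampling.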
For what you're doing a reasonable approach may just be to look at the mean and standard deviation of the colours in each cluster. DBSCAN should (and in our experience does) give relatively pure clusters a lot of the time, so presuming the standard deviation is low you can just use the mean colours. The other thing to note is that you probably want to transform the colour space, especially for taking averages, and look into gamma correction and the curves involved. There's a lot of useful colour theory out there on how to 'average' colours properly that can also help avoid the "washed out gray" result.
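That mean-and-standard-deviation check might be sketched with numpy like this (assuming `pixels` is the reshaped `(n_pixels, 3)` array and `labels` the cluster labels; the helper name is illustrative):

```python
import numpy as np

def cluster_color_stats(pixels, labels):
    """Per-cluster mean and standard deviation of the colour channels.

    A low std suggests a 'pure' cluster whose mean colour is a fair
    representative. Noise points (label -1) are ignored.
    """
    stats = {}
    for label in np.unique(labels):
        if label == -1:
            continue
        cluster = pixels[labels == label]
        stats[label] = (cluster.mean(axis=0), cluster.std(axis=0))
    return stats

pixels = np.array([[250, 10, 10], [240, 20, 10],    # reddish cluster
                   [10, 10, 250], [20, 10, 240]],   # bluish cluster
                  dtype=float)
labels = np.array([0, 0, 1, 1])
stats = cluster_color_stats(pixels, labels)
print(stats[0][0])   # mean colour of cluster 0: [245. 15. 10.]
```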
I did try using DBSCAN, but (correct me if I'm wrong) I found it works better if the density / size of all expected clusters is similar. Which isn't the case in my data - or so I believe, because the results were less than spectacular.
I found MeanShift gave a better estimate of clusters which significantly reduced the graying effect already. And I'm already using the CIELAB color space to be in the clear. But while I wait on your soft clustering implementation, any chance you can refer me to the gamma correction / transformation routines you alluded to? (papers / code)
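For reference, the standard sRGB transfer curves (these constants come from the sRGB specification, not from hdbscan) can be applied by hand; averaging in linear light rather than on gamma-encoded values is what avoids the washed-out grey:

```python
def srgb_to_linear(c):
    """Decode one sRGB channel value in [0, 1] to linear light."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def linear_to_srgb(c):
    """Encode linear light back to a gamma-encoded sRGB value."""
    return 12.92 * c if c <= 0.0031308 else 1.055 * c ** (1 / 2.4) - 0.055

def average_srgb(values):
    """Average gamma-encoded channel values correctly, in linear space."""
    linear = [srgb_to_linear(v) for v in values]
    return linear_to_srgb(sum(linear) / len(linear))

# Naively averaging gamma-encoded dark and light underestimates brightness:
naive = (0.0 + 1.0) / 2                  # 0.5 -- too dark
correct = average_srgb([0.0, 1.0])       # ~0.735 -- noticeably lighter
print(round(correct, 3))
```

Working in CIELAB (as you already are) sidesteps much of this, since Lab conversion already accounts for the sRGB gamma curve.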
If you're doing everything in CIELAB then you're probably fine. Soft clustering was just committed -- hopefully it works as expected. Let me know if you run into any issues.
I've just committed changes that support a low memory approach to large datasets. It is not the full work that gets the asymptotic down, but it should allow you to run on large dataset sizes without running into memory errors.
Oh cool. Will check it out and revert with feedback.
So, I tested with the same code as before, using a 300x300 image. Still ran into a MemoryError.
Stack Trace : (It's more or less the same as before, but posting it anyway for continuity)
MemoryError Traceback (most recent call last)
<ipython-input-10-b887e72bd6d5> in <module>()
----> 1 cluster_labels = clusterer.fit_predict(image)
/usr/local/lib/python2.7/dist-packages/hdbscan/hdbscan_.pyc in fit_predict(self, X, y)
445 cluster labels
446 """
--> 447 self.fit(X)
448 return self.labels_
449
/usr/local/lib/python2.7/dist-packages/hdbscan/hdbscan_.pyc in fit(self, X, y)
427 self._condensed_tree,
428 self._single_linkage_tree,
--> 429 self._min_spanning_tree) = hdbscan(X, **self.get_params())
430 return self
431
/usr/local/lib/python2.7/dist-packages/hdbscan/hdbscan_.pyc in hdbscan(X, min_cluster_size, min_samples, alpha, metric, p, algorithm, gen_min_span_tree)
314 return _hdbscan_large_kdtree(X, min_cluster_size,
315 min_samples, alpha, metric,
--> 316 p, gen_min_span_tree)
317
318
/usr/local/lib/python2.7/dist-packages/hdbscan/hdbscan_.pyc in _hdbscan_large_kdtree(X, min_cluster_size, min_samples, alpha, metric, p, gen_min_span_tree)
136 p = 2
137
--> 138 mutual_reachability_ = kdtree_pdist_mutual_reachability(X, metric, p, min_samples, alpha)
139
140 min_spanning_tree = mst_linkage_core(mutual_reachability_)
/usr/local/lib/python2.7/dist-packages/hdbscan/_hdbscan_reachability.so in hdbscan._hdbscan_reachability.kdtree_pdist_mutual_reachability (hdbscan/_hdbscan_reachability.c:2934)()
/usr/local/lib/python2.7/dist-packages/hdbscan/_hdbscan_reachability.so in hdbscan._hdbscan_reachability.kdtree_pdist_mutual_reachability (hdbscan/_hdbscan_reachability.c:2578)()
/usr/lib/python2.7/dist-packages/scipy/spatial/distance.pyc in pdist(X, metric, p, w, V, VI)
1174
1175 m, n = s
-> 1176 dm = np.zeros((m * (m - 1)) // 2, dtype=np.double)
1177
1178 wmink_names = ['wminkowski', 'wmi', 'wm', 'wpnorm']
MemoryError:
Let me know if I need to change something to work with hdbscan 0.2. I can share the image I used, if you would like to use for testing.
Hmm, that's disappointing -- it doesn't seem to be using the appropriate code path that would spare you the problem. Try running
hdbscan.hdbscan(image, algorithm='large_kdtree_low_memory')
to force it to use that approach and let me know what happens.
O(N log N) is looking promising and should be arriving on github in a week or two; hopefully I can push that out to PyPI for pip installs some time not too long after that. I'll keep you posted.
The latest release should scale to very large datasets, particularly for low dimensional data (such as your 3-dimensional colour data). I've successfully run 1000000-point datasets on my laptop in 20 minutes or so -- which should give you 1000x1000 images ... if you're willing to wait.
Sorry to bother you again on this. I had shifted to using feature vector representations and your implementation was performing remarkably well there. However, I'm now back to requiring pixel color values.
So I tried the code on a 200x200 image with a min_cluster_size of 100. No memory errors this time and worked pretty fast as well - however, when I check the number of clusters I get 40k. Basically, for every image I tried, the number of clusters is equal to the number of pixels in the image. Based on my experience with simple vectors, hdbscan always found structural clusters for me and I was hoping it would cluster relevant colors together here.
Am I doing something wrong?
Hi,
I'm not sure off the top of my head what might cause that. Want to send me the code you're running and an example image? I can see if I can reproduce the problem and try to track down what might be causing it.
Out of curiosity, what distance function are you using?
Cheers, John
Have you had much luck with this John? I just got some time and tried playing with images, and at least the dev version seems to be working fine for me ... I'll have to check if it works on some other systems with a stock version.
Not yet -- I was hoping to get the image he was working on in order to better track down the problem.
Cheers, John
Here's my code (no different from earlier) -
import cv2
import numpy as np
import hdbscan
image = cv2.imread('/home/ubuntu/nikeair_3.jpg')
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
image = image.reshape((image.shape[0] * image.shape[1], 3))
clusterer = hdbscan.HDBSCAN(min_cluster_size=100)
cluster_labels = clusterer.fit_predict(image)
print len(cluster_labels)
The output I get for the number of clusters is 65536 for the attached 256x256 image. Similar issue when using a more appropriate color space like Lab. I hope there is something obviously wrong with my usage here - but a similar simplistic usage was enough for clustering vectors so I didn't change much.
Hi there,
I do notice one small difference between this code and the code in the first email. It looks like you forgot to include the unique command in your print statement.
You have print(len(cluster_labels)), which prints the length of the label vector -- that is, the number of data points you labeled, not the number of clusters. I tend to use cluster_labels.value_counts(), but just calling unique before len would work too.
Let us know if that clears up the problem.
cheers, John
I assume you mean pd.Series(labels).value_counts() after having imported pandas as pd.
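In full, the corrected check might look like this (a sketch; the toy `labels` array stands in for the real output of `clusterer.fit_predict(image)`):

```python
import numpy as np

# fit_predict returns one label per pixel, so len() gives the pixel count.
# Count the distinct labels instead, excluding HDBSCAN's -1 noise label.
labels = np.array([0, 0, 1, 2, -1, 1, 0, 2])   # stand-in for real output

n_points = len(labels)                          # 8: one label per pixel
cluster_ids, sizes = np.unique(labels, return_counts=True)
n_clusters = len(set(cluster_ids.tolist()) - {-1})   # 3 actual clusters

print(n_points, n_clusters)
print(dict(zip(cluster_ids.tolist(), sizes.tolist())))  # per-cluster sizes
```

`np.unique(..., return_counts=True)` gives the same per-cluster counts as the `pd.Series(labels).value_counts()` suggestion, without the pandas dependency.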
Hi Leland,
Yep, sorry I was typing python without a notebook or tab completion. ;-)
Cheers, John
Tried using the implementation as-is for working with images, as was being done in sklearn with KMeans / MiniBatchKMeans / MeanShift clustering. But I consistently run into a MemoryError (even for images as small as 200x200). Here is a sample code -
Error stack trace:
Any obvious problem with the code? Or is this to be expected?