The results being wrong is bad. I will take a look.
The kmeans here is probably optimized for vectors of samples, rather than points of single numbers.
Regardless, it is not considered optimal code. Unsup is still treated as an experimental repo.
I can look at optimizing it further if needed.
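For illustration (this is not the benchmark code from the issue): points that are single numbers would be passed as an n x 1 matrix of samples. The sketch below assumes the unsup.kmeans(x, k, niter, batchsize, callback, verbose) signature visible in the diff further down, with the optional arguments left to their defaults and the return values assumed.

require 'unsup'

-- Hypothetical example: 1-D points reshaped into an n x 1 matrix, one sample per row.
local points = {1.1, 1.2, 0.9, 7.8, 8.1, 8.3}
local x = torch.Tensor(points):view(-1, 1)

local k, niter = 2, 10
local centroids, counts = unsup.kmeans(x, k, niter)   -- return values assumed
print(centroids)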
Thank you. It is entirely possible that I am doing something wrong, but I cannot find what.
@andreaferretti One thing I notice is that your implementation is k-means with Euclidean distance, while this implementation is spherical k-means. You can refer to pages 16-17 here: http://www.cs.stanford.edu/~acoates/papers/acoates_thesis.pdf. Spherical k-means is mostly used to train filters for networks.
I see. I am still not sure, though. If I understand correctly, spherical K-means tries to locate centroids on the unit sphere while minimizing a certain functional, and the centroids that I obtain are certainly not of unit size. So, there may still be something wrong.
Moreover, I think the naming choice is misleading. I guess that spherical K-means has its place in the context of neural networks - I am not expert enough to tell - but outside this context, K-means is quite an established name. Maybe it would make sense to rename kmeans to skmeans and implement a standard (Euclidean) k-means as kmeans?
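To make the distinction concrete, here is a rough sketch in plain Torch Lua (illustrative shapes and names, none of this is the unsup code) of the two assignment steps: Euclidean k-means sends each sample to the nearest centroid, while spherical k-means keeps centroids at unit norm and assigns each sample to the centroid with the largest dot product.

require 'torch'

-- n samples of dimension d, k centroids (random, illustrative data)
local n, d, k = 100, 5, 3
local x = torch.randn(n, d)
local centroids = torch.randn(k, d)

-- Euclidean assignment: argmin_j ||x_i - c_j||^2, computed as
-- argmax_j ( <c_j, x_i> - 0.5*||c_j||^2 ) since ||x_i||^2 is constant in j
local c2 = torch.pow(centroids, 2):sum(2):mul(0.5)                        -- k x 1
local score = torch.addmm(-1, torch.expand(c2, k, n), centroids, x:t())  -- k x n
local _, euclideanLabels = torch.max(score, 1)                            -- 1 x n

-- Spherical assignment: renormalize centroids to unit norm, then argmax_j <c_j, x_i>
local norms = torch.norm(centroids, 2, 2):add(1e-10)                      -- k x 1
local unitCentroids = torch.cdiv(centroids, torch.expand(norms, k, d))
local _, sphericalLabels = torch.max(unitCentroids * x:t(), 1)            -- 1 x n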
> The results being wrong is bad. I will take a look.
@soumith not so sure you were notified (?) but I made a comment on this PR - an issue would have been better.
> I can look at optimizing it further if needed.
Regarding this, I had a quick look when answering the question on SO, but I didn't have enough time to dig deeper. I found that these two loops can be optimized as follows:
diff --git a/kmeans.lua b/kmeans.lua
index 436818c..276e0e4 100644
--- a/kmeans.lua
+++ b/kmeans.lua
@@ -67,17 +67,18 @@ function unsup.kmeans(x, k, niter, batchsize, callback, verbose)
-- k-means step, on minibatch
local batch = x[{ {i,lasti},{} }]
local batch_t = batch:t()
- local tmp = centroids * batch_t
- for n = 1,(#batch)[1] do
- tmp[{ {},n }]:add(-1,c2)
- end
+ local off = torch.expand(c2, k, m)
+ local tmp = torch.addmm(-1, off, centroids, batch_t)
local val,labels = max(tmp,1)
loss = loss + sum(x2[{ {i,lasti} }]*0.5 - val:t())
-- count examplars per template
local S = x.new(m,k):zero()
- for i = 1,(#labels)[2] do
- S[i][labels[1][i]] = 1
+ assert(S:isContiguous() and labels:isContiguous())
+ local sd = S:data()
+ local ld = labels:data()
+ for i = 0,m-1 do
+ sd[k*i+ld[i]-1] = 1
end
summation:add( S:t() * batch )
counts:add( sum(S,1) )
This gives a huge speed-up on my side (around 20x) on this dataset. Let me know if you have any feedback or if this could be of interest as a PR.
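For reference, a small standalone check of that replacement on random data (names and shapes mirror the snippet above; none of this is part of the PR):

require 'torch'

local k, m, dim = 4, 6, 3
local centroids = torch.randn(k, dim)
local batch     = torch.randn(m, dim)
local batch_t   = batch:t()
local c2        = torch.pow(centroids, 2):sum(2):mul(0.5)   -- k x 1

-- original formulation: matrix product, then subtract c2 column by column
local tmp1 = centroids * batch_t                             -- k x m
for n = 1, m do
   tmp1[{ {}, n }]:add(-1, c2)
end

-- vectorized formulation: expand c2 to k x m and fold the subtraction into addmm
local off  = torch.expand(c2, k, m)
local tmp2 = torch.addmm(-1, off, centroids, batch_t)

print((tmp1 - tmp2):abs():max())                             -- should print ~0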
@deltheil if it's the same result and gives a huge speedup, why not. Does it balk at large datasets? (I see a batchSize, so assuming not.) Can you send a PR for this?
> why not [...] Does it balk at large datasets?
Well, I played with the kmeans torch tutorial, for which the dataset is larger (500,000 samples, 1,024 centroids, 81 dimensions). On such data it doesn't make a real difference performance-wise. And with the default parameters (batchsize=1,000), for some reason, the memory consumption increases like crazy across iterations unless a manual collectgarbage is performed after each batch (a performance killer) or a smaller batchsize is used (to limit the peak). It sounds a bit clunky...
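For reference, the workaround amounts to something like this (schematic, illustrative data; not the actual kmeans.lua loop):

require 'torch'

-- Forcing a GC pass after each minibatch caps the memory peak, at a speed cost.
local x = torch.randn(10000, 81)
local nsamples, batchsize = x:size(1), 1000

for i = 1, nsamples, batchsize do
   local lasti = math.min(i + batchsize - 1, nsamples)
   local batch = x[{ {i, lasti}, {} }]
   local tmp = batch * batch:t()   -- stand-in for the per-batch k-means work
   collectgarbage()                -- reclaim the temporaries allocated in this batch
end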
So I sent a PR for the 1st problem only (#28).
I think this was fixed in #28, so closing.
I have a small benchmark of KMeans in various languages, so I was trying to compare K-means from unsup with a naive implementation in pure Lua. What it does is:

The expected centroids are reported here.

Now, I tried to do the same with unsup:

When I run this, there are three things that leave me baffled:

Is there something I am doing wrong?
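For completeness, here is a minimal sketch of what a naive Euclidean k-means in pure Lua can look like (illustration only; this is not the benchmark code referenced above, and the points are single numbers just for brevity):

-- Naive Lloyd iterations on 1-D points, plain Lua tables only.
local function kmeans1d(points, k, niter)
   -- initialize centroids with the first k points
   local centroids = {}
   for j = 1, k do centroids[j] = points[j] end

   for _ = 1, niter do
      local sums, counts = {}, {}
      for j = 1, k do sums[j], counts[j] = 0, 0 end
      -- assignment step: each point goes to its nearest centroid
      for i = 1, #points do
         local best, bestDist = 1, math.huge
         for j = 1, k do
            local dist = (points[i] - centroids[j])^2
            if dist < bestDist then best, bestDist = j, dist end
         end
         sums[best] = sums[best] + points[i]
         counts[best] = counts[best] + 1
      end
      -- update step: move each centroid to the mean of its assigned points
      for j = 1, k do
         if counts[j] > 0 then centroids[j] = sums[j] / counts[j] end
      end
   end
   return centroids
end

print(unpack(kmeans1d({1.1, 1.2, 0.9, 7.8, 8.1, 8.3}, 2, 10)))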