koraykv / unsup

Some unsupervised learning modules using Torch

KMeans unexpected result #27

Closed. andreaferretti closed this issue 9 years ago.

andreaferretti commented 9 years ago

I have a small benchmark of KMeans in various languages, so I was trying to compare K-means from unsup with a naive implementation in pure Lua.

What it does is:

The expected centroids are reported here.

Now, I tried to do the same with unsup:

require 'io'
cjson = require 'cjson'
require 'torch'
require 'unsup'

local content = io.open("points.json"):read("*a")
local data = cjson.decode(content)
local points = torch.Tensor(data)
local timer = torch.Timer()
local centroids, counts = unsup.kmeans(points, 10, 15)
print(string.format('Time required: %f s', timer:time().real))

When I run this, there are three things that leave me baffled:

Is there something I am doing wrong?

soumith commented 9 years ago

The results being wrong is bad. I will take a look.

The kmeans here is probably optimized for vectors of samples, rather than points that are single numbers.

Regardless, it is not optimal code; unsup is still considered an experimental repo.

I can look at optimizing it further if needed.

andreaferretti commented 9 years ago

Thank you. It is entirely possible that I am doing something wrong, but I cannot find what it is.

Aysegul commented 9 years ago

@andreaferretti one thing I notice is that your implementation is k-means with Euclidean distance, while this implementation is spherical k-means. You can refer to pages 16-17 here: http://www.cs.stanford.edu/~acoates/papers/acoates_thesis.pdf. Spherical k-means is mostly used to train filters for networks.
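
To illustrate the difference, here is a rough sketch of the two assignment steps. This is illustrative code only, not what unsup does internally; the function names and the assumption that `points` is an n x d tensor and `centroids` is a k x d tensor are mine:

require 'torch'

-- Euclidean k-means: each point goes to the centroid with the
-- smallest Euclidean distance.
local function assign_euclidean(points, centroids)
   local n = points:size(1)
   local labels = torch.LongTensor(n)
   for i = 1, n do
      local diff = centroids - points[i]:view(1, -1):expandAs(centroids)
      local dists = diff:norm(2, 2)          -- k x 1 distances
      local _, idx = dists:min(1)
      labels[i] = idx[1][1]
   end
   return labels
end

-- Spherical k-means: centroids are kept at unit length and each point
-- goes to the centroid with the largest dot product.
local function assign_spherical(points, centroids)
   local unit = centroids:clone()
   for j = 1, unit:size(1) do
      unit[j]:div(unit[j]:norm())            -- normalize each centroid row
   end
   local scores = points * unit:t()          -- n x k dot products
   local _, labels = scores:max(2)
   return labels:squeeze()
end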

andreaferretti commented 9 years ago

I see. I am still not sure, though. If I understand correctly, spherical k-means tries to locate centroids on the unit sphere while minimizing a certain functional, and the centroids that I obtain are certainly not of unit norm. So, there may still be something wrong.
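
For what it's worth, the check I did amounts to something like this, appended to my script above (so `points` is the same tensor); if the result were purely spherical, every printed norm should be close to 1:

-- Quick sanity check on the returned centroids: print each row's L2 norm.
local centroids = unsup.kmeans(points, 10, 15)
print(centroids:norm(2, 2))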

Moreover, I think the naming choice is misleading. I guess that spherical K-means has its place in the context of neural networks - I am not expert enough to tell - but outside this context, K-means is quite an established name. Maybe it would make sense to rename kmeans to skmeans and implement a standard (Euclidean) k-means as kmeans?

deltheil commented 9 years ago

The results being wrong is bad. I will take a look.

@soumith I am not so sure you were notified (?), but I made a comment on this PR; an issue would have been better.

I can look at optimizing it further if needed.

Regarding this, I had a quick look while answering the question on SO, but I didn't have enough time to dig deeper. I found that these two loops can be optimized as follows:

diff --git a/kmeans.lua b/kmeans.lua
index 436818c..276e0e4 100644
--- a/kmeans.lua
+++ b/kmeans.lua
@@ -67,17 +67,18 @@ function unsup.kmeans(x, k, niter, batchsize, callback, verbose)
          -- k-means step, on minibatch
          local batch = x[{ {i,lasti},{} }]
          local batch_t = batch:t()
-         local tmp = centroids * batch_t
-         for n = 1,(#batch)[1] do
-            tmp[{ {},n }]:add(-1,c2)
-         end
+         local off = torch.expand(c2, k, m)
+         local tmp = torch.addmm(-1, off, centroids, batch_t)
          local val,labels = max(tmp,1)
          loss = loss + sum(x2[{ {i,lasti} }]*0.5 - val:t())

          -- count examplars per template
          local S = x.new(m,k):zero()
-         for i = 1,(#labels)[2] do
-            S[i][labels[1][i]] = 1
+         assert(S:isContiguous() and labels:isContiguous())
+         local sd = S:data()
+         local ld = labels:data()
+         for i = 0,m-1 do
+            sd[k*i+ld[i]-1] = 1
          end
          summation:add( S:t() * batch )
          counts:add( sum(S,1) )

This gives a huge speed-up on my side (roughly 20x) on this dataset. Let me know if you have any feedback or if this could be of interest as a PR.
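
For context, the first hunk only folds the per-column subtraction into a single addmm call. A standalone sketch of the equivalence, with made-up sizes and variable names (not the actual kmeans.lua state), would be:

require 'torch'

-- Made-up sizes: k centroids of dimension d, a mini-batch of m samples.
local k, d, m = 10, 2, 1000
local centroids = torch.randn(k, d)
local batch_t   = torch.randn(d, m)
local c2 = torch.sum(torch.pow(centroids, 2), 2) * 0.5   -- k x 1

-- Before: matrix product, then subtract c2 from every column in a loop.
local tmp_loop = centroids * batch_t
for n = 1, m do
   tmp_loop[{ {}, n }]:add(-1, c2)
end

-- After: expand c2 to k x m and fold the subtraction into one addmm.
local off = torch.expand(c2, k, m)
local tmp_addmm = torch.addmm(-1, off, centroids, batch_t)

print((tmp_loop - tmp_addmm):abs():max())   -- should print ~0

The second hunk does the same kind of thing for the one-hot matrix S, filling it through raw data pointers instead of per-element tensor indexing.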

soumith commented 9 years ago

@deltheil if it's the same result and gives a huge speedup, why not. Does it balk at large datasets? (I see a batchsize, so I assume not.) Can you send a PR for this?

deltheil commented 9 years ago

why not [...] Does it balk at large datasets?

Well, I played with the kmeans Torch tutorial, for which the dataset is larger (500,000 samples, 1,024 centroids, 81-dimensional data).

On such data it doesn't make a real difference performance-wise. And with the default parameter (batchsize=1,000), for some reason, the memory consumption grows drastically across iterations unless a manual collectgarbage is performed after each batch (a performance killer) or a smaller batchsize is used (to limit the peak). It sounds a bit clunky...
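
To be concrete, the workaround amounts to something like this in the mini-batch loop of kmeans.lua (sketch only, with placeholder sizes and the loop body elided):

-- Force a full GC pass after each mini-batch so the Lua heap stays
-- bounded, at a noticeable cost in speed.
local nsamples, batchsize = 500000, 1000
for i = 1, nsamples, batchsize do
   -- ... k-means step on the mini-batch, as in kmeans.lua ...
   collectgarbage()
end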

So I sent a PR for the first problem only (#28).

koraykv commented 9 years ago

I think this was fixed in #28, so closing.