src-d / kmcuda

Large scale K-means and K-nn implementation on NVIDIA GPU / CUDA

Why do K-means centroids contain NaN? #61

Open CharlesLiu7 opened 6 years ago

CharlesLiu7 commented 6 years ago

I use your kmcuda to run k-means on a large dataset. The dataset contains 21138972 128-dimension vectors, and the target number of centroids is 300k. I use the following code:

import time
import numpy as np
from libKMCUDA import kmeans_cuda
n_clusters = 300000
data = np.load('dataset.npy')
t0 = time.time()
centroids, _ = kmeans_cuda(data, n_clusters, verbosity=1, device=0, yinyang_t=0)
t_mini_batch = time.time() - t0
print('This iteration takes ', t_mini_batch, 's')

and it outputs:

reassignments threshold: 211389
transposing the samples...
performing kmeans++...
done
too few clusters for this yinyang_t => Lloyd
iteration 1: 21138972 reassignments
iteration 2: 8493647 reassignments
iteration 3: 4261612 reassignments
iteration 4: 2966279 reassignments
iteration 5: 2296016 reassignments
iteration 6: 1872182 reassignments
iteration 7: 1583827 reassignments
iteration 8: 1369100 reassignments
iteration 9: 1203159 reassignments
iteration 10: 1073277 reassignments
iteration 11: 964482 reassignments
iteration 12: 874202 reassignments
iteration 13: 798805 reassignments
iteration 14: 734298 reassignments
iteration 15: 676895 reassignments
iteration 16: 629083 reassignments
iteration 17: 585535 reassignments
iteration 18: 549050 reassignments
iteration 19: 514284 reassignments
iteration 20: 482728 reassignments
iteration 21: 455612 reassignments
iteration 22: 429360 reassignments
iteration 23: 405368 reassignments
iteration 24: 385095 reassignments
iteration 25: 367105 reassignments
iteration 26: 349844 reassignments
iteration 27: 333942 reassignments
iteration 28: 319134 reassignments
iteration 29: 305205 reassignments
iteration 30: 292143 reassignments
iteration 31: 280165 reassignments
iteration 32: 267917 reassignments
iteration 33: 257570 reassignments
iteration 34: 247747 reassignments
iteration 35: 237848 reassignments
iteration 36: 228632 reassignments
iteration 37: 219988 reassignments
iteration 38: 210930 reassignments
This iteration takes  20040.06494617462 s

I use a server with 4 TITAN X (Pascal) GPUs, each with 12192 MB of memory.

BUT I find that the centroids matrix contains NaN. Here are the dataset (I used tar czf to compress it) and the centroid result. Am I doing something wrong?

Thanks for any help!

vmarkovtsev commented 6 years ago

Hi @CharlesLiu7

You are doing everything correctly. Some of the centroids lose all of their samples during the iterations, and their coordinates are set to NaN. They effectively "die" afterward and can be safely excluded.
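For example, you can count how many centroids died in your result with a quick check like this (just a sketch, assuming the centroids array you attached):

import numpy as np

centroids = np.load('centroids.npy')    # the attached centroid result
dead = np.isnan(centroids).any(axis=1)  # a centroid is dead if any coordinate is NaN
print('%d of %d centroids are dead' % (dead.sum(), len(centroids)))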

CharlesLiu7 commented 6 years ago

I tried this code several more times, and I found that sometimes the result contains NaN and sometimes it does not. So I see the following solutions:

  1. I can continue to retry until the result doesn't contain NaN, but it's time consuming.
  2. I can reuse the result, replace the centroids that contain NaN, and pass it as the init parameter to run the code again, but I have no idea how to generate the replacement centroids for the NaN ones (random? see the sketch below).
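Something like the following is what I mean by solution 2 (just a sketch; re-seeding the dead centroids from random samples is my guess):

import numpy as np
from libKMCUDA import kmeans_cuda

data = np.load('dataset.npy')
center = np.load('centroids.npy')  # previous result, contains NaN rows

# replace each dead (NaN) centroid with a randomly drawn sample
dead = np.isnan(center).any(axis=1)
center[dead] = data[np.random.choice(len(data), dead.sum(), replace=False)]

centroids, _ = kmeans_cuda(data, len(center), init=center, yinyang_t=0)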

What do you think about solution 2?

(your code helps me a lot, btw, thx again)

CharlesLiu7 commented 6 years ago

I tried the following code 10+ times:

import numpy as np
from libKMCUDA import kmeans_cuda

n_clusters = 300000
data = np.load('dataset.npy')
center = np.load('centroids.npy')  # the centroid result linked above, which contains NaN
centroids, _ = kmeans_cuda(data, n_clusters, init=center, verbosity=2, yinyang_t=0)

and EVERY result I got has some centroids that contain NaN; these centroids are not dead and excluded :cry:.

vmarkovtsev commented 6 years ago

Yes, this works as expected: given the huge number of clusters, some of them lose all their samples and die. The NaNs can be easily filtered out with

mask = ~np.isnan(centroids).any(axis=1)  # True for live centroids
centroids = centroids[mask]
# map old cluster indices to indices in the compacted array
cmap = np.full(len(mask), -1, dtype=int)
for i, x in enumerate(np.where(mask)[0]):  # np.where returns a tuple, take the index array
    cmap[x] = i
assignments = cmap[assignments]  # remap every sample's label at once

kmcuda was designed for reducing the number of samples; this use case does not require the number of clusters to be exactly 300,000. So the question is: why do you need exactly 300,000 clusters?

If you really, really need exactly 300,000, you can check how many centroids usually die, add that number to 300,000 with some excess, drop the few extra ones after clustering, and calculate the final assignments with

centroids, assignments = kmeans_cuda(data, len(centroids), tolerance=1, init=centroids, yinyang_t=0)
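Putting it together, a rough sketch (the 2% excess is just a guess; measure how many centroids actually die in your runs):

import numpy as np
from libKMCUDA import kmeans_cuda

target = 300000
excess = int(target * 0.02)  # guessed safety margin, tune to the observed die-off
data = np.load('dataset.npy')

# over-provision the cluster count, then drop the dead (NaN) centroids
centroids, _ = kmeans_cuda(data, target + excess, yinyang_t=0, verbosity=1)
centroids = centroids[~np.isnan(centroids).any(axis=1)][:target]

# final pass with tolerance=1 to compute the assignments
centroids, assignments = kmeans_cuda(data, len(centroids), tolerance=1,
                                     init=centroids, yinyang_t=0)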