Open CharlesLiu7 opened 6 years ago
Hi @CharlesLiu7
You are doing everything correctly. Some of the centroids lose all the samples while iterating, and their coordinates are set to NaN. They effectively "die" afterward and can be safely excluded.
I ran this code several more times, and sometimes the result contains NaN and sometimes it does not. So I have the following solutions:
init parameter to run the code again, but I have no idea how to generate centroids to replace the ones that contain NaN (random?). What do you think about solution 2?
(your code helps me a lot, btw, thx again)
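One possible way to do the "random?" substitution asked about above is to replace each NaN centroid with a randomly chosen data sample before passing the array back through init. This is only a sketch: reseed_dead_centroids and its seed argument are hypothetical names, not part of kmcuda, and nothing in this thread says the maintainer recommends it. Re-seeded centroids may still lose all their samples again on the next run.

```python
import numpy as np

def reseed_dead_centroids(centroids, data, seed=0):
    """Hypothetical helper: replace NaN centroids with random rows of `data`."""
    rng = np.random.default_rng(seed)
    fixed = centroids.copy()
    dead = np.isnan(fixed).any(axis=1)       # True for centroids containing NaN
    n_dead = int(dead.sum())
    if n_dead:
        # Pick distinct samples so two dead centroids never start identical.
        picks = rng.choice(len(data), size=n_dead, replace=False)
        fixed[dead] = data[picks]
    return fixed

# Toy data (made-up values, not the dataset from this issue).
data = np.arange(20, dtype=np.float32).reshape(10, 2)
centroids = np.array([[1.0, 1.0],
                      [np.nan, np.nan],
                      [7.0, 7.0]], dtype=np.float32)
init = reseed_dead_centroids(centroids, data)
```

The resulting init array has the same shape as the original centroids, so it could be passed as init=init in the same way centroids.npy is passed in the code in this thread.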
I ran the following code 10+ times:
import numpy as np
from libKMCUDA import kmeans_cuda

n_clusters = 300000
data = np.load('dataset.npy')
center = np.load('centroids.npy')  # centroids result from the download link mentioned above; contains NaN
centroids, _ = kmeans_cuda(data, n_clusters, init=center, verbosity=2, yinyang_t=0)
and EVERY result I got has some centroids that contain NaN; these centroids are not dead and excluded :cry:.
Yes, this works as expected: given the huge number of clusters, some of them lose all their samples and die. NaNs can be easily filtered out with
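mask = ~np.isnan(centroids).any(axis=1)  # True for centroids without NaN
centroids = centroids[mask]
# Map old centroid indices to new compact ones; -1 marks dead centroids
cmap = np.full(len(mask), -1, dtype=int)
cmap[mask] = np.arange(mask.sum())
# Rewrite every assignment to point at the compacted centroid array
assignments = cmap[assignments]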
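A small self-contained check of this filtering on toy data may make the index remapping easier to follow. The array values below are made up for illustration and are not kmcuda output.

```python
import numpy as np

# Toy stand-ins for kmeans_cuda results: 4 centroids, centroid 1 is dead.
centroids = np.array([[0.0, 0.0],
                      [np.nan, np.nan],
                      [5.0, 5.0],
                      [9.0, 1.0]], dtype=np.float32)
assignments = np.array([0, 2, 3, 3, 0], dtype=np.uint32)

mask = ~np.isnan(centroids).any(axis=1)   # True for live centroids
live = centroids[mask]                    # compacted centroid array

# Old index -> new index in the compacted array; -1 marks dead centroids.
cmap = np.full(len(mask), -1, dtype=np.int64)
cmap[mask] = np.arange(mask.sum())
new_assignments = cmap[assignments]

print(live.shape)        # (3, 2)
print(new_assignments)   # [0 1 2 2 0]
```

No sample should map to -1 here, because a dead centroid has no samples assigned to it.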
kmcuda was designed for reducing the number of samples; that use case does not require the number of clusters to be exactly 300,000. So the question is, why do you need to have exactly 300,000 clusters?
If you really, really need exactly 300,000, you can check how many centroids are usually dead, add this number to 300,000 with some excess, drop the few extra ones after clustering and calculate the final assignments with
kmeans_cuda(data, len(centroids), tolerance=1, init=centroids, yinyang_t=0)
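The over-provisioning recipe above could be sketched roughly as follows. Both helper names and the 10% excess are assumptions made for illustration; which extra centroids to drop is not specified in the reply, so this sketch simply keeps the first ones. The kmeans_cuda calls are left as comments because they need a GPU.

```python
import numpy as np

def padded_cluster_count(target, trial_centroids, excess=0.1):
    """Hypothetical helper: how many clusters to request so ~target survive."""
    dead = int(np.isnan(trial_centroids).any(axis=1).sum())
    return target + int(np.ceil(dead * (1 + excess)))

def keep_first_live(centroids, target):
    """Hypothetical helper: drop NaN rows, then keep at most `target` rows."""
    live = centroids[~np.isnan(centroids).any(axis=1)]
    return live[:target]

# Toy trial run (made-up values): 10 centroids requested, 2 of them died.
trial = np.ones((10, 2), dtype=np.float32)
trial[[3, 7]] = np.nan
k = padded_cluster_count(8, trial)          # 8 + ceil(2 * 1.1) = 11

# centroids, _ = kmeans_cuda(data, k, yinyang_t=0)   # real clustering run
second = np.ones((k, 2), dtype=np.float32)  # stand-in for that run's output
second[[0, 5]] = np.nan
final = keep_first_live(second, 8)

# centroids, assignments = kmeans_cuda(
#     data, len(final), tolerance=1, init=final, yinyang_t=0)
```

If more centroids die than the padding allows, final will have fewer than the target number of rows, so the length is worth checking before the last call.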
I use your kmcuda to run k-means on a large dataset. The dataset contains 21138972 128-dimensional vectors, and the target number of centroids is 30k. I use the following code:
and it outputs:
I use a server with 4 TITAN X (Pascal) GPUs, each with 12192 MB of memory.
But I find that the centroids matrix contains NaN. This is the dataset (I used tar czf to compress it) and the centroid result. Am I doing something wrong? Thanks for any help!