src-d / kmcuda

Large scale K-means and K-nn implementation on NVIDIA GPU / CUDA
Other
784 stars 144 forks

Mini-batch K-means? #48

Closed EnJiang closed 5 years ago

EnJiang commented 5 years ago

First of all, your code is brilliant! As I understand it, most users of your project are people with a massive dataset to run k-means on (in my case, ~20M points in a ~10-dimensional space, ~4M clusters). For such datasets, mini-batch k-means is a better fit (significantly faster, with little accuracy loss), so wouldn't it be great to have it implemented here?

vmarkovtsev commented 5 years ago

Hi! The library was developed for dimensionality reduction in high-dimensional spaces with ~1000 dimensions, and with the second property that the number of clusters is huge. Mini-batch clustering works very badly when the number of clusters is large, because each mini-batch is expected to contain samples from every cluster. In your case, with 20M samples and 4M clusters, the mini-batch approach is likely to perform significantly worse (in terms of quality) than regular k-means, though of course I don't know the nature of your data. On my own data, the accuracy loss was around 95% (very, very bad).
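A quick back-of-the-envelope check makes the "samples for all clusters" argument concrete. The batch size of 1024 below is my own assumption (a common default in mini-batch k-means implementations), not a number from this thread:

```python
# With ~4M clusters (as in the question) and an assumed mini-batch
# size of 1024, the expected number of samples per cluster per batch
# is tiny, so almost no centroid gets updated in any given batch.
n_clusters = 4_000_000   # ~4M clusters, from the question above
batch_size = 1024        # assumed mini-batch size (not from the thread)

per_cluster = batch_size / n_clusters
print(per_cluster)       # 0.000256 expected samples per cluster per batch
```

In other words, a centroid would wait thousands of batches on average between updates, which is why the quality degrades so sharply in this regime.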

In any case, mini-batch k-means is a different algorithm, and it is already implemented properly in many libraries, e.g. TensorFlow, so I would suggest using one of those instead. Since KMCUDA is currently in maintenance mode (all the features the company requires are implemented), there are no plans to add it in the future.
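For readers curious what the mini-batch variant actually does, here is a minimal NumPy sketch of the Sculley-style update rule (sample a batch, assign each point to its nearest center, move each center toward its points with a per-center learning rate). This is my own illustrative toy, not KMCUDA code, and the initialization-by-explicit-points is a simplification:

```python
import numpy as np

def minibatch_kmeans(X, init_centers, batch_size=256, n_iters=100, seed=0):
    """Illustrative sketch of mini-batch k-means (Sculley-style updates)."""
    rng = np.random.default_rng(seed)
    centers = np.array(init_centers, dtype=np.float64)
    counts = np.zeros(len(centers))  # per-center sample counts
    for _ in range(n_iters):
        batch = X[rng.choice(len(X), batch_size, replace=False)]
        # squared distances from each batch point to each center
        d2 = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        for x, c in zip(batch, d2.argmin(axis=1)):
            counts[c] += 1
            eta = 1.0 / counts[c]          # per-center learning rate
            centers[c] += eta * (x - centers[c])
    return centers

# Toy data: two tight 10-D blobs, around 0 and around 5.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (500, 10)),
               rng.normal(5.0, 0.1, (500, 10))])
centers = minibatch_kmeans(X, init_centers=X[[0, 500]])
print(np.round(centers.mean(axis=1), 1))  # one center near 0, one near 5
```

With only 2 clusters each batch contains plenty of samples per cluster, so the updates converge quickly; with millions of clusters, most rows of `counts` would stay at zero for a long time, which is the failure mode described in the comment above.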