src-d / kmcuda

Large scale K-means and K-nn implementation on NVIDIA GPU / CUDA
Other
784 stars 144 forks

Mini-batch K-means? #48

Closed EnJiang closed 5 years ago

EnJiang commented 5 years ago

First of all, your code is brilliant! As I understand it, most users of your project are people with a massive dataset to run k-means on (in my case, ~20M points in a ~10-dimensional space, ~4M clusters). For such datasets, mini-batch k-means is a better fit (significantly faster, with little accuracy loss), so wouldn't it be great to have it implemented here?

vmarkovtsev commented 5 years ago

Hi! The library was developed for dimensionality reduction in high-dimensional spaces with ~1000 dimensions, and with the second property that the number of clusters is huge. Mini-batch clustering works very badly when the number of clusters is large, because each mini-batch is expected to contain samples from every cluster. In your case, with 20M samples and 4M clusters, the mini-batch approach is likely to perform significantly worse (in terms of quality) than regular k-means, though of course I don't know the nature of your data. On my own data, the accuracy loss was around 95% (very, very bad).
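A quick back-of-the-envelope check makes the "samples for all clusters" argument concrete. The batch size of 1024 below is my own assumption (a common default in mini-batch k-means implementations), not a number from this thread:

```python
# With ~4M clusters (as in the question) and an assumed mini-batch
# size of 1024, the expected number of samples per cluster per batch
# is tiny, so almost no centroid gets updated in any given batch.
n_clusters = 4_000_000   # ~4M clusters, from the question above
batch_size = 1024        # assumed mini-batch size (not from the thread)

per_cluster = batch_size / n_clusters
print(per_cluster)       # 0.000256 expected samples per cluster per batch
```

In other words, a centroid would wait thousands of batches on average between updates, which is why the quality degrades so sharply in this regime.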

In any case, mini-batch k-means is a different algorithm, and it is already implemented properly in many libraries, e.g. TensorFlow, so I would suggest using one of those instead. Since KMCUDA is currently in maintenance mode (all the features the company requires are implemented), there are no plans to add it in the future.
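For readers curious what the mini-batch variant actually does, here is a minimal NumPy sketch of the Sculley-style update rule (sample a batch, assign each point to its nearest center, move each center toward its points with a per-center learning rate). This is my own illustrative toy, not KMCUDA code, and the initialization-by-explicit-points is a simplification:

```python
import numpy as np

def minibatch_kmeans(X, init_centers, batch_size=256, n_iters=100, seed=0):
    """Illustrative sketch of mini-batch k-means (Sculley-style updates)."""
    rng = np.random.default_rng(seed)
    centers = np.array(init_centers, dtype=np.float64)
    counts = np.zeros(len(centers))  # per-center sample counts
    for _ in range(n_iters):
        batch = X[rng.choice(len(X), batch_size, replace=False)]
        # squared distances from each batch point to each center
        d2 = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        for x, c in zip(batch, d2.argmin(axis=1)):
            counts[c] += 1
            eta = 1.0 / counts[c]          # per-center learning rate
            centers[c] += eta * (x - centers[c])
    return centers

# Toy data: two tight 10-D blobs, around 0 and around 5.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (500, 10)),
               rng.normal(5.0, 0.1, (500, 10))])
centers = minibatch_kmeans(X, init_centers=X[[0, 500]])
print(np.round(centers.mean(axis=1), 1))  # one center near 0, one near 5
```

With only 2 clusters each batch contains plenty of samples per cluster, so the updates converge quickly; with millions of clusters, most rows of `counts` would stay at zero for a long time, which is the failure mode described in the comment above.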