mlampros / ClusterR

Gaussian mixture models, k-means, mini-batch-kmeans and k-medoids clustering
https://mlampros.github.io/ClusterR/

Unexpected failure of KMeans_rcpp #23

Closed: jchiquet closed this issue 3 years ago

jchiquet commented 3 years ago

This simple example fails with the Rcpp version of the k-means algorithm found in ClusterR:

ClusterR::KMeans_rcpp(matrix(c(1,-1,-1,-1,-1,1,1,1), 4, 2), 2)

with the following error:

Error in KMEANS_rcpp(data, clusters, num_init, max_iters, initializer,  : 
  unique(): detected NaN

Note that base::kmeans and ClusterR::KMeans_arma both work.

OS: Ubuntu 20.04, R 4.0.5, ClusterR 1.2.4

mlampros commented 3 years ago

Hi @jchiquet, and thanks for reporting this error. It seems to be related to the initializer. I guess that stats::kmeans() picks the initial centroids randomly:


mt = matrix(c(1,-1,-1,-1,-1,1,1,1), 4, 2)
k = 2

# base R
clust1 = stats::kmeans(x = mt, centers = k)
clust1

# RcppArmadillo
seed_mode = c('static_subset', 'random_subset', 'static_spread', 'random_spread')

clust2 = lapply(seq_along(seed_mode), function(x) {
  ClusterR::KMeans_arma(data = mt, clusters = k, seed_mode = seed_mode[x])
})

# Rcpp
inits = c('optimal_init', 'random')

# stored under a separate name so the KMeans_arma results above are not overwritten
clust3 = lapply(seq_along(inits), function(x) {
  ClusterR::KMeans_rcpp(data = mt, clusters = k, initializer = inits[x])
})

I receive an error when the initializer of the ClusterR::KMeans_rcpp() function is set to either 'kmeans++' (the default) or 'quantile_init' (experimental).

In my opinion, your dataset has too few observations for the 'kmeans++' initializer to work. You can have a look at the Rcpp code here

The 'quantile_init' initializer fails (I guess) for the same reason: with so few observations, it cannot compute the quantiles it needs to arrive at candidate centroids.

Can you use one of the other initializers that work on your data ('optimal_init', 'random')?
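If switching the initializer by hand is inconvenient, the same idea can be automated. The following is only a minimal sketch, not part of ClusterR: `kmeans_with_fallback` and its `fit_fun` argument are hypothetical names, and the order in which the initializers are tried is an assumption. It tries each initializer in turn and returns the first fit that does not raise an error.

```r
# Hypothetical helper (not part of ClusterR): try several initializers in order
# and return the first fit that does not raise an error.
kmeans_with_fallback <- function(data, clusters,
                                 initializers = c("kmeans++", "optimal_init", "random"),
                                 fit_fun = NULL) {
  if (is.null(fit_fun)) {
    # Default: delegate to ClusterR::KMeans_rcpp (requires ClusterR to be installed)
    fit_fun <- function(data, clusters, initializer) {
      ClusterR::KMeans_rcpp(data = data, clusters = clusters, initializer = initializer)
    }
  }
  for (init in initializers) {
    # Turn an error (e.g. "unique(): detected NaN") into NULL and move on
    fit <- tryCatch(fit_fun(data, clusters, init), error = function(e) NULL)
    if (!is.null(fit)) {
      return(list(fit = fit, initializer = init))
    }
  }
  stop("all initializers failed: ", paste(initializers, collapse = ", "))
}
```

With ClusterR installed, `kmeans_with_fallback(mt, k)$initializer` would report which initializer produced the returned fit. The `fit_fun` argument exists only so that the fallback logic can be exercised without ClusterR.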

jchiquet commented 3 years ago

Indeed, I use k-means in a split-and-merge strategy to avoid local minima in a more general model-based clustering method. Sometimes, k-means is run in 'extreme' situations just like this one. I shall add some additional tests on my side and/or change the initializer.
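One possible shape for such an additional test, sketched in base R under stated assumptions: `n_distinct_rows` and `is_extreme_case` are hypothetical helper names, and the threshold (no more distinct rows than clusters) is illustrative only. The thread does not establish it as the cause of the NaN error. For the matrix in the report, the four rows contain only two distinct points, (1, -1) and (-1, 1), so the check flags it for 2 clusters.

```r
# Hypothetical guard (illustrative threshold): flag inputs whose number of
# distinct observations does not exceed the requested number of clusters.
n_distinct_rows <- function(x) nrow(unique(as.matrix(x)))

is_extreme_case <- function(x, clusters) n_distinct_rows(x) <= clusters

mt <- matrix(c(1, -1, -1, -1, -1, 1, 1, 1), 4, 2)
n_distinct_rows(mt)      # 2: the rows are (1,-1), (-1,1), (-1,1), (-1,1)
is_extreme_case(mt, 2)   # TRUE
```

A caller could use the flag to skip the split step or to switch to an initializer known to work on such data.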

Anyway, many thanks for the explanation and the follow-up.