mlampros / ClusterR

Gaussian mixture models, k-means, mini-batch-kmeans and k-medoids clustering
https://mlampros.github.io/ClusterR/

pedantic, but order components by mean, covariance #20

Closed EngrStudent closed 4 years ago

EngrStudent commented 4 years ago

If I ran this 20 times over the same data, I could get the same components each time but in a different order, so it would look as if the GMM had found different components.

Here is my problem:

$centroids
            [,1]
[1,]  1.47999270
[2,]  0.03912894
[3,] -0.92191179
[4,] -0.50627765

$covariance_matrices
           [,1]
[1,] 0.52975804
[2,] 0.10771578
[3,] 0.25497064
[4,] 0.06696256

$weights
[1] 0.2384753 0.2035550 0.1886731 0.3692966

The means are not in descending order, so I could get permutations of centroids, associated covariances, and associated weights.

Therefore I suggest sorting by mean location and reordering the covariances and weights to match. I'm only dealing with 1-d data right now, but this would have to work with multidimensional data too.

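# Sort components by centroid (1-d case) and apply the same order everywhere.
# drop = FALSE keeps the k x 1 matrix shape shown in the output above.
gmm_idx <- order(gmm$centroids, decreasing = TRUE)
gmm$centroids <- gmm$centroids[gmm_idx, , drop = FALSE]
gmm$covariance_matrices <- gmm$covariance_matrices[gmm_idx, , drop = FALSE]
gmm$weights <- gmm$weights[gmm_idx]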
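
For multidimensional data, one possible extension is to order the components lexicographically by centroid: sort on the first column and break ties with the second, and so on. The sketch below is not part of ClusterR. `sort_gmm_components` is a hypothetical helper, and it assumes that `centroids` and `covariance_matrices` are matrices with one row per component (as in the 1-d output above) and that `weights` is a vector of the same length. The demo values are the ones posted in this issue.

```r
# Hypothetical helper (not part of ClusterR): reorder GMM components so that
# repeated fits can be compared component by component.
# Assumption: one row per component in `centroids` and `covariance_matrices`.
sort_gmm_components <- function(gmm, decreasing = TRUE) {
  cent <- as.matrix(gmm$centroids)
  # One sort key per dimension: column 1 first, later columns break ties.
  keys <- lapply(seq_len(ncol(cent)), function(j) cent[, j])
  idx <- do.call(order, c(keys, list(decreasing = decreasing)))
  gmm$centroids <- cent[idx, , drop = FALSE]
  gmm$covariance_matrices <- as.matrix(gmm$covariance_matrices)[idx, , drop = FALSE]
  gmm$weights <- gmm$weights[idx]
  gmm
}

# Demo with the values from this issue (1-d, four components).
gmm <- list(
  centroids = matrix(c(1.47999270, 0.03912894, -0.92191179, -0.50627765), ncol = 1),
  covariance_matrices = matrix(c(0.52975804, 0.10771578, 0.25497064, 0.06696256), ncol = 1),
  weights = c(0.2384753, 0.2035550, 0.1886731, 0.3692966)
)
sorted <- sort_gmm_components(gmm)
print(sorted$centroids)
print(sorted$weights)
```

Sorting on the mean alone is only a convention. If two components have nearly identical centroids, small numerical differences between runs can still swap them.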
mlampros commented 4 years ago

Hi @EngrStudent,

The following example fits the model 20 times with a fixed seed and returns the 'centroids', 'covariance_matrices' and 'weights' in the same order on every run:


data(dietary_survey_IBS, package = 'ClusterR')

dat = as.matrix(dietary_survey_IBS[, -ncol(dietary_survey_IBS)])

dat = ClusterR::center_scale(dat)

seed_ibs = 1

for (dist_meth in c('eucl_dist', 'maha_dist')) {

  gmm_ibs = list()

  for (i in 1:20) {

    gmm_ibs[[i]] = ClusterR::GMM(data = dat,
                                 gaussian_comps = 2, 
                                 dist_mode = dist_meth, 
                                 seed_mode = "random_subset", 
                                 km_iter = 10, 
                                 em_iter = 10,
                                 seed = seed_ibs)
  }

  cent_ibs = lapply(gmm_ibs, function(x) x$centroids)
  cov_ibs = lapply(gmm_ibs, function(x) x$covariance_matrices)
  weigh_ibs = lapply(gmm_ibs, function(x) x$weights)

  cat("are all centroids equal for ", dist_meth, " method: ", all(unlist(lapply(cent_ibs[-1], function(y) all(unlist(cent_ibs[[1]] == y))))), '\n')
  cat("are all covariance matrices equal for ", dist_meth, " method: ", all(unlist(lapply(cov_ibs[-1], function(y) all(unlist(cov_ibs[[1]] == y))))), '\n')
  cat("are all weights equal for ", dist_meth, " method: ", all(unlist(lapply(weigh_ibs[-1], function(y) all(unlist(weigh_ibs[[1]] == y))))), '\n')
}
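
The checks above use exact `==` comparisons, so they report FALSE if the same components come back in a different order. A minimal sketch of an order-insensitive comparison follows. `canonical` and `same_components` are hypothetical helpers, not ClusterR functions, and they assume each fit is a list with `centroids` and `covariance_matrices` as one-row-per-component matrices plus a `weights` vector. The demo uses the values posted in this issue rather than a real fit.

```r
# Hypothetical helpers (not part of ClusterR).
# Put a fit into a canonical component order (ascending, lexicographic by centroid).
canonical <- function(gmm) {
  cent <- as.matrix(gmm$centroids)
  idx <- do.call(order, lapply(seq_len(ncol(cent)), function(j) cent[, j]))
  list(
    centroids = unname(cent[idx, , drop = FALSE]),
    covariance_matrices = unname(as.matrix(gmm$covariance_matrices)[idx, , drop = FALSE]),
    weights = as.vector(gmm$weights)[idx]
  )
}

# TRUE if two fits contain the same components, whatever order they are stored in.
same_components <- function(a, b, tol = 1e-8) {
  isTRUE(all.equal(canonical(a), canonical(b), tolerance = tol))
}

# Demo: the fit from this issue and a copy with its components permuted.
fit_a <- list(
  centroids = matrix(c(1.47999270, 0.03912894, -0.92191179, -0.50627765), ncol = 1),
  covariance_matrices = matrix(c(0.52975804, 0.10771578, 0.25497064, 0.06696256), ncol = 1),
  weights = c(0.2384753, 0.2035550, 0.1886731, 0.3692966)
)
perm <- c(3, 1, 4, 2)
fit_b <- list(
  centroids = fit_a$centroids[perm, , drop = FALSE],
  covariance_matrices = fit_a$covariance_matrices[perm, , drop = FALSE],
  weights = fit_a$weights[perm]
)
cat("same components regardless of order:", same_components(fit_a, fit_b), "\n")
```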

If this is not the case for your data set, would you mind adding a reproducible example so we can find out whether there is a bug in the function? Thanks.

stale[bot] commented 4 years ago

This is Robo-lampros because the Human-lampros is lazy. This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs. Feel free to re-open a closed issue and the Human-lampros will respond.