mlampros / ClusterR

Gaussian mixture models, k-means, mini-batch-kmeans and k-medoids clustering
https://mlampros.github.io/ClusterR/

What does the parameter “seed” do in function Cluster_Medoids? (Issue #32--continue) #33

Closed A-Pai closed 2 years ago

A-Pai commented 2 years ago

`library(ClusterR)

data(dietary_survey_IBS)
dat <- dietary_survey_IBS[, -ncol(dietary_survey_IBS)]
dat <- center_scale(dat)
cm <- Cluster_Medoids(dat, clusters = 3, distance_metric = "euclidean", swap_phase = TRUE, seed = 1)
cm1 <- Cluster_Medoids(dat, clusters = 3, distance_metric = "euclidean", swap_phase = TRUE, seed = 1)
cm2 <- Cluster_Medoids(dat, clusters = 3, distance_metric = "euclidean", swap_phase = TRUE, seed = 2)

identical(cm, cm1)
identical(cm$call, cm1$call)
identical(cm$medoids, cm1$medoids)
identical(cm$medoid_indices, cm1$medoid_indices)
identical(cm$best_dissimilarity, cm1$best_dissimilarity)
identical(cm$dissimilarity_matrix, cm1$dissimilarity_matrix)
identical(cm$clusters, cm1$clusters)
identical(cm$silhouette_matrix, cm1$silhouette_matrix)
identical(cm$fuzzy_probs, cm1$fuzzy_probs)
identical(cm$clustering_stats, cm1$clustering_stats)
identical(cm$distance_metric, cm1$distance_metric)

identical(cm, cm2)
identical(cm$call, cm2$call)
identical(cm$medoids, cm2$medoids)
identical(cm$medoid_indices, cm2$medoid_indices)
identical(cm$best_dissimilarity, cm2$best_dissimilarity)
identical(cm$dissimilarity_matrix, cm2$dissimilarity_matrix)
identical(cm$clusters, cm2$clusters)
identical(cm$silhouette_matrix, cm2$silhouette_matrix)
identical(cm$fuzzy_probs, cm2$fuzzy_probs)
identical(cm$clustering_stats, cm2$clustering_stats)
identical(cm$distance_metric, cm2$distance_metric)`

You will get: [two screenshots of the `identical()` output]

You can see that “cm” is not identical to “cm2” only because “cm$call” is not identical to “cm2$call”; the calling expression is the only thing that differs.

mlampros commented 2 years ago

@A-Pai, that's true, you are right. The only difference is in the call (the stored call differs because one has seed = 1 and the other has seed = 2).

require(ClusterR)
require(glue)

data(dietary_survey_IBS)
dat = dietary_survey_IBS[, -ncol(dietary_survey_IBS)]
dat = center_scale(dat)
cm = Cluster_Medoids(dat, clusters = 3, distance_metric = 'euclidean', swap_phase = FALSE, seed = 1)
cm2 = Cluster_Medoids(dat, clusters = 3, distance_metric = 'euclidean', swap_phase = FALSE, seed = 2)

if (!all(names(cm) == names(cm2))) stop("The sublist names differ!")

nams = names(cm)
nams

for (item in nams) {
  cat(glue::glue("{item}: {identical(cm[[item]], cm2[[item]])}"), '\n')
}

# call: FALSE 
# medoids: TRUE 
# medoid_indices: TRUE 
# best_dissimilarity: TRUE 
# dissimilarity_matrix: TRUE 
# clusters: TRUE 
# silhouette_matrix: TRUE 
# fuzzy_probs: TRUE 
# clustering_stats: TRUE 
# distance_metric: TRUE 

print(cm$call)
# Cluster_Medoids(data = dat, clusters = 3, distance_metric = "euclidean", 
#                 swap_phase = FALSE, seed = 1)

Cluster_Medoids differs from the k-means algorithm in that it has no initialization of the centroids (random etc.). The medoids are picked based on the dissimilarity matrix, so they depend only on the data and the selected distance method (euclidean etc.), and this does not change from one run to another. That is why the "seed" parameter has no effect on the results.
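
As a rough illustration of why a seed cannot influence a selection that depends only on the dissimilarity matrix, here is a minimal base-R sketch. It is a simple greedy selection written for this example and is not ClusterR's actual implementation; the helper name `greedy_medoids` and the use of the built-in `iris` data are assumptions made purely for illustration.

```r
# Hypothetical helper (not part of ClusterR): greedily pick k medoids
# from a precomputed dissimilarity matrix. No random numbers are used.
greedy_medoids <- function(d, k) {
  d <- as.matrix(d)
  # First medoid: the observation with the smallest total dissimilarity.
  medoids <- unname(which.min(rowSums(d)))
  while (length(medoids) < k) {
    # Dissimilarity of every observation to its nearest current medoid.
    nearest <- apply(d[, medoids, drop = FALSE], 1, min)
    candidates <- setdiff(seq_len(nrow(d)), medoids)
    # Total cost if each candidate were added as an extra medoid.
    cost <- sapply(candidates, function(j) sum(pmin(nearest, d[, j])))
    medoids <- c(medoids, candidates[which.min(cost)])
  }
  medoids
}

x <- scale(as.matrix(iris[, 1:4]))
d <- dist(x, method = "euclidean")

# Changing the RNG seed has no effect, because nothing here is sampled.
set.seed(1); m1 <- greedy_medoids(d, k = 3)
set.seed(2); m2 <- greedy_medoids(d, k = 3)
identical(m1, m2)
# [1] TRUE
```

Any procedure of this kind returns the same medoids for the same data and distance metric, which matches the behaviour observed above for seed = 1 and seed = 2.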

Give me a few days to add a deprecation warning for the "seed" parameter. Thank you for making me aware of this issue.

mlampros commented 2 years ago

I added a deprecation warning for the 'seed' parameter to the function, and I'll remove this parameter in version 1.3.0.

You can download the latest version using

remotes::install_github('mlampros/ClusterR', upgrade = 'always', dependencies = TRUE, repos = 'https://cloud.r-project.org/')

Feel free to re-open the issue if the code does not work as expected.