mlampros / ClusterR

Gaussian mixture models, k-means, mini-batch-kmeans and k-medoids clustering
https://mlampros.github.io/ClusterR/

The solution of package ClusterR is not optimal #39

Closed A-Pai closed 1 year ago

A-Pai commented 2 years ago

The solution of package ClusterR is not optimal:

library(ClusterR)

n <- 100 # data size
k <- 2
set.seed(3)
x <- rbind(
  matrix(rnorm(n, sd = 0.25), ncol = 2),
  matrix(rnorm(n, mean = 1, sd = 0.25), ncol = 2)
)
ds <- distance_matrix(x, method = "euclidean", upper = TRUE, diagonal = TRUE)

cm <- Cluster_Medoids(x, k, distance_metric = "euclidean")
cat("package ClusterR solution")
cat("\n")
cat("medoid_indices:", cm$medoid_indices)
cat("\n")
cat("best_dissimilarity:", cm$best_dissimilarity)

s <- c(20, 57) # my medoid_indices
b <- ds[, s[1]] < ds[, s[2]]
cat("\n")
cat("\n")
cat("my solution")
cat("\n")
cat("medoid_indices:", s)
cat("\n")
cat("best_dissimilarity:", sum(ds[b, s[1]]) + sum(ds[!b, s[2]]))

library(cluster)
pm <- pam(x, k, metric = "euclidean")
cat("\n")
cat("\n")
cat("package cluster solution")
cat("\n")
cat("medoid_indices:", pm$id.med)
cat("\n")
cat("best_dissimilarity:", n * pm$objective[2])

[screenshot: console output of the code above]

mlampros commented 2 years ago

@A-Pai your results seem to match those of the "cluster::pam()" function when the swap phase is TRUE. I can reproduce this with the following code:


library(ClusterR)

n <- 100 # data size
k <- 2
set.seed(3)
x <- rbind(
  matrix(rnorm(n, sd = 0.25), ncol = 2),
  matrix(rnorm(n, mean = 1, sd = 0.25), ncol = 2)
)

dim(x)

# swap-phase = FALSE

cm = Cluster_Medoids(x, k, distance_metric = "euclidean", swap_phase = F)  # ClusterR 
cm_1 = cluster::pam(x = x, k = k, metric = 'euclidean', do.swap = F)       # cluster

all(cm$medoid_indices == cm_1$id.med)
# TRUE

# swap-phase = TRUE

cm_sw = Cluster_Medoids(x, k, distance_metric = "euclidean", swap_phase = T)  # ClusterR 
cm_sw_1 = cluster::pam(x = x, k = k, metric = 'euclidean', do.swap = T)       # cluster

all(cm_sw$medoid_indices == cm_sw_1$id.med)
# FALSE

Thus, for your edge case there seems to be a bug in the ClusterR::medoids() function: it does not return the optimal medoids when the swap phase is TRUE. Let me have a look over the next few days with other datasets as well.
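
For anyone comparing the two results, a small ad-hoc helper (not part of either package) can be used to compute the total dissimilarity of each medoid set, continuing from the snippet above:

# ad-hoc helper: sum of distances from every observation to its closest medoid
# ('x', 'cm_sw' and 'cm_sw_1' are the objects created in the snippet above)
total_cost <- function(x, medoid_idx) {
  d <- as.matrix(dist(x, method = "euclidean"))      # full pairwise distance matrix
  sum(apply(d[, medoid_idx, drop = FALSE], 1, min))  # distance to the nearest medoid, summed
}

total_cost(x, cm_sw$medoid_indices)   # ClusterR, swap_phase = TRUE
total_cost(x, cm_sw_1$id.med)         # cluster::pam, do.swap = TRUE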

A-Pai commented 2 years ago

It's not optimal in this case either (compared with package cluster):

n <- 100 # data size
set.seed(3)
x <- rbind(
  matrix(rnorm(n, sd = 0.25), ncol = 2),
  matrix(rnorm(n, mean = 1, sd = 0.25), ncol = 2)
)

k <- 5

library(ClusterR)
cat("\n")
cm <- Cluster_Medoids(x, k, distance_metric = "euclidean")
cat("package ClusterR solution")
cat("\n")
cat("medoid_indices:", sort(cm$medoid_indices))
cat("\n")
cat("best_dissimilarity:", cm$best_dissimilarity)

library(cluster)
pm <- pam(x, k, metric = "euclidean")
cat("\n")
cat("\n")
cat("package cluster solution")
cat("\n")
cat("medoid_indices:", sort(pm$id.med))
cat("\n")
cat("best_dissimilarity:", n * pm$objective[2])

[screenshot: console output of the code above]

mlampros commented 2 years ago

@A-Pai thanks for making me aware of this issue. I also used other datasets (besides your sample data), and there are differences in the results between 'ClusterR::Cluster_Medoids()' and other R package k-medoids implementations. There is no quick fix for this; I have to restructure and update the code. I'll do it as soon as possible, depending on my available time.

mlampros commented 2 years ago

@A-Pai

How did you come up with the following line when computing the dissimilarity based on the cluster R package?

cat("best_dissimilarity:", n * pm$objective[2])

Do you have a reference for this computation (code tutorial, paper)?

A-Pai commented 2 years ago

@A-Pai

How did you come up with the following line when computing the dissimilarity based on the cluster R package?

cat("best_dissimilarity:", n * pm$objective[2])

Do you have a reference for this computation (code tutorial, paper)?

[screenshot attached as the reply]
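
For anyone reading along, one way to check what n * pm$objective[2] corresponds to is to compare it with an explicit sum of the distances from each observation to its closest pam() medoid. A minimal sketch, reusing x, n and pm from the k = 5 snippet above (it only illustrates the comparison and makes no claim about pam()'s internals):

# explicit total dissimilarity of the pam() solution
d <- as.matrix(dist(x, method = "euclidean"))
explicit_cost <- sum(apply(d[, pm$id.med, drop = FALSE], 1, min))

explicit_cost        # sum of distances to the closest pam() medoid
n * pm$objective[2]  # the value printed in the snippets above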

mlampros commented 2 years ago

The following is what I see when I specify trace.lev in the 'pam' function:

> pm <- pam(x, k, metric = "euclidean", trace.lev = 2)
C pam(): computing 4951 dissimilarities from  100 x 2  matrix: [Ok]
pam()'s bswap(*, s=2.59294, pamonce=0): build 5 medoids:
    new repr. 79
    new repr. 25
    new repr. 57
    new repr. 45
    new repr. 63
  after build: medoids are  25  45  57  63  79
   swp new  11 <->  25 old; decreasing diss. 22.5013 by -0.73579
   swp new  72 <->  57 old; decreasing diss. 21.7655 by -0.73491
   swp new  90 <->  79 old; decreasing diss. 21.0306 by -0.0853289
   swp new  70 <->  63 old; decreasing diss. 20.9453 by -0.245073
   swp new  75 <->  72 old; decreasing diss. 20.7002 by -0.0885502
end{bswap()}, end{cstat()}

I don't think the way you compute the dissimilarity is correct, because you multiply the number of observations (100 in your input data) by the output objective of the swap phase. The author of the cluster package (based on the trace.lev output) moves from the build-phase objective (22.5013) to the swap phase (20.7002) by replacing medoids so that the dissimilarity is decreased. Your computed dissimilarity might be close to the output of the pam() function, but it's not the same.

mlampros commented 2 years ago

I asked about your output dissimilarity because I'm currently working on the new build phase of the ClusterR package, and I'll update the code in the next few days.

mlampros commented 1 year ago

Since the time you opened this issue:

- In my available time I re-implemented the updated_BUILD function (the 'build' phase of the algorithm), and I observed that there are no differences compared to my existing one.
- The issue that you opened allowed me to compare my implementation with existing ones, and (as I mention in my initial blog post) I understood that it belongs to the approximate 'Partitioning Around Medoids' implementations, which I explain in my latest blog post. I hope this information answers your question regarding the differences in the dissimilarity cost.

Moreover, in the latest version of the ClusterR package (which I submitted yesterday) I added the ClusterR::cost_clusters_from_dissim_medoids() function, which allows a user to receive the dissimilarity cost when using as input a distance matrix and the output medoids of any of the existing implementations.
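
Based on that description, usage would presumably look like the sketch below; the argument order is assumed from the sentence above rather than taken from the package documentation, so check ?cost_clusters_from_dissim_medoids for the exact signature:

library(ClusterR)

n <- 100; k <- 2
set.seed(3)
x <- rbind(matrix(rnorm(n, sd = 0.25), ncol = 2),
           matrix(rnorm(n, mean = 1, sd = 0.25), ncol = 2))

d  <- distance_matrix(x, method = "euclidean", upper = TRUE, diagonal = TRUE)
cm <- Cluster_Medoids(x, k, distance_metric = "euclidean")

# dissimilarity cost of ClusterR's medoids, computed from the distance matrix
# (argument order assumed: distance matrix first, then medoid indices)
cost_clusters_from_dissim_medoids(d, cm$medoid_indices)

# the same function applied to the medoids returned by cluster::pam()
pm <- cluster::pam(x, k, metric = "euclidean")
cost_clusters_from_dissim_medoids(d, pm$id.med)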

A-Pai commented 1 year ago

I have received your email.  Li Gang