Closed A-Pai closed 1 year ago
@A-Pai your results seem to be the same with the "cluster::pam()" function when the swap-phase is TRUE. I can reproduce this using the following code,
library(ClusterR)
n <- 100 # data size
k <- 2
set.seed(3)
x <- rbind(
matrix(rnorm(n, sd = 0.25), ncol = 2),
matrix(rnorm(n, mean = 1, sd = 0.25), ncol = 2)
)
dim(x)
# swap-phase = FALSE
cm = Cluster_Medoids(x, k, distance_metric = "euclidean", swap_phase = F) # ClusterR
cm_1 = cluster::pam(x = x, k = k, metric = 'euclidean', do.swap = F) # cluster
all(cm$medoid_indices == cm_1$id.med)
# TRUE
# swap-phase = TRUE
cm_sw = Cluster_Medoids(x, k, distance_metric = "euclidean", swap_phase = T) # ClusterR
cm_sw_1 = cluster::pam(x = x, k = k, metric = 'euclidean', do.swap = T) # cluster
all(cm_sw$medoid_indices == cm_sw_1$id.med)
# FALSE
Thus, for your edge case there seems to be a bug in the ClusterR::medoids() function that does not return the optimal medoids when the swap phase is TRUE. Let me have a look in the next days with other datasets as well.
It's not optimal in this case either(compared with package cluster):
n <- 100 # data size set.seed(3) x <- rbind( matrix(rnorm(n, sd = 0.25), ncol = 2), matrix(rnorm(n, mean = 1, sd = 0.25), ncol = 2) )
k <- 5
library(ClusterR) cat("\n") cm <- Cluster_Medoids(x, k, distance_metric = "euclidean") cat("package ClusterR solutuion") cat("\n") cat("medoid_indices:", sort(cm$medoid_indices)) cat("\n") cat("best_dissimilarity:", cm$best_dissimilarity)
library(cluster) pm <- pam(x, k, metric = "euclidean") cat("\n") cat("\n") cat("package cluster solutuion") cat("\n") cat("medoid_indices:", sort(pm$id.med)) cat("\n") cat("best_dissimilarity:", n * pm$objective[2])
@A-Pai thanks for making me aware of this issue. I used also other datasets (besides your sample) data and there are differences in the results between the 'ClusterR::Cluster_Medoids()' and other R package K-medoids implementations. There is no quick fix for this, I have to restructure and update the code. I'll do it as soon as possible depending on my available time.
@A-Pai
How do you come to this line when trying to compute the dissimilarity based on the cluster
R package
cat("best_dissimilarity:", n * pm$objective[2])
Do you have a reference for this computation (code tutorial, paper)?
@A-Pai
How do you come to this line when trying to compute the dissimilarity based on the
cluster
R packagecat("best_dissimilarity:", n * pm$objective[2])
Do you have a reference for this computation (code tutorial, paper)?
the following is what I see when I specify the trace.level in the 'pam' function
> pm <- pam(x, k, metric = "euclidean", trace.lev = 2)
C pam(): computing 4951 dissimilarities from 100 x 2 matrix: [Ok]
pam()'s bswap(*, s=2.59294, pamonce=0): build 5 medoids:
new repr. 79
new repr. 25
new repr. 57
new repr. 45
new repr. 63
after build: medoids are 25 45 57 63 79
swp new 11 <-> 25 old; decreasing diss. 22.5013 by -0.73579
swp new 72 <-> 57 old; decreasing diss. 21.7655 by -0.73491
swp new 90 <-> 79 old; decreasing diss. 21.0306 by -0.0853289
swp new 70 <-> 63 old; decreasing diss. 20.9453 by -0.245073
swp new 75 <-> 72 old; decreasing diss. 20.7002 by -0.0885502
end{bswap()}, end{cstat()}
I don't think that the way you compute the dissimilarity is correct because you multiply the number of observations (100 in your input data) with the output objective for the swap phase. Whereas the author of the cluster package (based on the 'trace.level' parameter) moves from the build phase objective (22.5013) to the swap phase (20.7002) by replacing medoids so that the dissimilarity is decreased. Your computed dissimilarity might be close to the output of the pam() function but it's not the same.
I asked about your output dissimilarity because I currently work in the new build phase of the ClusterR package and in the next days I'll update the code
since the time you opened this issue I had to
In my available time I re-implemented the updated_BUILD function (the 'build' phase of the algorithm) where I observed that there are no differences compared to my existing one
The issue that you opened allowed me to compare my implementation with existing ones and (as I mention in my initial blog-post) I understood that it belongs to the approximate 'Partition around Medoids' implementations which I explain in my latest blog-post. I hope this information answers your question regarding the differences in the dissimilarity cost.
Moreover, in the latest version of the ClusterR package (which I submitted yesterday) I added the ClusterR::cost_clusters_from_dissim_medoids() function that allows a user to receive the dissimilarity cost when using as input a distance matrix and the output medoids of any of the existing implementations.
您的邮件我已收到 I have received your email 李刚 Li Gang
The solution of package ClusterR is not best:
library(ClusterR)
n <- 100 # data size k <- 2 set.seed(3) x <- rbind( matrix(rnorm(n, sd = 0.25), ncol = 2), matrix(rnorm(n, mean = 1, sd = 0.25), ncol = 2) ) ds <- distance_matrix(x, method = "euclidean", upper = TRUE, diagonal = TRUE)
cm <- Cluster_Medoids(x, k, distance_metric = "euclidean") cat("package ClusterR solutuion") cat("\n") cat("medoid_indices:", cm$medoid_indices) cat("\n") cat("best_dissimilarity:", cm$best_dissimilarity)
s <- c(20, 57) # my medoid_indices b <- ds[, s[1]] < ds[, s[2]] cat("\n") cat("\n") cat("my solutuion") cat("\n") cat("medoid_indices:", s) cat("\n") cat("best_dissimilarity:", sum(ds[b, s[1]]) + sum(ds[!b, s[2]]))
library(cluster) pm <- pam(x, k, metric = "euclidean") cat("\n") cat("\n") cat("package cluster solutuion") cat("\n") cat("medoid_indices:", pm$id.med) cat("\n") cat("best_dissimilarity:", n * pm$objective[2])