@daniepi I'm sorry for not replying earlier. I'll try to reproduce your issue using ClusterR version 1.2.9 and gtools version 3.9.4 (initially I received an error; the ClusterR package has to be loaded first, otherwise the 'gtools' R package, which is called in the Rcpp code, doesn't work):
data(iris)
iris$Species = NULL
str(iris)
mt = as.matrix(iris)
require(ClusterR)
silh_score <- ClusterR::Optimal_Clusters_KMeans(data = mt,
                                                max_clusters = 1:5,
                                                num_init = 3,
                                                max_iters = 10,
                                                criterion = "silhouette",
                                                initializer = "kmeans++",
                                                plot_clusters = TRUE)
Give me a few days to look into this.
Hi @mlampros, no worries. I have ClusterR version 1.2.7 and gtools version 3.9.3.
@daniepi after looking into the code once again, it seems the 'Optimal_Clusters_KMeans()' function is not the one you should use for your purpose.
The silhouette metric of the 'Optimal_Clusters_KMeans()' function averages the computed silhouette values of all clusters for each iteration. That means, if you compute the optimal clusters 2 to 5 for the iris dataset, then for, let's say, the 2-cluster iteration there are initially 2 values (one silhouette per cluster). Because 'Optimal_Clusters_KMeans()' has to return a single value for each iteration (from 2 to 5), you simply receive the average of the per-cluster values, i.e. for 2 clusters the average of the 2 silhouette values, for 3 clusters the average of the 3 silhouette values, and so on.
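As a toy illustration of that averaging (made-up per-cluster values, not actual output of the package):
# hypothetical per-cluster average silhouettes for a single k = 3 iteration
per_cluster_silh = c(0.62, 0.41, 0.35)
# Optimal_Clusters_KMeans() would report one value for this iteration,
# namely the unweighted mean of the per-cluster values
mean(per_cluster_silh)
# [1] 0.46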
I included the silhouette_of_clusters() function, which has similar functionality to the cluster::silhouette() function. For example,
require(ClusterR)
data(dietary_survey_IBS)
dat = dietary_survey_IBS[, -ncol(dietary_survey_IBS)]
dat = center_scale(dat)
clusters = 2
set_seed = 1
# compute k-means
km = KMeans_rcpp(dat, clusters = clusters, num_init = 5, max_iters = 100, initializer = 'kmeans++', seed = set_seed)
# compute the silhouette width
silh_km = silhouette_of_clusters(data = dat, clusters = km$clusters)
str(silh_km$silhouette_matrix)
# num [1:400, 1:3] 1 1 1 1 1 1 1 1 1 1 ...
# - attr(*, "dimnames")=List of 2
# ..$ : NULL
# ..$ : chr [1:3] "cluster" "intra_cluster_dissim" "silhouette"
silh_km$silhouette_summary
# cluster size avg_intra_dissim avg_silhouette
# 1 1 200 4.187004 0.60082682
# 2 2 200 9.750614 0.06605173
# compare results with the 'cluster' R package
diss = ClusterR::distance_matrix(dat, method = "euclidean", upper = TRUE, diagonal = TRUE, threads = 1L)
set.seed(seed = set_seed)
silh = cluster::silhouette(x = km$clusters, dist = diss)
summary(silh)
# Silhouette of 400 units in 2 clusters from silhouette.default(x = km$clusters, dist = diss) :
# Cluster sizes and average silhouette widths:
# 200 200
# 0.60082682 0.06605173
# Individual silhouette widths:
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# -0.09454 0.07441 0.35524 0.33344 0.60343 0.66267
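As a quick extra check (not shown in the original output), the per-observation silhouette widths should also agree, assuming both functions keep the original observation order:
# compare the per-observation values of the two packages (sketch)
all.equal(as.numeric(silh_km$silhouette_matrix[, "silhouette"]),
          as.numeric(silh[, "sil_width"]))
# expected: TRUE (up to numerical tolerance)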
# iris dataset
data(iris)
iris$Species = NULL
str(iris)
mt = as.matrix(iris)
mt = center_scale(mt)
clusters = 3
set_seed = 1
# compute k-means
km = KMeans_rcpp(mt, clusters = clusters, num_init = 5, max_iters = 100, initializer = 'kmeans++', seed = set_seed)
silh_km = silhouette_of_clusters(data = mt, clusters = km$clusters)
str(silh_km$silhouette_matrix)
# num [1:150, 1:3] 1 1 1 1 1 1 1 1 1 1 ...
# - attr(*, "dimnames")=List of 2
# ..$ : NULL
# ..$ : chr [1:3] "cluster" "intra_cluster_dissim" "silhouette"
silh_km$silhouette_summary
# cluster size avg_intra_dissim avg_silhouette
# 1 1 51 1.327643 0.3304151
# 2 2 50 1.175155 0.6331807
# 3 3 49 1.168557 0.4078544
diss = ClusterR::distance_matrix(mt, method = "euclidean", upper = TRUE, diagonal = TRUE, threads = 1L)
silh = cluster::silhouette(x = km$clusters, dist = diss)
summary(silh)
# Silhouette of 150 units in 3 clusters from silhouette.default(x = km$clusters, dist = diss) :
# Cluster sizes and average silhouette widths:
# 51 50 49
# 0.3304151 0.6331807 0.4078544
# Individual silhouette widths:
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# -0.05003 0.35959 0.47506 0.45663 0.59352 0.73251
As you can see, the silhouette_summary returns the same values as summary(silh) of the cluster R package.
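The same comparison can be done programmatically; a small sketch, assuming the clus.avg.widths component of summary.silhouette from the cluster package:
# per-cluster average widths from both packages (sketch)
summ_cluster_pkg = summary(silh)
all.equal(as.numeric(summ_cluster_pkg$clus.avg.widths),
          silh_km$silhouette_summary$avg_silhouette)
# expected: TRUE (up to numerical tolerance)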
You can install the updated version using
remotes::install_github('mlampros/ClusterR', upgrade = 'always', dependencies = TRUE, repos = 'https://cloud.r-project.org/')
I'll probably update the ClusterR package on CRAN next week because I have to modify a few more functions (not related to your issue).
Regarding performance, I used the microbenchmark R package, and it seems that the cluster::silhouette() function performs better, taking advantage of the pre-computed distance matrix:
require(ClusterR)
data(iris)
iris$Species = NULL
str(iris)
mt = as.matrix(iris)
mt = center_scale(mt)
clusters = 3
set_seed = 1
# compute k-means
km = ClusterR::KMeans_rcpp(mt, clusters = clusters, num_init = 5, max_iters = 100, initializer = 'kmeans++', seed = set_seed)
# 150 observations
microbenchmark::microbenchmark(silhouette_of_clusters(mt, km$clusters), cluster::silhouette(x = km$clusters, dist = stats::dist(mt, method = "euclidean")))
# Unit: microseconds
# expr min lq mean median uq max neval
# silhouette_of_clusters(mt, km$clusters) 1731.439 1832.6275 1890.0648 1862.8150 1903.6090 2333.522 100
# cluster::silhouette(x = km$clusters, dist = stats::dist(mt, method = "euclidean")) 284.070 301.1695 316.1756 309.3755 326.7565 474.051 100
clusters = 4
COLS = 5
# 3000 observations
ROWS = 3000
mt = matrix(runif(n = ROWS * COLS), nrow = ROWS, ncol = COLS)
dim(mt)
km = ClusterR::KMeans_rcpp(mt, clusters = clusters, num_init = 5, max_iters = 100, initializer = 'kmeans++', seed = set_seed)
microbenchmark::microbenchmark(silhouette_of_clusters(mt, km$clusters), cluster::silhouette(x = km$clusters, dist = stats::dist(mt, method = "euclidean")), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# silhouette_of_clusters(mt, km$clusters) 558.8527 561.2622 563.9357 562.5054 563.6190 581.7394 10
# cluster::silhouette(x = km$clusters, dist = stats::dist(mt, method = "euclidean")) 158.9886 160.8787 369.5834 366.5273 571.2039 588.6946 10
# 10000 observations
ROWS = 10000
mt = matrix(runif(n = ROWS * COLS), nrow = ROWS, ncol = COLS)
dim(mt)
km = ClusterR::KMeans_rcpp(mt, clusters = clusters, num_init = 5, max_iters = 100, initializer = 'kmeans++', seed = set_seed)
microbenchmark::microbenchmark(silhouette_of_clusters(mt, km$clusters), cluster::silhouette(x = km$clusters, dist = stats::dist(mt, method = "euclidean")), times = 10)
# Unit: seconds
# expr min lq mean median uq max neval
# silhouette_of_clusters(mt, km$clusters) 6.253863 6.261514 6.279802 6.289321 6.291960 6.297662 10
# cluster::silhouette(x = km$clusters, dist = stats::dist(mt, method = "euclidean")) 1.590505 1.601715 1.683070 1.602795 1.610305 2.022306 10
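Note that in the timings above the stats::dist() call is part of the timed expression for cluster::silhouette; if the distance matrix is reused across calls, it can be computed once outside the benchmark (at the cost of O(n^2) memory). A sketch of that variation:
# pre-compute the distance matrix so that only the silhouette computation is timed
diss = stats::dist(mt, method = "euclidean")
microbenchmark::microbenchmark(silhouette_of_clusters(mt, km$clusters), cluster::silhouette(x = km$clusters, dist = diss), times = 10)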
Hi @mlampros. Maybe I have misunderstood the definition of the average silhouette score. If I understand the behaviour of Optimal_Clusters_KMeans() correctly, it calculates the average silhouette score for each of the k clusters (i.e. averaging the sample silhouette scores of the observations in that cluster) and then averages those k values again to get a single value.
However, I think the proper definition of the average silhouette score is to calculate the sample silhouette scores for all records (same as above) and then take the grand mean.
Example:
silh_sample <- c(rep(0.1, 10), rep(0.9, 90))
mean(silh_sample) # this is what I understood as average silhouette score
0.82
# If the first 10 observations are in cluster 1 and the remaining 90 are in cluster 2, then Optimal_Clusters_KMeans() does
mean(c(0.1, 0.9))
0.5
The average silhouette score should be the grand mean of the individual silhouette scores.
sklearn silhouette analysis example: https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
cuML definition: https://docs.rapids.ai/api/cuml/stable/api.html#cuml.metrics.cluster.silhouette_score.cython_silhouette_score
sklearn definition: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html#sklearn.metrics.silhouette_score
In other words, if the clusters are of quite different sizes and the intra-cluster average silhouette scores vary a lot, these two approaches will give quite different scores.
Yes, that's true. I think that with the silhouette_of_clusters() function the average silhouette per cluster can now be computed, and then the average of those cluster means. The first output value will always be 0.0 in the Optimal_Clusters_KMeans() function because the code is structured to return the 'silhouette' only for clusters > 1.
The function now returns similar results to the 'sklearn.silhouette_score' you mentioned, because internally Optimal_Clusters_KMeans() uses the silhouette_summary of the silhouette_of_clusters() function:
require(ClusterR)
# iris dataset
data(iris)
iris$Species = NULL
str(iris)
mt = as.matrix(iris)
mt = center_scale(mt)
clusters = 3
set_seed = 1
km = KMeans_rcpp(mt, clusters = clusters, num_init = 5, max_iters = 100, initializer = 'kmeans++', seed = set_seed)
# requires the reticulate R package
skl = reticulate::import('sklearn.metrics')
skl_score = skl$silhouette_score(X = mt, labels = km$clusters)
skl_score
# 0.4566338
silh_score = silhouette_of_clusters(data = mt, clusters = km$clusters)
silh_score$silhouette_summary
mean(silh_score$silhouette_summary$avg_silhouette)
# 0.4571501
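For reference (a quick sketch, not part of the original output), the grand mean over all per-observation silhouette values should be close to the sklearn value, since sklearn averages over observations rather than over clusters:
# grand mean of the per-observation silhouette widths (sketch)
mean(silh_score$silhouette_matrix[, "silhouette"])
# expected to be close to skl_score (~0.4566)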
In case one wants to use the official average silhouette score definition (as used by sklearn or cuML), one can't first calculate the mean for each cluster and then the mean of those means. It simply takes the silhouette scores of all samples and calculates the mean of these individual scores. I assume vec_out[COUNT] = mean(silh_summary$avg_silhouette) is a mean of means (given $avg_silhouette), hence the numbers between ClusterR and sklearn do not match.
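Equivalently (a small sketch using the silhouette_summary columns shown earlier), the sklearn-style global score is a size-weighted mean of the per-cluster averages rather than an unweighted mean of means:
summ = silh_score$silhouette_summary
weighted.mean(summ$avg_silhouette, w = summ$size)  # size-weighted: sklearn-style global score
mean(summ$avg_silhouette)                          # unweighted mean of means: the ClusterR value above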
Thanks for making me aware of this issue. The silhouette criterion of Optimal_Clusters_KMeans() was one of those metrics for which I didn't have a reference. I added the 'silhouette_global_average' value to the output, and the following example shows that the results now match:
require(ClusterR)
# iris dataset
data(iris)
iris$Species = NULL
str(iris)
mt = as.matrix(iris)
mt = center_scale(mt)
clusters = 3
set_seed = 1
# compute k-means
km = KMeans_rcpp(mt, clusters = clusters, num_init = 5, max_iters = 100, initializer = 'kmeans++', seed = set_seed)
silh = silhouette_of_clusters(mt, km$clusters)
glob_avg = silh$silhouette_global_average
silh_score <- ClusterR::Optimal_Clusters_KMeans(data = mt,
                                                max_clusters = 5L,
                                                num_init = 5L,
                                                max_iters = 100L,
                                                criterion = "silhouette",
                                                initializer = "kmeans++",
                                                plot_clusters = TRUE)
# requires the reticulate R package
skl = reticulate::import('sklearn.metrics')
skl_score = skl$silhouette_score(X = mt, labels = km$clusters, metric='euclidean')
skl_score
# 0.4566338
silh_score[clusters] == glob_avg
# [1] TRUE
all.equal(glob_avg, skl_score, tolerance = sqrt(.Machine$double.eps))
# [1] TRUE
Feel free to close the issue if the code now works as expected. I'll upload an updated version to CRAN next week.
Looks good. Thanks for looking into this.
Hi, I am struggling to reproduce the output of Optimal_Clusters_KMeans with criterion = "silhouette". For my dataset the differences are quite big. It first peaks at k=2, then slightly drops and increases again, peaking even higher at k=7. My dataset is ~1.4M records, so I run Optimal_Clusters_KMeans on subsets of the data (31 exclusive subsets) in parallel, hence the plot.
However, if I cluster with KMeans_rcpp, calculate distances with distance_matrix and calculate the silhouette with cluster::silhouette, I get very different results suggesting that k=2 is by far the 'best' choice, peaking on average at 0.55 average silhouette width. I ran this only for 10 subsets due to memory usage. This more or less matches the results I see using cuML's KMeans and silhouette_score with init=k-means++, run on a 300k sample.
Example with Iris:
With Iris the differences are not as significant as with my dataset. I get similar differences using initializer = "random". Appreciate any input. Thanks.
R session: