@daniepi I'm sorry for not replying earlier. I'll try to reproduce your issue using ClusterR version 1.2.9 and gtools version 3.9.4 (initially I received an error; the ClusterR package has to be loaded first, otherwise the 'gtools' R package, which is called in the Rcpp code, doesn't work):
data(iris)
iris$Species = NULL
str(iris)
mt = as.matrix(iris)
require(ClusterR)
silh_score <- ClusterR::Optimal_Clusters_KMeans(data = mt,
                                                max_clusters = 1:5,
                                                num_init = 3,
                                                max_iters = 10,
                                                criterion = "silhouette",
                                                initializer = "kmeans++",
                                                plot_clusters = TRUE)
Give me a few days to look into this.
Hi @mlampros, no worries. I have ClusterR version 1.2.7 and gtools version 3.9.3.
@daniepi after looking into the code once again, it seems the 'Optimal_Clusters_KMeans()' function is not the one you should use for your purpose.
The silhouette metric of the 'Optimal_Clusters_KMeans()' function averages the computed silhouette values of all clusters for each iteration. That means, if you compute the optimal clusters 2 to 5 for the iris dataset, then for, let's say, the 2-cluster iteration there are initially 2 values (one silhouette per cluster). Because 'Optimal_Clusters_KMeans()' has to return a single value for each iteration (from 2 to 5), you simply receive the average of the per-cluster values, i.e. for 2 clusters the average of the 2 silhouette values, for 3 clusters the average of the 3 silhouette values, and so on.
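As a toy illustration of that averaging (made-up per-cluster values, not actual output of the package):
# hypothetical per-cluster average silhouettes for a single k = 3 iteration
per_cluster_silh = c(0.62, 0.41, 0.35)
# Optimal_Clusters_KMeans() would report one value for this iteration,
# namely the unweighted mean of the per-cluster values
mean(per_cluster_silh)
# [1] 0.46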
I included the silhouette_of_clusters() function, which has similar functionality to the cluster::silhouette() function. For example,
require(ClusterR)
data(dietary_survey_IBS)
dat = dietary_survey_IBS[, -ncol(dietary_survey_IBS)]
dat = center_scale(dat)
clusters = 2
set_seed = 1
# compute k-means
km = KMeans_rcpp(dat, clusters = clusters, num_init = 5, max_iters = 100, initializer = 'kmeans++', seed = set_seed)
# compute the silhouette width
silh_km = silhouette_of_clusters(data = dat, clusters = km$clusters)
str(silh_km$silhouette_matrix)
# num [1:400, 1:3] 1 1 1 1 1 1 1 1 1 1 ...
# - attr(*, "dimnames")=List of 2
# ..$ : NULL
# ..$ : chr [1:3] "cluster" "intra_cluster_dissim" "silhouette"
silh_km$silhouette_summary
# cluster size avg_intra_dissim avg_silhouette
# 1 1 200 4.187004 0.60082682
# 2 2 200 9.750614 0.06605173
# compare results with the 'cluster' R package
diss = ClusterR::distance_matrix(dat, method = "euclidean", upper = TRUE, diagonal = TRUE, threads = 1L)
set.seed(seed = set_seed)
silh = cluster::silhouette(x = km$clusters, dist = diss)
summary(silh)
# Silhouette of 400 units in 2 clusters from silhouette.default(x = km$clusters, dist = diss) :
# Cluster sizes and average silhouette widths:
# 200 200
# 0.60082682 0.06605173
# Individual silhouette widths:
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# -0.09454 0.07441 0.35524 0.33344 0.60343 0.66267
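As a quick extra check (not shown in the original output), the per-observation silhouette widths should also agree, assuming both functions keep the original observation order:
# compare the per-observation values of the two packages (sketch)
all.equal(as.numeric(silh_km$silhouette_matrix[, "silhouette"]),
          as.numeric(silh[, "sil_width"]))
# expected: TRUE (up to numerical tolerance)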
# iris dataset
data(iris)
iris$Species = NULL
str(iris)
mt = as.matrix(iris)
mt = center_scale(mt)
clusters = 3
set_seed = 1
# compute k-means
km = KMeans_rcpp(mt, clusters = clusters, num_init = 5, max_iters = 100, initializer = 'kmeans++', seed = set_seed)
silh_km = silhouette_of_clusters(data = mt, clusters = km$clusters)
str(silh_km$silhouette_matrix)
# num [1:150, 1:3] 1 1 1 1 1 1 1 1 1 1 ...
# - attr(*, "dimnames")=List of 2
# ..$ : NULL
# ..$ : chr [1:3] "cluster" "intra_cluster_dissim" "silhouette"
silh_km$silhouette_summary
# cluster size avg_intra_dissim avg_silhouette
# 1 1 51 1.327643 0.3304151
# 2 2 50 1.175155 0.6331807
# 3 3 49 1.168557 0.4078544
diss = ClusterR::distance_matrix(mt, method = "euclidean", upper = TRUE, diagonal = TRUE, threads = 1L)
silh = cluster::silhouette(x = km$clusters, dist = diss)
summary(silh)
# Silhouette of 150 units in 3 clusters from silhouette.default(x = km$clusters, dist = diss) :
# Cluster sizes and average silhouette widths:
# 51 50 49
# 0.3304151 0.6331807 0.4078544
# Individual silhouette widths:
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# -0.05003 0.35959 0.47506 0.45663 0.59352 0.73251
As you can see, the silhouette_summary returns the same values as summary(silh) of the cluster R package.
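The same comparison can be done programmatically; a small sketch, assuming the clus.avg.widths component of summary.silhouette from the cluster package:
# per-cluster average widths from both packages (sketch)
summ_cluster_pkg = summary(silh)
all.equal(as.numeric(summ_cluster_pkg$clus.avg.widths),
          silh_km$silhouette_summary$avg_silhouette)
# expected: TRUE (up to numerical tolerance)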
You can install the updated version using
remotes::install_github('mlampros/ClusterR', upgrade = 'always', dependencies = TRUE, repos = 'https://cloud.r-project.org/')
I'll probably update the ClusterR package on CRAN next week because I have to modify a few more functions (not related to your issue).
Regarding performance, I used the microbenchmark R package, and it seems that the cluster::silhouette() function performs better, taking advantage of the pre-computed distance matrix:
require(ClusterR)
data(iris)
iris$Species = NULL
str(iris)
mt = as.matrix(iris)
mt = center_scale(mt)
clusters = 3
set_seed = 1
# compute k-means
km = ClusterR::KMeans_rcpp(mt, clusters = clusters, num_init = 5, max_iters = 100, initializer = 'kmeans++', seed = set_seed)
# 150 observations
microbenchmark::microbenchmark(silhouette_of_clusters(mt, km$clusters), cluster::silhouette(x = km$clusters, dist = stats::dist(mt, method = "euclidean")))
# Unit: microseconds
# expr min lq mean median uq max neval
# silhouette_of_clusters(mt, km$clusters) 1731.439 1832.6275 1890.0648 1862.8150 1903.6090 2333.522 100
# cluster::silhouette(x = km$clusters, dist = stats::dist(mt, method = "euclidean")) 284.070 301.1695 316.1756 309.3755 326.7565 474.051 100
clusters = 4
COLS = 5
# 3000 observations
ROWS = 3000
mt = matrix(runif(n = ROWS * COLS), nrow = ROWS, ncol = COLS)
dim(mt)
km = ClusterR::KMeans_rcpp(mt, clusters = clusters, num_init = 5, max_iters = 100, initializer = 'kmeans++', seed = set_seed)
microbenchmark::microbenchmark(silhouette_of_clusters(mt, km$clusters), cluster::silhouette(x = km$clusters, dist = stats::dist(mt, method = "euclidean")), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# silhouette_of_clusters(mt, km$clusters) 558.8527 561.2622 563.9357 562.5054 563.6190 581.7394 10
# cluster::silhouette(x = km$clusters, dist = stats::dist(mt, method = "euclidean")) 158.9886 160.8787 369.5834 366.5273 571.2039 588.6946 10
# 10000 observations
ROWS = 10000
mt = matrix(runif(n = ROWS * COLS), nrow = ROWS, ncol = COLS)
dim(mt)
km = ClusterR::KMeans_rcpp(mt, clusters = clusters, num_init = 5, max_iters = 100, initializer = 'kmeans++', seed = set_seed)
microbenchmark::microbenchmark(silhouette_of_clusters(mt, km$clusters), cluster::silhouette(x = km$clusters, dist = stats::dist(mt, method = "euclidean")), times = 10)
# Unit: seconds
# expr min lq mean median uq max neval
# silhouette_of_clusters(mt, km$clusters) 6.253863 6.261514 6.279802 6.289321 6.291960 6.297662 10
# cluster::silhouette(x = km$clusters, dist = stats::dist(mt, method = "euclidean")) 1.590505 1.601715 1.683070 1.602795 1.610305 2.022306 10
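Note that in the timings above the stats::dist() call is part of the timed expression for cluster::silhouette; if the distance matrix is reused across calls, it can be computed once outside the benchmark (at the cost of O(n^2) memory). A sketch of that variation:
# pre-compute the distance matrix so that only the silhouette computation is timed
diss = stats::dist(mt, method = "euclidean")
microbenchmark::microbenchmark(silhouette_of_clusters(mt, km$clusters), cluster::silhouette(x = km$clusters, dist = diss), times = 10)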
Hi @mlampros. Maybe I have misunderstood the definition of the average silhouette score. If I understand the behaviour of Optimal_Clusters_KMeans() correctly, it calculates the average silhouette score for each of the k clusters (i.e. averaging the sample silhouette scores of the observations in that cluster) and then averages those k values again to get a single value.
However, I think the proper definition of the average silhouette score is to calculate the sample silhouette scores for all records (same as above) and then take the grand mean.
Example:
silh_sample <- c(rep(0.1, 10), rep(0.9, 90))
mean(silh_sample) # this is what I understood as average silhouette score
0.82
# If the first 10 observations are in cluster 1 and the remaining 90 are in cluster 2, then Optimal_Clusters_KMeans() does
mean(c(0.1, 0.9))
0.5
The average silhouette score should be the grand mean of the individual silhouette scores.
sklearn silhouette analysis example: https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
cuML definition: https://docs.rapids.ai/api/cuml/stable/api.html#cuml.metrics.cluster.silhouette_score.cython_silhouette_score
sklearn definition: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html#sklearn.metrics.silhouette_score
In other words, if the clusters are of quite different sizes and the intra-cluster average silhouette scores vary a lot, these two approaches will give quite different scores.
Yes, that's true. I think that with the silhouette_of_clusters() function the average silhouette per cluster can now be computed, and then the average of those cluster means. The first output value will always be 0.0 in the Optimal_Clusters_KMeans() function because the code is structured to return the 'silhouette' only for clusters > 1.
The function now returns similar results to the 'sklearn.silhouette_score' you mentioned, because internally Optimal_Clusters_KMeans() uses the silhouette_summary of the silhouette_of_clusters() function:
require(ClusterR)
# iris dataset
data(iris)
iris$Species = NULL
str(iris)
mt = as.matrix(iris)
mt = center_scale(mt)
clusters = 3
set_seed = 1
km = KMeans_rcpp(mt, clusters = clusters, num_init = 5, max_iters = 100, initializer = 'kmeans++', seed = set_seed)
# requires the reticulate R package
skl = reticulate::import('sklearn.metrics')
skl_score = skl$silhouette_score(X = mt, labels = km$clusters)
skl_score
# 0.4566338
silh_score = silhouette_of_clusters(data = mt, clusters = km$clusters)
silh_score$silhouette_summary
mean(silh_score$silhouette_summary$avg_silhouette)
# 0.4571501
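For reference (a quick sketch, not part of the original output), the grand mean over all per-observation silhouette values should be close to the sklearn value, since sklearn averages over observations rather than over clusters:
# grand mean of the per-observation silhouette widths (sketch)
mean(silh_score$silhouette_matrix[, "silhouette"])
# expected to be close to skl_score (~0.4566)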
In case one wants to use the official average silhouette score definition (as used by sklearn or cuML), one can't first calculate the mean for each cluster and then the mean of those means. It simply takes the silhouette scores of all samples and calculates the mean of these individual scores. I assume vec_out[COUNT] = mean(silh_summary$avg_silhouette) is a mean of means (given $avg_silhouette), hence the numbers between ClusterR and sklearn do not match.
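Equivalently (a small sketch using the silhouette_summary columns shown earlier), the sklearn-style global score is a size-weighted mean of the per-cluster averages rather than an unweighted mean of means:
summ = silh_score$silhouette_summary
weighted.mean(summ$avg_silhouette, w = summ$size)  # size-weighted: sklearn-style global score
mean(summ$avg_silhouette)                          # unweighted mean of means: the ClusterR value above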
Thanks for making me aware of this issue. The silhouette criterion of Optimal_Clusters_KMeans() was one of those metrics for which I didn't have a reference. I added the 'silhouette_global_average' value to the output, and the following example shows that the results now match:
require(ClusterR)
# iris dataset
data(iris)
iris$Species = NULL
str(iris)
mt = as.matrix(iris)
mt = center_scale(mt)
clusters = 3
set_seed = 1
# compute k-means
km = KMeans_rcpp(mt, clusters = clusters, num_init = 5, max_iters = 100, initializer = 'kmeans++', seed = set_seed)
silh = silhouette_of_clusters(mt, km$clusters)
glob_avg = silh$silhouette_global_average
silh_score <- ClusterR::Optimal_Clusters_KMeans(data = mt,
                                                max_clusters = 5L,
                                                num_init = 5L,
                                                max_iters = 100L,
                                                criterion = "silhouette",
                                                initializer = "kmeans++",
                                                plot_clusters = TRUE)
# requires the reticulate R package
skl = reticulate::import('sklearn.metrics')
skl_score = skl$silhouette_score(X = mt, labels = km$clusters, metric='euclidean')
skl_score
# 0.4566338
silh_score[clusters] == glob_avg
# [1] TRUE
all.equal(glob_avg, skl_score, tolerance = sqrt(.Machine$double.eps))
# [1] TRUE
Feel free to close the issue if the code now works as expected. I'll upload an updated version to CRAN next week.
Looks good. Thanks for looking into this.
Hi, I am struggling to reproduce the output of Optimal_Clusters_KMeans with criterion = "silhouette". For my dataset the differences are quite big. It first peaks at k=2, then slightly drops and increases again, peaking even higher at k=7. My dataset is ~1.4M records, so I run Optimal_Clusters_KMeans on subsets of the data (31 exclusive subsets) in parallel, hence the plot.
However, if I cluster with KMeans_rcpp, calculate distances with distance_matrix and calculate the silhouette with cluster::silhouette, I get very different results suggesting that k=2 is by far the 'best' choice, peaking on average at 0.55 average silhouette width. I ran this only for 10 subsets due to memory usage. This more or less matches the results I see using cuML's KMeans and silhouette_score with init=k-means++, run on a 300k sample.
Example with Iris:
With Iris the differences are not as significant as with my dataset. I get similar differences using initializer = "random". Appreciate any input. Thanks.
R session: