Closed mcoussin closed 11 months ago
Hi @mcoussin,
I do not understand the problem here (my bad, maybe you can explain a little more) but it seems to me like the silhouette value for the entire cluster model can differ from the silhouette scores for the individual clusters. Is this incorrect?
Hi koenderks,
I don't know how the Silhouette is computed in JASP, but in principle the two numbers should be equal. To calculate the Silhouette index, you first compute a Silhouette per observation: basically, the gap between the mean distance to observations in the nearest other cluster and the mean distance to observations in the cluster to which the observation belongs, divided by the larger of the two. The overall Silhouette is then the average of these per-observation values, and is used as a measure of clustering quality. What the clustering algo does, in theory, is to fit the model for different numbers of clusters k, calculate the Silhouette for each k, then select the k that maximizes the Silhouette. So, in theory, the Silhouette for the selected k should be the same as the Silhouette in the "Cluster Information" table.
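To make the definition above concrete, here is a minimal Python sketch of the per-observation Silhouette and its overall average. It is not JASP's code (JASP relies on cluster::silhouette() in R); the function name `silhouette_values` and the toy data are made up for illustration, and the sketch assumes Euclidean distances, at least two clusters, and no duplicate points.

```python
from math import dist
from statistics import mean


def silhouette_values(points, labels):
    """Return s(i) = (b - a) / max(a, b) for every observation i."""
    values = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Collect distances from observation i to all others, grouped by cluster.
        by_cluster = {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if j != i:
                by_cluster.setdefault(lab, []).append(dist(p, q))
        if own not in by_cluster:
            values.append(0.0)  # singleton cluster: s(i) is 0 by convention
            continue
        a = mean(by_cluster[own])  # mean distance within the own cluster
        # Mean distance to the nearest other cluster.
        b = min(mean(d) for lab, d in by_cluster.items() if lab != own)
        values.append((b - a) / max(a, b))
    return values


# Hypothetical toy data: two well-separated clusters on a line.
points = [(0.0,), (1.0,), (10.0,), (11.0,), (12.0,)]
labels = [0, 0, 1, 1, 1]

s = silhouette_values(points, labels)
overall = mean(s)  # the single "model-level" Silhouette
print(s, overall)
```

Averaging `s` over all observations gives one number for the whole model; averaging it within each cluster gives one number per cluster, and those generally differ from each other and from the overall value.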
It's possible (but I'm not at all sure) that this difference is due to the cluster determination algorithm. In particular, the MacQueen and Lloyd algos are sensitive to the choice of initial centroids. If the algo re-estimates the clusters after choosing k, then the Silhouette may differ.
We calculate the silhouette scores in https://github.com/jasp-stats/jaspMachineLearning/blob/master/R/mlClusteringKMeans.R using the silhouette() function from the cluster package. Specifically, we do:
silhouettes <- summary(cluster::silhouette(predictions, .mlClusteringCalculateDistances(dataset[, options[["predictors"]]])))
where .mlClusteringCalculateDistances() is:
.mlClusteringCalculateDistances <- function(x) {
  p <- try({
    distx <- dist(x) # This scales terribly in terms of memory (O(n^2))
  })
  if (isTryError(p) && "allocate" %in% strsplit(x = p[[1]], split = " ")[[1]]) {
    jaspBase:::.quitAnalysis(gettextf("Insufficient RAM available to compute the distance matrix. The analysis tried to allocate %s Gb", .extractMemSizeFromError(p)))
  } else if (isTryError(p)) {
    jaspBase:::.quitAnalysis(gettextf("An error occurred in the analysis: %1$s", .extractErrorMessage(p)))
  }
  return(distx)
}
From these lines
result[["Silh_score"]] <- silhouettes[["avg.width"]]
result[["silh_scores"]] <- silhouettes[["clus.avg.widths"]]
it seems that the silhouette score in the main table is the average width and the silhouette scores in the cluster information table are the cluster average widths.
Does this information help? I'll do some more digging.
OK! After digging into the function, I realize that I had misunderstood the table. It's my bad, I'm really sorry. I thought the "Cluster Information" table showed the model performance for different numbers of clusters k, but it actually shows the average Silhouette of the observations within each of the k clusters.
The Silhouette displayed in the "K-Means Clustering" table is therefore an average of the two per-cluster Silhouettes weighted by cluster size, yep.
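A small numeric check of that relationship, as a Python sketch with made-up silhouette widths (these are not values from JASP or from the dataset in this issue). Here `overall` plays the role of avg.width and `cluster_avg` the role of clus.avg.widths in the R summary above.

```python
from statistics import mean

# Hypothetical per-observation silhouette widths and their cluster labels.
widths = [0.8, 0.6, 0.7, 0.2, 0.4]
labels = [1, 1, 1, 2, 2]

clusters = sorted(set(labels))
sizes = {c: labels.count(c) for c in clusters}
# Average width per cluster (analogous to clus.avg.widths).
cluster_avg = {
    c: mean(w for w, lab in zip(widths, labels) if lab == c) for c in clusters
}

# Average width over all observations (analogous to avg.width).
overall = mean(widths)
# Weighting each cluster average by its size recovers the overall value ...
weighted = sum(sizes[c] * cluster_avg[c] for c in clusters) / len(widths)
# ... whereas a plain mean of the cluster averages generally does not.
unweighted = mean(cluster_avg.values())

print(cluster_avg, overall, weighted, unweighted)
```

With unequal cluster sizes, the overall value matches neither of the per-cluster values nor their simple mean, which is why the two tables show different numbers.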
Sorry again and thanks for your help.
No worries, glad I could help! I guess it could be made clearer in the help file, I'll make a note :)
JASP Version
0.18.1
Commit ID
No response
JASP Module
Machine Learning
What analysis are you seeing the problem on?
Neighborhood-based clustering
What OS are you seeing the problem on?
Windows 11
Bug Description
There seems to be an inconsistency between the silhouette score in the clustering summary table and the values taken by the silhouettes in the "Cluster Information" table, used in part to select the right number of clusters.
Expected Behaviour
The two values should be the same.
Steps to Reproduce
Log (if any)
No response
Final Checklist