jasp-stats / jasp-issues

This repository is solely meant for reporting of bugs, feature requests and other issues in JASP.

[Bug]: Silhouette score in Neighborhood-based clustering #2424

Closed mcoussin closed 11 months ago

mcoussin commented 11 months ago

JASP Version

0.18.1

Commit ID

No response

JASP Module

Machine Learning

What analysis are you seeing the problem on?

Neighborhood-based clustering

What OS are you seeing the problem on?

Windows 11

Bug Description

There seems to be an inconsistency between the silhouette score in the clustering summary table and the values taken by the silhouettes in the "Cluster Information" table, used in part to select the right number of clusters.

Expected Behaviour

The two values should be the same.

Steps to Reproduce

  1. Run a Neighborhood-based clustering analysis.
  2. Display the silhouette score.

Log (if any)

No response

koenderks commented 11 months ago

Hi @mcoussin,

I do not understand the problem here (my bad, maybe you can explain a little more) but it seems to me like the silhouette value for the entire cluster model can differ from the silhouette scores for the individual clusters. Is this incorrect?

mcoussin commented 11 months ago

Hi koenderks,

I don't know how the Silhouette is computed in JASP, but in principle the two numbers should be equal. To calculate the Silhouette index, you first compute a Silhouette for each observation (basically, the gap between its average distance to the observations in the nearest other cluster and its average distance to the observations in its own cluster, scaled by the larger of the two). The overall Silhouette is the average of these per-observation values and is used as a measure of cluster reliability. What the clustering algorithm does, in theory, is fit models with different numbers of clusters k, calculate the Silhouette for each k, and then select the k that maximizes the Silhouette. So, in theory, the Silhouette for the selected k should be the same as the Silhouette in the "Cluster Information" table.
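For reference, the standard textbook definition of the quantities described above can be written as follows. This is the usual formulation, not a statement about JASP's internals; the symbols a(i), b(i), s(i), C_I and d(i, j) are notation introduced here for illustration.

```latex
% Observation i belongs to cluster C_I; d(i, j) is the distance between i and j.
\begin{align*}
a(i) &= \frac{1}{|C_I| - 1} \sum_{j \in C_I,\, j \neq i} d(i, j)
  && \text{(mean distance to own cluster)} \\
b(i) &= \min_{J \neq I} \frac{1}{|C_J|} \sum_{j \in C_J} d(i, j)
  && \text{(mean distance to nearest other cluster)} \\
s(i) &= \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
  && \text{(silhouette of observation } i\text{)} \\
\bar{s} &= \frac{1}{n} \sum_{i=1}^{n} s(i)
  && \text{(overall average silhouette)}
\end{align*}
```

By convention s(i) is set to 0 when observation i is the only member of its cluster.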

It's possible (but I'm not at all sure) that this difference is due to the cluster determination algorithm. In particular, the MacQueen and Lloyd algorithms are sensitive to the choice of initial centroids. If the algorithm re-estimates the clusters after choosing k, then the Silhouette may differ.
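The selection procedure described in this comment can be sketched in R as follows. This is only an illustration of the idea on made-up data, not JASP's actual optimization code; the toy data and the names `candidate_k`, `avg_sil` and `best_k` are assumptions. The seed is fixed because k-means starts from random centroids, so a re-run without it may give slightly different values.

```r
library(cluster)  # provides silhouette()

set.seed(1)  # k-means uses random initial centroids; fix the seed for repeatability

# Toy data (hypothetical): two well-separated 2-D blobs of 30 points each
x <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 5, sd = 0.5), ncol = 2)
)
d <- dist(x)  # Euclidean distance matrix, reused for every k

# Fit k-means for each candidate k and record the overall average silhouette
candidate_k <- 2:5
avg_sil <- sapply(candidate_k, function(k) {
  fit <- kmeans(x, centers = k, nstart = 10)
  summary(silhouette(fit$cluster, d))[["avg.width"]]
})
names(avg_sil) <- candidate_k

# Keep the k with the largest overall average silhouette
best_k <- candidate_k[which.max(avg_sil)]
print(avg_sil)
print(best_k)
```

If the clusters were re-estimated after this step with a different random start, the silhouette of the final model could differ from the value recorded during the sweep, which is the scenario raised above.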

koenderks commented 11 months ago

We calculate the silhouette scores in https://github.com/jasp-stats/jaspMachineLearning/blob/master/R/mlClusteringKMeans.R using the silhouette() function from the cluster package. Specifically, we do:

silhouettes <- summary(cluster::silhouette(predictions, .mlClusteringCalculateDistances(dataset[, options[["predictors"]]])))

Where .mlClusteringCalculateDistances() is

.mlClusteringCalculateDistances <- function(x) {
  p <- try({
    distx <- dist(x) # This scales terribly in terms of memory (O(n^2))
  })
  if (isTryError(p) && "allocate" %in% strsplit(x = p[[1]], split = " ")[[1]]) {
    jaspBase:::.quitAnalysis(gettextf("Insufficient RAM available to compute the distance matrix. The analysis tried to allocate %s Gb", .extractMemSizeFromError(p)))
  } else if (isTryError(p)) {
    jaspBase:::.quitAnalysis(gettextf("An error occurred in the analysis: %1$s", .extractErrorMessage(p)))
  }
  return(distx)
}

From these lines

result[["Silh_score"]] <- silhouettes[["avg.width"]]
result[["silh_scores"]] <- silhouettes[["clus.avg.widths"]]

it seems that the silhouette score in the main table is the average width and the silhouette scores in the cluster information table are the cluster average widths.
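A small standalone check of how these two fields relate, using made-up data rather than anything from JASP (the toy data and the variable names `overall`, `per_cluster`, `sizes` and `weighted` are assumptions for illustration). It relies only on `cluster::silhouette()` and the `avg.width`, `clus.avg.widths` and `clus.sizes` elements of its summary:

```r
library(cluster)  # provides silhouette()

set.seed(42)

# Toy data (hypothetical): two blobs of unequal size (40 and 10 points)
x <- rbind(
  matrix(rnorm(80, mean = 0), ncol = 2),
  matrix(rnorm(20, mean = 4), ncol = 2)
)

fit <- kmeans(x, centers = 2, nstart = 10)
sil <- summary(silhouette(fit$cluster, dist(x)))

overall     <- sil[["avg.width"]]               # one value for the whole model
per_cluster <- sil[["clus.avg.widths"]]         # one value per cluster
sizes       <- as.numeric(sil[["clus.sizes"]])  # observations per cluster

# The overall value is the mean over all observations, so it equals the
# size-weighted mean of the per-cluster averages, not their simple mean.
weighted <- weighted.mean(per_cluster, sizes)
print(all.equal(overall, weighted))
```

With unequal cluster sizes, the overall value will generally match neither per-cluster value nor their unweighted mean, which is why the two tables can show different numbers.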

Does this information help? I'll do some more digging.

mcoussin commented 11 months ago

OK! After digging into the function, I realize that I had misunderstood the table... My bad, I'm really sorry. I thought the "Cluster Information" table showed the model performance for different numbers of clusters k, but it actually shows the average Silhouette of the observations within each of the k clusters.

The Silhouette displayed in the "K-Means Clustering" table is therefore a weighted average (by cluster size) of the two per-cluster Silhouettes, yep.

Sorry again and thanks for your help.

koenderks commented 11 months ago

No worries, glad I could help! I guess it could be made clearer in the help file; I'll make a note :)