of2 / clusterBMA

clusterBMA: Bayesian Model Averaging for Clustering

Clarification regarding factors influencing [final] number of clusters? #2

Closed jlhanson5 closed 9 months ago

jlhanson5 commented 1 year ago

Hello colleagues using clusterBMA,

I had a question about factors that might influence the final number of clusters output from clusterBMA. Basically, I always get 5 clusters regardless of how much/which data I put in. To set the stage, I run these R commands:

clusterBMA::clusterBMA_use_condaenv()
#
library(clusterBMA)
library(clusterGeneration)
library(ClusterR)
library(plotly)
#
# k-means clustering
test_kmeans <- kmeans(test_data,centers = 20)
km_labs <- test_kmeans$cluster
km_probs <- hard_to_prob_fn(km_labs,n_clust=20)

# hierarchical clustering
hca <- hclust(dist(test_data), method = "ward.D2")
plot(hca)
clusterCut <- cutree(hca, 20)
hc_probs <- hard_to_prob_fn(clusterCut,n_clust=20)

# gaussian mixture modeling
test_gmm <- ClusterR::GMM(test_data,gaussian_comps=20)
test_gmm_predict <- ClusterR::predict_GMM(test_data,CENTROIDS=test_gmm$centroids,COVARIANCE = test_gmm$covariance_matrices,WEIGHTS=test_gmm$weights)
gmm_probs <- test_gmm_predict$cluster_proba

# Cluster_Medoids
cm = ClusterR::Cluster_Medoids(test_data, clusters = 20, distance_metric = 'euclidean', swap_phase = TRUE)
cm_labs<-cm$clusters
cm_probs <- hard_to_prob_fn(cm_labs,n_clust=20)

# Clara_Medoids
clm = Clara_Medoids(test_data, clusters = 20, samples = 50, sample_size = 0.25, swap_phase = TRUE)
clm_labs<-clm$clusters
clm_probs <- hard_to_prob_fn(clm_labs,n_clust=20)

# Put cluster allocation probability matrices into list, format required for function clusterBMA::clusterBMA()
input_probs <- list(km_probs,hc_probs,gmm_probs,cm_probs,clm_probs)
test_bma_results <- clusterBMA(input_data = test_data, cluster_prob_matrices = input_probs, n_final_clust = 20) 

# RESULTS
test_consensus_matrix <- test_bma_results[[1]] # consensus matrix
test_bma_allocation_probs <- test_bma_results[[2]] # probs of cluster allocation after BMA
test_bma_cluster_labels_df <- test_bma_results[[3]] # cluster allocations with probability and uncertainty
test_bma_table <- test_bma_results[[4]] # table - how many in each cluster?
test_bma_weights <- test_bma_results[[5]] # weights for each algo
test_bma_weights_times_priors <- test_bma_results[[6]] # weights multiplied by prior probabilities - should be the same as output [5] if prior model weights are set to be equal (default)
test_bma_consensus_heatmap <- test_bma_results[[7]] # heatmap of consensus matrix (output [1])

#
HBN_vars$BMA_clusters<-factor(test_bma_cluster_labels_df$alloc_ordered)

And here's the start of my R sessionInfo

> sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.5.1

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

Obviously, above I specify a few different clustering algorithms and use many more potential groups for each algorithm (in the code: 20); however, even if I change the number of potential groups (or the data), I basically get similar numbers of clusters. So I wondered if there might be something else that was explaining why I keep getting similar answers from clusterBMA? I wanted to make sure there wasn't an issue in my code, or something else that I was doing related to the consistent clustering solutions.

Any thoughts are much appreciated, and thanks much! Jamie.

jlhanson5 commented 10 months ago

[bump]

of2 commented 10 months ago

Hi Jamie, apologies for the delay in responding. I have been on a long break and am just looking at GitHub for the first time in a while.

Am I right in thinking that in the situation you describe above, you have followed the steps in the vignette and the following line is used to generate the object test_data?

# simulate data

test_data <- clusterGeneration::genRandomClust(numClust=5,sepVal=0.15,numNonNoisy=2,numNoisy=0,clustSizes=c(rep(100,5)),numReplicate = 1,clustszind = 3)

In this case, the true number of simulated clusters here is 5. If you change this line to e.g.

test_data_3 <- clusterGeneration::genRandomClust(numClust=3,sepVal=0.15,numNonNoisy=2,numNoisy=0,clustSizes=c(rep(100,3)),numReplicate = 1,clustszind = 3)

then I would expect you to get a clusterBMA solution with k=3, even when each individual algorithm's solution uses k=20.

To get from the consensus matrix to clusterBMA allocations, the matrix factorisation step includes L2 regularisation, which shrinks or empties any redundant clusters. This is a deliberate design feature from the authors of this piece of the method (Duan and Dunson). If I have understood your situation correctly, this is likely why you are ending up with 5 clusters in the BMA solution: using k=20 for each clustering approach produces a lot of redundant clusters that are not agreed upon across the input solutions.
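
One quick way to see this emptying in practice is to count how many columns of an allocation probability matrix are the most probable cluster for at least one observation. The sketch below uses base R only. `n_nonempty_clusters()` is a hypothetical helper, not part of clusterBMA, and the toy matrix merely stands in for an output such as `test_bma_results[[2]]`, assuming that output is an N x K matrix of allocation probabilities as the comments in the original post suggest.

```r
# Hypothetical helper (not part of clusterBMA): count the clusters that are
# the most probable allocation for at least one observation.
n_nonempty_clusters <- function(prob_matrix) {
  stopifnot(is.matrix(prob_matrix), nrow(prob_matrix) > 0)
  hard_labels <- max.col(prob_matrix, ties.method = "first")
  length(unique(hard_labels))
}

# Toy stand-in for an N x K allocation probability matrix:
# 20 columns are available, but only the first 5 ever receive any mass.
toy_labels <- rep(1:5, each = 20)
n_obs <- length(toy_labels)
toy_probs <- matrix(0, nrow = n_obs, ncol = 20)
toy_probs[cbind(seq_len(n_obs), toy_labels)] <- 1

n_available <- ncol(toy_probs)
n_used <- n_nonempty_clusters(toy_probs)
print(c(available = n_available, used = n_used))
```

Running the same helper on each input matrix and on the BMA output would show where the number of occupied clusters drops.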

There are more details in the clusterBMA paper and the Duan & Dunson matrix factorisation papers:

clusterBMA paper https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0288000#pone.0288000.ref029

Duan & Dunson - details on matrix factorisation w/ L2 regularisation https://www.jmlr.org/papers/volume21/19-239/19-239.pdf

Duan LL. Latent Simplex Position Model: High Dimensional Multi-view Clustering with Uncertainty Quantification. Journal of Machine Learning Research. 2020;21:38–1.

Can you give more detail on what you said here about the issue occurring across different datasets? "even if I change the number of potential groups (or the data), I basically get similar numbers of clusters"

In my experience, our implementation of clusterBMA tends to be responsive to differing cluster structure across different datasets and I have not experienced the tendency you describe of getting the same number of clusters (5) across different datasets.

jlhanson5 commented 9 months ago

Hello,

Pardon my delay in responding... I have worked through the vignette and things work (to my knowledge). However, with real data, I keep getting similar solutions (always final k=5). This happens even when I vary some measures/input variables and/or the maximum k for the different algorithms, etc.

How best to proceed? I can send along some code, sample data, etc. But I noticed it a few times and couldn't figure what was up... so any thoughts or guidance is appreciated. Perhaps the data just clusters that way, but I would be slightly surprised by that?

jlhanson5 commented 9 months ago

Just to follow up, and as an example:

- I input 6 variables, and used 5 cluster algorithms to average... and got 5 groups
- I input 5 variables, and used 5 cluster algorithms to average... and got 5 groups
- I input 5 variables, and used 4 cluster algorithms to average... and got 5 groups

Those were all with scaled data, so then I input 5 variables and used 4 cluster algorithms to average (non scaled)... and got 5 groups. I'm sure that there's some similarity in the data, so some similar solutions feels reasonable... but I'm surprised that nothing really bumps things into more (or less) # of groups.
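
Runs like these can be tabulated in one pass so the final k is easy to compare across settings. This is only a sketch: `tabulate_final_k()` and `kmeans_fit()` are hypothetical helpers, the data are random noise, and base R `kmeans` is used purely as a stand-in for the fitting step. A clusterBMA pipeline would need to be wrapped in a function that returns its final hard labels.

```r
# Hypothetical harness: `fit_fn` is any function(data, k) that returns one
# integer cluster label per row of `data`.
tabulate_final_k <- function(data, settings, fit_fn) {
  settings$final_k <- vapply(seq_len(nrow(settings)), function(i) {
    x <- data[, seq_len(settings$n_vars[i]), drop = FALSE]
    if (settings$scaled[i]) x <- scale(x)
    # Number of distinct labels actually present in the solution
    length(unique(fit_fn(x, settings$max_k[i])))
  }, integer(1))
  settings
}

set.seed(42)
toy_data <- matrix(rnorm(300 * 6), ncol = 6)  # random stand-in data

# One row per combination of input variables, maximum k, and scaling
settings <- expand.grid(n_vars = c(5, 6), max_k = c(3, 8), scaled = c(TRUE, FALSE))

# Stand-in fitting step; replace with a wrapper around the real pipeline
kmeans_fit <- function(x, k) kmeans(x, centers = k, nstart = 5)$cluster

result <- tabulate_final_k(toy_data, settings, kmeans_fit)
print(result)
```

A `final_k` column that never changes, whatever the settings, is the pattern described above.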

Thoughts? I have attached an R workspace zipped here.

of2 commented 9 months ago

Hi @jlhanson5, thanks for your patience with this. I found a bug left over from testing/development in consensus_matrix_fn() that manually set the number of clusters to 5. Because this function is called nested within other functions, I hadn't spotted it on previous inspection.

Initially I couldn't reproduce your issue: when testing with K < 5, the L2 regularisation was emptying redundant clusters and giving me the number I expected. However, when testing with simulated K > 5 I saw the same behaviour as you, which led me to the bug.
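
Why tests with K < 5 looked fine can be sketched with a toy model. This is not the clusterBMA source: `toy_final_k()` and its arguments are hypothetical, and the only assumption is the behaviour described in this thread, namely that regularisation can empty redundant clusters but cannot produce more clusters than the factorisation step is given.

```r
# Toy model only; not the clusterBMA implementation. All names are hypothetical.
toy_final_k <- function(true_k, n_final_clust, hardcoded_k = NULL) {
  # Number of clusters the factorisation step is allowed to use.
  # A leftover hard-coded value silently overrides the user's request.
  k_available <- if (is.null(hardcoded_k)) n_final_clust else hardcoded_k
  # Redundant clusters are emptied, so the result never exceeds the true
  # structure, and it can never exceed the number of clusters available.
  min(true_k, k_available)
}

true_k_values <- c(2, 3, 5, 8, 20)
with_bug <- sapply(true_k_values, toy_final_k, n_final_clust = 20, hardcoded_k = 5)
fixed <- sapply(true_k_values, toy_final_k, n_final_clust = 20)

print(data.frame(true_k = true_k_values, with_bug = with_bug, fixed = fixed))
```

In this toy model the two columns agree for every true K up to 5 and diverge only above it, which is why a test sweep needs to include simulated K on both sides of any suspected constant.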

Thank you for pointing out this crucial bug, and my apologies for the long delay in fixing it!

of2 commented 9 months ago

Closing this now

jlhanson5 commented 9 months ago

Ah perfect! Yes, I ran it this morning and got 20 clusters 🤣 🤣 🤣 Thanks for this help!