of2 / clusterBMA

clusterBMA: Bayesian Model Averaging for Clustering

Clarification regarding factors influencing [final] number of clusters? #2

Closed jlhanson5 closed 9 months ago

jlhanson5 commented 1 year ago

Hello colleagues using clusterBMA,

I had a question about factors that might influence the final number of clusters output from clusterBMA. Basically, I always get 5 clusters regardless of how much/which data I put in. To set the stage, I run these R commands:

# k-means clustering
test_kmeans <- kmeans(test_data,centers = 20)
km_labs <- test_kmeans$cluster
km_probs <- hard_to_prob_fn(km_labs,n_clust=20)

# hierarchical clustering
hca <- hclust(dist(test_data), method = "ward.D2")
clusterCut <- cutree(hca, 20)
hc_probs <- hard_to_prob_fn(clusterCut,n_clust=20)

# gaussian mixture modeling
test_gmm <- ClusterR::GMM(test_data,gaussian_comps=20)
test_gmm_predict <- ClusterR::predict_GMM(test_data,CENTROIDS=test_gmm$centroids,COVARIANCE = test_gmm$covariance_matrices,WEIGHTS=test_gmm$weights)
gmm_probs <- test_gmm_predict$cluster_proba

# Cluster_Medoids
cm <- ClusterR::Cluster_Medoids(test_data, clusters = 20, distance_metric = 'euclidean', swap_phase = TRUE)
cm_labs <- cm$clusters # hard cluster labels
cm_probs <- hard_to_prob_fn(cm_labs,n_clust=20)

# Clara_Medoids
clm <- ClusterR::Clara_Medoids(test_data, clusters = 20, samples = 50, sample_size = 0.25, swap_phase = TRUE)
clm_labs <- clm$clusters # hard cluster labels
clm_probs <- hard_to_prob_fn(clm_labs,n_clust=20)

# Put cluster allocation probability matrices into list, format required for function clusterBMA::clusterBMA()
input_probs <- list(km_probs,hc_probs,gmm_probs,cm_probs,clm_probs)
test_bma_results <- clusterBMA(input_data = test_data, cluster_prob_matrices = input_probs, n_final_clust = 20) 

test_consensus_matrix <- test_bma_results[[1]] # consensus matrix
test_bma_allocation_probs <- test_bma_results[[2]] # probs of cluster allocation after BMA
test_bma_cluster_labels_df <- test_bma_results[[3]] # cluster allocations with probability and uncertainty
test_bma_table <- test_bma_results[[4]] # table - how many in each cluster?
test_bma_weights <- test_bma_results[[5]] # weights for each algo
test_bma_weights_times_priors <- test_bma_results[[6]] # weights multiplied by prior probabilities - should be the same as output [5] if prior model weights are set to be equal (default)
test_bma_consensus_heatmap <- test_bma_results[[7]] # heatmap of consensus matrix (output [1])
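To see how many clusters are actually populated in an allocation probability matrix, a minimal base-R sketch may help. The helper name count_nonempty_clusters and the toy matrix are hypothetical and not part of clusterBMA; the helper assumes one row per observation and one column per candidate cluster.

```r
# Hypothetical helper (not part of clusterBMA): count the clusters that
# receive at least one observation under hard (argmax) allocation.
count_nonempty_clusters <- function(prob_matrix) {
  hard_labels <- max.col(prob_matrix, ties.method = "first")
  length(unique(hard_labels))
}

# Toy allocation matrix: 4 observations, 3 candidate clusters.
# The third cluster never has the highest probability, so it stays empty.
toy_probs <- matrix(
  c(0.9, 0.1, 0.0,
    0.8, 0.1, 0.1,
    0.2, 0.7, 0.1,
    0.1, 0.8, 0.1),
  ncol = 3, byrow = TRUE
)

n_used <- count_nonempty_clusters(toy_probs)
print(n_used)
```

Under the same assumption about matrix layout, the helper could be applied to test_bma_allocation_probs, or to every input with sapply(input_probs, count_nonempty_clusters), to compare the populated cluster counts before and after averaging.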


And here's the start of my R sessionInfo

> sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.5.1

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

Obviously, above I specify a few different clustering algorithms and use many more potential groups for each algorithm (in the code: 20); however, even if I change the number of potential groups (or the data), I basically get similar numbers of clusters. So I wondered if there might be something else that was explaining why I keep getting similar answers from clusterBMA? I wanted to make sure there wasn't an issue in my code, or something else that I was doing related to the consistent clustering solutions.

Any thoughts are much appreciated, and thanks much! Jamie.

jlhanson5 commented 10 months ago


of2 commented 10 months ago

Hi Jamie, apologies for the delay in responding. I have been on a long break and am just looking at GitHub for the first time in a while.

Am I right in thinking that in the situation you describe above, you have followed the steps in the vignette and the following line is used to generate the object test_data?

# simulate data

test_data <- clusterGeneration::genRandomClust(numClust=5,sepVal=0.15,numNonNoisy=2,numNoisy=0,clustSizes=c(rep(100,5)),numReplicate = 1,clustszind = 3)

In this case, the true number of simulated clusters here is 5. If you change this line to e.g.

test_data_3 <- clusterGeneration::genRandomClust(numClust=3,sepVal=0.15,numNonNoisy=2,numNoisy=0,clustSizes=c(rep(100,3)),numReplicate = 1,clustszind = 3)

then I would expect you to get a clusterBMA solution with k=3, even with individual algorithm solutions of k=20.

To get from the consensus matrix to clusterBMA allocations, the matrix factorisation step includes L2 regularisation, which serves to shrink or empty any redundant clusters. This is a feature designed by the authors of this piece of the method (Duan and Dunson). If I have understood your situation correctly, this is likely why you are ending up with 5 clusters in the BMA solution: using k=20 for each clustering approach will include a lot of redundant clusters that are not agreed upon across the input solutions.
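As a rough illustration of what "not agreed upon across the input solutions" means, here is a minimal base-R sketch. It assumes that each model's similarity matrix is its allocation matrix multiplied by its own transpose and that the consensus is an equal-weight average of those matrices; the helper one_hot and the toy labels are hypothetical, and the regularised factorisation step itself is not shown.

```r
# Hypothetical helper: one-hot encode hard labels into an N x n_clust matrix.
one_hot <- function(labels, n_clust) {
  probs <- matrix(0, nrow = length(labels), ncol = n_clust)
  probs[cbind(seq_along(labels), labels)] <- 1
  probs
}

# Two hard clusterings of six observations, each allowed up to 4 clusters.
# The second solution splits observation 3 away from observations 1 and 2.
labels_a <- c(1, 1, 1, 2, 2, 2)
labels_b <- c(1, 1, 3, 2, 2, 2)

# Pairwise co-clustering (similarity) matrix for each solution: P %*% t(P).
sim_a <- tcrossprod(one_hot(labels_a, 4))
sim_b <- tcrossprod(one_hot(labels_b, 4))

# Equal-weight consensus (assumption: real weights come from the BMA step).
consensus <- (sim_a + sim_b) / 2
print(consensus)
```

The consensus is N x N no matter how many clusters each input was allowed. Pairs that every solution puts together score 1, pairs that only some solutions put together score between 0 and 1, and unused label columns contribute nothing, which is the sense in which extra clusters can be redundant.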

There are more details in the clusterBMA paper and the Duan & Dunson matrix factorisation papers:

clusterBMA paper https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0288000#pone.0288000.ref029

Duan & Dunson - details on matrix factorisation w/ L2 regularisation https://www.jmlr.org/papers/volume21/19-239/19-239.pdf Duan LL. Latent Simplex Position Model: High Dimensional Multi-view Clustering with Uncertainty Quantification. Journal of Machine Learning Research. 2020;21:38–1.

Can you give more detail on what you said here about the issue occurring across different datasets? "even if I change the number of potential groups (or the data), I basically get similar numbers of clusters"

In my experience, our implementation of clusterBMA tends to be responsive to differing cluster structure across different datasets and I have not experienced the tendency you describe of getting the same number of clusters (5) across different datasets.

jlhanson5 commented 9 months ago


Pardon my delay in responding... I have worked through the vignette and things work (to my knowledge). However, with real data, I keep getting similar solutions (always final k=5). This happens even when I vary some measures/input variables and/or the maximum k for the different algorithms, etc.

How best to proceed? I can send along some code, sample data, etc. I noticed it a few times and couldn't figure out what was up... so any thoughts or guidance are appreciated. Perhaps the data just clusters that way, but I would be slightly surprised by that?

jlhanson5 commented 9 months ago

Just to follow up, as an example:
- I input 6 variables, and used 5 cluster algorithms to average... and got 5 groups
- I input 5 variables, and used 5 cluster algorithms to average... and got 5 groups
- I input 5 variables, and used 4 cluster algorithms to average... and got 5 groups

Those were all with scaled data, so then I input 5 variables and used 4 cluster algorithms to average (non scaled)... and got 5 groups. I'm sure that there's some similarity in the data, so some similar solutions feels reasonable... but I'm surprised that nothing really bumps things into more (or less) # of groups.

Thoughts? I have attached an R workspace zipped here.

of2 commented 9 months ago

Hi @jlhanson5, thanks for your patience with this. I found a bug left over from testing/development in consensus_matrix_fn() that manually set the number of clusters to 5. Because this function is called nested within other functions, I hadn't spotted it on previous inspection.

Initially I couldn't reproduce your issue because, when testing with K < 5, the L2 regularisation was emptying redundant clusters and giving me the number I expected. However, I saw the same behaviour as you when testing with simulated K > 5, and that led me to the bug.
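The pattern described here, where tests with small K pass and only K > 5 exposes the problem, can be sketched in base R. Everything below is hypothetical: capped_pipeline is a stand-in with a deliberately hard-coded limit, not the code of consensus_matrix_fn(), and simulate_clusters generates simple one-dimensional toy data.

```r
# Simulate well-separated 1-D clusters: `true_k` groups of `n_per` points.
simulate_clusters <- function(true_k, n_per = 30) {
  centers <- rep(seq_len(true_k) * 10, each = n_per)
  matrix(centers + rnorm(length(centers), sd = 0.5), ncol = 1)
}

# Hypothetical stand-in for a pipeline with a leftover hard-coded cap of 5.
capped_pipeline <- function(data, max_k) {
  k <- min(max_k, 5)  # the kind of leftover constant that hides in nested code
  labels <- cutree(hclust(dist(data), method = "ward.D2"), k)
  length(unique(labels))
}

set.seed(1)
true_ks <- c(3, 5, 8)
found_ks <- sapply(true_ks, function(k) {
  capped_pipeline(simulate_clusters(k), max_k = k)
})
print(data.frame(true_k = true_ks, found_k = found_ks))
```

In this toy setup the recovered count matches the truth for 3 and 5 simulated clusters and only diverges at 8, so a check that includes at least one simulated K above any suspected cap is what reveals the constant.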

Thank you for pointing out this crucial bug, and my apologies for the long delay in fixing it!

of2 commented 9 months ago

Closing this now

jlhanson5 commented 9 months ago

Ah perfect! Yes, I ran it this morning and got 20 clusters 🤣 🤣 🤣 Thanks for this help!