Closed jlhanson5 closed 9 months ago
[bump]
Hi Jamie, apologies for the delay in responding - I have been on a long break and am just looking at GitHub for the first time in a while.
Am I right in thinking that, in the situation you describe above, you have followed the steps in the vignette and that the following line is used to generate the object test_data?
test_data <- clusterGeneration::genRandomClust(numClust = 5, sepVal = 0.15, numNonNoisy = 2, numNoisy = 0, clustSizes = c(rep(100, 5)), numReplicate = 1, clustszind = 3)
In this case, the true number of simulated clusters here is 5. If you change this line to e.g.
test_data_3 <- clusterGeneration::genRandomClust(numClust = 3, sepVal = 0.15, numNonNoisy = 2, numNoisy = 0, clustSizes = c(rep(100, 3)), numReplicate = 1, clustszind = 3)
then I would expect you to get a clusterBMA solution with k=3, even with individual algorithm solutions of k=20?
To get from the consensus matrix to the clusterBMA allocations, the matrix factorisation step includes L2 regularisation, which serves to shrink and empty any redundant clusters; this is a deliberate design feature from the authors of this piece of the method (Duan and Dunson). If I have understood your situation correctly, this is likely why you are ending up with 5 clusters in the BMA solution: using k=20 for each clustering approach will include many redundant clusters that are not agreed upon across the input solutions.
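To make the "consensus matrix" part of this concrete, here is a minimal sketch (not clusterBMA's actual internals; the allocations and algorithm names are hypothetical) of building a simple co-allocation consensus matrix from several hard clusterings: entry (i, j) is the proportion of input solutions that place observations i and j in the same cluster.

```r
# Hedged illustration only - not the clusterBMA implementation.
# Each column of `allocations` is one hypothetical algorithm's cluster labels.
n <- 6
allocations <- cbind(
  alg1 = c(1, 1, 2, 2, 3, 3),  # hypothetical solution with k = 3
  alg2 = c(1, 1, 2, 2, 2, 2)   # hypothetical solution with k = 2
)

# Consensus matrix: proportion of input solutions in which each pair
# of observations is allocated to the same cluster.
consensus <- matrix(0, n, n)
for (m in seq_len(ncol(allocations))) {
  same <- outer(allocations[, m], allocations[, m], "==")
  consensus <- consensus + same / ncol(allocations)
}
round(consensus, 2)
```

In this toy example observations 1 and 2 co-cluster in both solutions (consensus 1), while observations 3 and 5 co-cluster in only one (consensus 0.5). A matrix factorisation of this matrix with enough regularisation would then shrink away cluster structure that the input solutions do not agree on.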
There are more details in the clusterBMA paper and the Duan & Dunson matrix factorisation papers:
clusterBMA paper https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0288000#pone.0288000.ref029
Duan & Dunson - details on matrix factorisation with L2 regularisation: https://www.jmlr.org/papers/volume21/19-239/19-239.pdf (Duan LL. Latent Simplex Position Model: High Dimensional Multi-view Clustering with Uncertainty Quantification. Journal of Machine Learning Research. 2020;21:38–1.)
Can you give more detail on what you said here about the issue occurring across different datasets? "even if I change the number of potential groups (or the data), I basically get similar numbers of clusters"
In my experience, our implementation of clusterBMA tends to be responsive to differing cluster structure across different datasets and I have not experienced the tendency you describe of getting the same number of clusters (5) across different datasets.
Hello,
Pardon my delay in responding... I have worked through the vignette and things work (to my knowledge). However, with real data, I consistently get similar solutions (always a final k=5). This happens even when I vary some measures/input variables and/or the maximum k for the different algorithms, etc.
How best to proceed? I can send along some code, sample data, etc. But I have noticed it a few times and couldn't figure out what was up... so any thoughts or guidance are appreciated. Perhaps the data just clusters that way, but I would be slightly surprised by that?
Just to follow up, as an example:
- I input 6 variables and used 5 cluster algorithms to average... and got 5 groups
- I input 5 variables and used 5 cluster algorithms to average... and got 5 groups
- I input 5 variables and used 4 cluster algorithms to average... and got 5 groups
Those were all with scaled data, so I then input 5 variables and used 4 cluster algorithms to average (non-scaled)... and got 5 groups. I'm sure there's some similarity in the data, so some similar solutions feel reasonable... but I'm surprised that nothing really bumps things into more (or fewer) groups.
Thoughts? I have attached an R workspace zipped here.
Hi @jlhanson5, thanks for your patience with this. I found a bug left over from testing/development in consensus_matrix_fn() that manually set the number of clusters to 5. Because this function is called nested within other functions, I hadn't spotted it on previous inspection.
Initially I couldn't reproduce your issue: when testing with simulated K < 5, the L2 regularisation was emptying the redundant clusters and giving me the number I expected. However, I noticed the same behaviour you describe when testing with simulated K > 5, and traced it to this bug.
Thank you for pointing out this crucial bug, and my apologies for the long delay in fixing it!
Closing this now
Ah perfect! Yes, I ran it this morning and got 20 clusters 🤣 🤣 🤣 Thanks for this help!
Hello colleagues using clusterBMA,
I had a question about the factors that might influence the final number of clusters output by clusterBMA. Basically, I always get 5 clusters regardless of how much data, or which data, I put in. To set the stage, I run these R commands:
And here's the start of my R sessionInfo:
Obviously, above I specify a few different clustering algorithms and use many more potential groups for each algorithm (in the code: 20); however, even if I change the number of potential groups (or the data), I basically get similar numbers of clusters. So I wondered whether something else might explain why I keep getting similar answers from clusterBMA? I wanted to make sure there wasn't an issue in my code, or something else I was doing, behind the consistent clustering solutions.
Any thoughts are much appreciated, and thanks much! Jamie.