thibautjombart / adegenet

adegenet: a R package for the multivariate analysis of genetic markers
Not reproducible results with find.clusters #335

Open Deepak12Kaushik opened 2 years ago

Deepak12Kaushik commented 2 years ago

I am using the find.clusters function on phenotypic data from wheat (my data set is similar to the USArrests dataset) to decide how many clusters to cut the dendrogram into. However, the cluster assignment changes on every run: for example, the first cluster has 4 members and the second has 2 in one run, but repeating the call with the same settings gives a first cluster with, say, 5 members, and so on. The results are not reproducible.

df is my dataset

foo.BIC <- find.clusters(df, max.n = 20, n.pca = 200, scale = FALSE, stat = "BIC", method = "kmeans")
plot(foo.BIC$Kstat, type = "o", xlab = "number of clusters (K)", ylab = "BIC", col = "green", main = "Detection based on BIC")
points(5, foo.BIC$Kstat[5], pch = "x", cex = 3)
mtext(side = 3, text = "'X' indicates the actual number of clusters")

foo.BIC$size
foo.BIC$grp

sanderdebacker commented 2 months ago

Posting my findings here because I was looking for an answer to a similar problem myself. Hopefully this is useful for other users.

I've found this in another thread:

Odd shapes of the decrease of BIC can occur for several reasons. The possible explanations I can think of are: a) there are no clearly identifiable clusters in the data. b) there are clusters to be identified, but not enough information to disentangle different values of k. In your case this seems very likely: there are few SNPs, and if half of them are specific to one individual they are not informative in terms of clusters.

Original reference: https://lists.r-forge.r-project.org/pipermail/adegenet-forum/2011-June/000303.html

Otherwise, it would be worth increasing the number of k-means runs (n.start, default is 10) and the number of iterations for each run (n.iter, default is 1e5) to gain a bit of stability. Hopefully that makes your analysis reproducible.

EDIT: just as an example, for my data the analysis stabilised with n.start = 1000 and n.iter = 1e9.
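
For anyone landing here, a minimal sketch of the underlying behaviour may help. find.clusters runs k-means internally, and k-means picks its starting centres at random, so two calls can return different groupings, or the same grouping with the cluster numbers permuted. The example below uses only base R's kmeans on the built-in USArrests data as a stand-in for the wheat table. The helper name run_kmeans and the choice of k = 4 are illustrative assumptions, not part of adegenet.

```r
# Sketch: k-means reproducibility under a fixed random seed (base R only).
# USArrests stands in for the phenotypic data set; k = 4 is arbitrary.
X <- scale(USArrests)

# Hypothetical helper: fix the seed, then run k-means with several restarts.
run_kmeans <- function(seed, k = 4, nstart = 10, iter.max = 1e5) {
  set.seed(seed)  # fixes the random starting centres
  kmeans(X, centers = k, nstart = nstart, iter.max = iter.max)
}

a <- run_kmeans(seed = 1)
b <- run_kmeans(seed = 1)
same_seed_identical <- identical(a$cluster, b$cluster)
print(same_seed_identical)  # same seed, so the assignments match exactly

# With a different seed the labels may be permuted even when the partition
# is the same; a cross-tabulation compares partitions independent of labels.
c2 <- run_kmeans(seed = 2)
print(table(run1 = a$cluster, run2 = c2$cluster))
```

The same idea should carry over to adegenet: calling set.seed() immediately before find.clusters(), with n.clust and n.pca given explicitly so the function does not prompt, fixes the random starts. Raising n.start and n.iter, as suggested above, then addresses stability across seeds rather than repeatability of a single call.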