thibautjombart / adegenet

adegenet: a R package for the multivariate analysis of genetic markers

Very slow find.clusters and dapc computation #285

Open vlattko opened 4 years ago

vlattko commented 4 years ago

I have a plink12-formatted file with 974 individuals and ~181,000 filtered SNPs. I have been running find.clusters and dapc for two days in two separate sessions, and neither has converged yet. The maximum number of clusters was set to 20, as was the number of PCs.

Is there any way to speed things up besides parallelization?

Thanks!

zkamvar commented 4 years ago

I would suggest reducing your data: use the loadings from a principal component analysis to identify the loci that contribute the most to the primary PC axes, and keep only those.
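A minimal sketch of that selection step, written in Python purely for illustration (adegenet itself is R). It assumes a loci-by-axes loadings matrix has already been obtained from whatever PCA routine you use. The function name `top_loci_by_loading` and the parameters `n_axes` and `n_keep` are hypothetical and not part of adegenet. Ranking loci by their summed squared loadings is one reasonable criterion, not the only one.

```python
def top_loci_by_loading(loadings, n_axes, n_keep):
    """Return the indices of the n_keep loci with the largest summed
    squared loadings on the first n_axes PC axes.

    loadings: list of rows, one row per locus, one column per PC axis
    (hypothetical input; compute it with your own PCA beforehand).
    """
    # Contribution of each locus = sum of squared loadings on retained axes.
    contrib = [sum(v * v for v in row[:n_axes]) for row in loadings]
    # Rank loci from largest to smallest contribution.
    order = sorted(range(len(loadings)), key=lambda i: contrib[i], reverse=True)
    # Return the kept indices in their original (genomic) order.
    return sorted(order[:n_keep])


# Toy loadings: 5 loci x 2 axes (made-up numbers for illustration).
toy_loadings = [
    [0.70, 0.05],
    [0.10, 0.10],
    [0.05, 0.80],
    [0.60, 0.30],
    [0.02, 0.01],
]
selected = top_loci_by_loading(toy_loadings, n_axes=2, n_keep=3)
print(selected)
```

The returned indices can then be used to subset the genotype matrix before rerunning the clustering on the reduced data.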

vlattko commented 4 years ago

But wouldn't that bias the estimate of the number of components in find.clusters? If the markers are not in LD, different sets of markers would be contributing to different components and the correlation between PCs should be close to zero.

I am now running the analysis on a subset of 10,000 randomly selected markers (the original filtered set was around 500k SNPs), which still appears to be too many ...
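For reference, a sketch of how such a random marker subset can be drawn reproducibly, so that separate sessions work on the same markers. It is in Python for illustration only. `sample_marker_indices`, the seed value, and the marker total used below are placeholders, not values taken from any adegenet function. Substitute the real marker count of your dataset.

```python
import random


def sample_marker_indices(n_markers, n_keep, seed):
    """Draw n_keep distinct marker indices out of n_markers, without
    replacement, using a fixed seed so the subset can be regenerated."""
    rng = random.Random(seed)  # local generator; global state is untouched
    return sorted(rng.sample(range(n_markers), n_keep))


# Placeholder sizes: replace n_markers with the actual number of SNPs.
subset = sample_marker_indices(n_markers=181_000, n_keep=10_000, seed=1)
print(len(subset), subset[:5])
```

Drawing a smaller subset only requires changing `n_keep`.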