Open vlattko opened 4 years ago
I would suggest reducing your data by using the loadings from a principal components analysis to identify the loci that contribute the most to the primary PC axes.
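For illustration only, here is a rough sketch of what loading-based marker selection could look like. It uses plain NumPy on a toy genotype matrix rather than adegenet, and the function name `top_loading_loci`, its parameters, and the simulated data are all made up for this example. Weighting each PC's squared loadings by its explained variance is one possible scoring choice among several, not an established recipe.

```python
import numpy as np

def top_loading_loci(genotypes, n_pcs=2, n_keep=100):
    """Return sorted column indices of the loci that contribute most to the first PCs.

    genotypes: individuals x loci array of allele counts (0/1/2), no missing data.
    """
    X = np.asarray(genotypes, dtype=float)
    X = X - X.mean(axis=0)  # centre each locus
    # SVD of the centred matrix: rows of vt are the PC loading vectors
    _, s, vt = np.linalg.svd(X, full_matrices=False)
    # Weight each PC's squared loadings by the variance it explains (assumed scoring)
    weights = s[:n_pcs] ** 2
    contrib = (weights[:, None] * vt[:n_pcs] ** 2).sum(axis=0)
    keep = np.argsort(contrib)[::-1][:n_keep]  # loci with the largest scores
    return np.sort(keep)

# Toy data (hypothetical): 60 individuals, 500 loci, two groups that
# differ in allele frequency at the first 10 loci only.
rng = np.random.default_rng(0)
n_ind, n_loci = 60, 500
geno = rng.binomial(2, 0.3, size=(n_ind, n_loci))
geno[:30, :10] = rng.binomial(2, 0.9, size=(30, 10))
geno[30:, :10] = rng.binomial(2, 0.1, size=(30, 10))

selected = top_loading_loci(geno, n_pcs=1, n_keep=10)
reduced = geno[:, selected]  # reduced matrix to pass on to clustering
print(selected, reduced.shape)
```

On real data the full SVD above would itself be expensive for hundreds of thousands of SNPs, so a truncated or randomized PCA would probably be needed in practice.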
But wouldn't that bias the estimate of the number of components in find.clusters? If the markers are not in LD, different sets of markers would be contributing to different components and the correlation between PCs should be close to zero.
I am running the analysis with a subset of 10,000 randomly selected markers (the original filtered set was around 500k SNPs), which still appears to be too many ...
I have a plink12-formatted file with 974 individuals and ~181,000 filtered SNPs. I have been running find.clusters and dapc for two days now in two different sessions, and neither has converged yet. The maximum number of clusters was set to 20, as was the number of PCs.
Is there any way to speed things up besides parallelization?
Thanks!