Abnormal results in find.clusters plots

aflorebe commented 6 years ago

Hello, I am trying to run a DAPC analysis on my genome-wide dataset, which includes 188736 genotypes for 188 individuals from 18 different geographic populations. I already know there is some genetic structure in the dataset; at least 2 groups can be defined. However, when I run the find.clusters() function to determine the most plausible number of groups for my dataset, I obtain strange plots of "Cumulative variance explained by PCA" and "Value of BIC vs number of clusters":

[Figure: Cumulative variance explained by PCA]

[Figure: Value of BIC vs number of clusters]

This is my R script:

    library("adegenet")
    snps <- read.PLINK(file = "file.raw", map.file = "file.map")
    grp <- find.clusters(snps, max.n.clust = 36, n.iter = 1000)
    dapc1 <- dapc(snps, grp$grp)
    scatter(dapc1)

Do you have any idea why I am getting these results and what they actually mean? Could it be that the number of genotypes and samples is so high that the function cannot handle them?

thibautjombart commented 6 years ago

Hi there

you can get this kind of graph from entirely unstructured datasets in which enough variation is retained at the PCA step. Try for instance:

    x <- replicate(1e4, runif(100))
    find.clusters(x)
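For readers who want to see the effect without the interactive prompts, here is a minimal base-R sketch of the same idea. It does not call find.clusters; it uses prcomp and kmeans directly and scores each k with BIC = n * log(WSS / n) + k * log(n), which is an assumption about how the k-means solutions are scored. The helper `bic_curve`, the data dimensions and the choice of 5 versus 90 retained axes are all hypothetical.

```r
set.seed(42)

# Entirely unstructured data: 100 "individuals", 2000 uniform random variables
n <- 100
p <- 2000
x <- matrix(runif(n * p), nrow = n)

# PCA step (centred, unscaled)
pca <- prcomp(x, center = TRUE, scale. = FALSE)

# Hypothetical helper: BIC of k-means solutions for k = 1..max_k,
# assuming BIC = n * log(WSS / n) + k * log(n)
bic_curve <- function(scores, max_k = 10) {
  n_obs <- nrow(scores)
  sapply(seq_len(max_k), function(k) {
    if (k == 1) {
      # With a single cluster, WSS is the total sum of squares around the mean
      wss <- sum(scale(scores, center = TRUE, scale = FALSE)^2)
    } else {
      wss <- kmeans(scores, centers = k, nstart = 5, iter.max = 100)$tot.withinss
    }
    n_obs * log(wss / n_obs) + k * log(n_obs)
  })
}

# Same data, different numbers of retained PCA axes
bic_few  <- bic_curve(pca$x[, 1:5,  drop = FALSE])
bic_many <- bic_curve(pca$x[, 1:90, drop = FALSE])

print(round(cbind(k = 1:10, few_axes = bic_few, many_axes = bic_many), 1))
```

Comparing the two columns shows how strongly the shape of the BIC curve depends on the number of axes retained, even though the data contain no groups at all.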

aflorebe commented 6 years ago

Hi, Thibaut! Thanks for your answer. However, I am quite confused by it.

I can see that my dataset is structured in at least 2 or 3 groups when applying other analyses such as PCA, MDS and AMOVA. Furthermore, in the ADMIXTURE analysis I can observe different ancestral components for different groups. I have also performed more robust haplotype-based analyses such as ChromoPainter and fineSTRUCTURE, and there I can discern the groups even more clearly.

When I try the analysis with different subsets of my data (90000, 50000, 10000, 1000 and 100 SNPs), I only start to see "normal"-looking graphs with fewer than 1000 variants. So I come back to my first question: is this package able to handle genome-wide array data? What is the maximum number of markers that can be used with the package?

thibautjombart commented 6 years ago

Hi again. Just to clarify: this discussion concerns 2 functions in the package, not the package as a whole. The general answer to your question is, from what I have seen on other datasets and what has been published: yes, it should.

The following are merely generalities, as I haven't looked at this specific dataset, but they should hopefully help you understand what is going on:

- p >> n (many more variables than individuals) typically creates landscapes where any groups can be discriminated; see the xvalDapc function for the DAPC
- on the role of the initial dimension reduction step: the idea is to separate the random noise from the structured variation, so that clusters can be better discerned; I suspect your problem here is too many axes (keeping all the noise); it is not a problem in lower dimensionality, but it starts being tricky when a few dozen structured markers are diluted among tens of thousands of non-structured ones; have you tried selecting fewer PCA axes?
- can you post the first plane of the PCA, and with it a screeplot of the eigenvalues?
- the AMOVA requires pre-defined groups; if you already know what the groups are, did you try inputting them directly in the DAPC?
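The suggestions above can be sketched roughly as follows. This is only an illustration on simulated data, not the dataset from this thread: the matrix `x`, the factor `known`, the shift of 1.5 on 20 variables and every numeric setting (n.pca = 5, n.pca.max = 30, n.rep = 10) are hypothetical values picked for the example. The adegenet calls are used with their documented arguments, run non-interactively.

```r
library(adegenet)
set.seed(1)

# Hypothetical stand-in data: 60 individuals, 500 variables,
# two groups that differ only at the first 20 variables
n <- 60
p <- 500
x <- matrix(rnorm(n * p), nrow = n)
known <- factor(rep(c("A", "B"), each = n / 2))
x[known == "B", 1:20] <- x[known == "B", 1:20] + 1.5

# 1. Screeplot of the PCA eigenvalues and first plane of the PCA
pca <- prcomp(x, center = TRUE, scale. = FALSE)
barplot(pca$sdev^2, main = "PCA eigenvalues")
plot(pca$x[, 1:2], col = as.integer(known), pch = 19)

# 2. find.clusters retaining only a few PCA axes, without prompts
grp <- find.clusters(x, n.pca = 5, max.n.clust = 10,
                     choose.n.clust = FALSE, criterion = "diffNgroup")
print(table(inferred = grp$grp, known = known))

# 3. DAPC using the groups that are already known
dapc_known <- dapc(x, known, n.pca = 5, n.da = 1)
print(mean(dapc_known$assign == known))

# 4. Cross-validation to choose how many PCA axes to retain
xval <- xvalDapc(x, known, n.pca.max = 30, n.rep = 10, xval.plot = FALSE)
print(names(xval))
```

On real SNP data the number of axes to keep is better read off the screeplot and the cross-validation output than copied from this sketch.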

thibautjombart / adegenet

Abnormal results in find.clusters plots #245