thibautjombart / adegenet

adegenet: a R package for the multivariate analysis of genetic markers

Abnormal results in find.clusters plots #245

Open aflorebe opened 6 years ago

aflorebe commented 6 years ago

Hello, I am trying to run a DAPC analysis on my genome-wide dataset, which includes 188736 genotypes for 188 individuals from 18 different geographic populations. I already know there is some genetic structure in the dataset: at least 2 groups can be defined. However, when I run the "find.clusters()" function to determine the most plausible number of groups that could explain my dataset, I obtain strange plots of "Cumulative variance explained by PCA" and "Value of BIC vs number of clusters":

[Figure: Cumulative variance explained by PCA]

[Figure: Value of BIC vs number of clusters]

This is my R script:

library("adegenet")
snps <- read.PLINK(file = "file.raw", map.file = "file.map")
grp <- find.clusters(snps, max.n.clust = 36, n.iter = 1000)
dapc1 <- dapc(snps, grp$grp)
scatter(dapc1)

Do you have any idea why I obtain these results and what they actually mean? Could it be that the number of genotypes and samples is so high that the function cannot handle them?

thibautjombart commented 6 years ago

Hi there

you can get this kind of graph from entirely unstructured datasets in which enough variation is retained at the PCA step. Try for instance:

x <- replicate(1e4, runif(100))
find.clusters(x)
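To see why unstructured data gives an uninformative "cumulative variance" curve, here is a minimal sketch in base R only. It is not what find.clusters() does internally: it assumes a plain prcomp() PCA as a stand-in, uses a smaller matrix than the example above so it runs quickly, and the names cum_var and n_axes_80 are made up for illustration.

```r
set.seed(1)

# Unstructured data: 100 "individuals" x 2000 "markers" of pure uniform noise
# (smaller than the 1e4 columns above, only to keep the run short)
x <- replicate(2000, runif(100))

# Plain PCA (centred, unscaled), used here as a stand-in for the PCA step
pca <- prcomp(x, center = TRUE, scale. = FALSE)

# Cumulative proportion of variance carried by the successive axes
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Number of axes needed to reach 80% of the total variance
n_axes_80 <- which(cum_var >= 0.8)[1]

print(round(cum_var[c(1, 10, 50)], 3))
print(n_axes_80)

# The curve rises almost linearly: no small set of axes stands out
plot(cum_var, type = "b",
     xlab = "Number of retained PCs",
     ylab = "Cumulative proportion of variance")
```

With data like this, every axis carries a similar, small share of the variance, so the curve has no elbow and retaining many axes mostly retains noise.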

aflorebe commented 6 years ago

Hi, Thibaut! Thanks for your answer. However, I am quite confused by it.

I can see that my dataset is structured in at least 2 or 3 groups using other analyses such as PCA, MDS and AMOVA. Furthermore, ADMIXTURE analysis shows different ancestral components for different groups. I have also performed more robust haplotype-based analyses with ChromoPainter and fineSTRUCTURE, and they discern the groups even more clearly.

When I try the analysis with different subsets of my data (90000, 50000, 10000, 1000 and 100 SNPs), I only start to see "normal"-looking graphs with fewer than 1000 variants. So I come back to my first question: is this package able to handle genome-wide array data? What is the maximum number of markers that can be used with the package?

thibautjombart commented 6 years ago

Hi again. Just to clarify: this discussion concerns 2 functions in the package, not the package as a whole. The general answer to your question is, from what I have seen on other datasets and what has been published: yes, it should.

The following are only generalities, as I haven't looked at this specific dataset, but they should hopefully help to understand what is going on: