thibautjombart / adegenet

adegenet: a R package for the multivariate analysis of genetic markers

the obtained graph doesn't represent individuals as dots in DAPC #319

Closed kopelol closed 2 years ago

kopelol commented 2 years ago

Hello everyone, I'm trying to run a DAPC analysis on a core-gene alignment FASTA file obtained from 95 bacterial strains, but I can't obtain a graph with individual dots.

First, I extracted SNPs from the multiple-alignment FASTA file.

x <- ("core.gene.aln.fasta")
y <- fasta2genlight(x, chunk=100)

y

 /// GENLIGHT OBJECT /////////

 // 95 genotypes, 198,174 binary SNPs, size: 4.7 Mb
 0 (0 %) missing data

 // Basic content
   @gen: list of 95 SNPbin
   @ploidy: ploidy of each individual (range: 1-1)

 // Optional content
   @ind.names: 95 individual labels
   @loc.all: 198174 alleles
   @position: integer storing positions of the SNPs
   @other: a list containing: elements without names

Then, I conducted DAPC.

grp <- find.clusters(y)
11.pdf
Choose the number PCs to retain (>=1): 80
22.pdf
Choose the number of clusters (>=2): 5

dapc1 <- dapc(y, grp$grp)
33.pdf
Choose the number PCs to retain (>=1): 80
44.pdf
Choose the number discriminant functions to retain (>=1): 4
scatter(dapc1)
55.pdf

Could you please give some advice? Thanks,

leonvarhan commented 2 years ago

Hello @kopelol. I am having the exact same issue you describe here with scatter(dapc). I'm using adegenet 2.1.5. My plot shows 4 groups (as 4 numbered boxes) but no individual samples. I was wondering whether you were able to identify and fix this issue with your data. Thank you!!!

kopelol commented 2 years ago

Hi @leonvarhan. Thank you for your reply. I hope so. Thanks,

zkamvar commented 2 years ago

Hi @kopelol,

You are not seeing any individual points because you are using 80 PCs to estimate 5 groups via clustering and then using the same PCs to fit the discriminant analysis to the groups that you just identified.

In short: you are over-fitting the model such that any within-group variance is vastly overshadowed by among-group variance and thus all the points within the groups are tightly packed.
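To make the effect concrete, here is a minimal sketch on simulated noise, not on the data from this issue. It swaps in `prcomp()` and `MASS::lda()` as stand-ins for the PCA and discriminant steps that `dapc()` performs, and it assigns the 5 group labels at random, so there is no real structure to find. The sizes (95 individuals, 2000 binary markers) and all object names are assumptions chosen for illustration.

```r
library(MASS)  # lda(); MASS is a recommended package shipped with standard R installs

set.seed(1)
n_ind <- 95    # same number of individuals as in this issue
n_snp <- 2000  # hypothetical marker count, kept small so the sketch runs quickly

# Pure-noise binary "SNP" matrix and random group labels: no true structure.
snps <- matrix(rbinom(n_ind * n_snp, 1, 0.5), nrow = n_ind)
grp  <- factor(sample(rep(1:5, length.out = n_ind)))

# PCA step (stand-in for the PCA done inside dapc()).
pcs <- prcomp(snps)$x

# Discriminant step on the first n_pca PCs, followed by two summaries:
#  - reassigned:   share of individuals assigned back to their own (random) group
#  - spread_ratio: SD of the group means on LD1 divided by the pooled within-group SD
summarise_fit <- function(n_pca) {
  x    <- pcs[, seq_len(n_pca), drop = FALSE]
  fit  <- lda(x, grouping = grp)
  pred <- predict(fit, newdata = x)
  ld1  <- pred$x[, 1]
  c(n_pca        = n_pca,
    reassigned   = mean(pred$class == grp),
    spread_ratio = sd(tapply(ld1, grp, mean)) / sqrt(mean(tapply(ld1, grp, var))))
}

res <- rbind(summarise_fit(5), summarise_fit(80))
print(res)
```

With 80 PCs for 95 individuals, even random groups come out cleanly separated and the within-group spread on the first discriminant axis shrinks relative to the distance between group means, which is the "tightly packed" picture described above. With 5 PCs the same random groups are barely distinguishable, as they should be.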

kopelol commented 2 years ago

Hi @zkamvar Thank you for your advice. I understand.

Generally, how many PCs should I use?

zkamvar commented 2 years ago

> Generally, how many PCs should I use?

There is no magic number of PCs to use. For DAPC, you want to avoid overfitting by retaining just enough PCs to describe the vast majority of the variance (e.g. enough PCs to account for ~80% of the variance). I suggest reading the DAPC tutorial, especially section 4, which goes into the instability of group memberships after overfitting.
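As a rough sketch of the "~80% of the variance" rule of thumb, the helper below returns the smallest number of PCs whose cumulative share of variance reaches a target. The function name `n_pcs_for_variance` and the toy eigenvalues are hypothetical, introduced only for illustration, and 0.8 is a starting point rather than a recommended value.

```r
# Smallest number of PCs whose cumulative share of variance reaches `target`.
# `eig` is a vector of PCA eigenvalues, sorted in decreasing order.
n_pcs_for_variance <- function(eig, target = 0.8) {
  stopifnot(is.numeric(eig), length(eig) > 0, target > 0, target <= 1)
  eig <- eig[eig > 0]                    # drop zero or negative eigenvalues
  cum_share <- cumsum(eig) / sum(eig)    # cumulative proportion of variance
  which(cum_share >= target)[1]
}

# Toy eigenvalues for illustration only (not taken from the data in this issue).
toy_eig <- c(50, 25, 10, 10, 5)
n_pcs_for_variance(toy_eig, 0.8)  # returns 3: the first three PCs hold 85%

# With adegenet, one possible use (assuming the $eig slot of the glPca result
# holds all eigenvalues, not only the retained ones):
#   pca   <- glPca(y, nf = 10)
#   n_pca <- n_pcs_for_variance(pca$eig, 0.8)
#   dapc1 <- dapc(y, grp$grp, n.pca = n_pca, n.da = 4)
```

The result is only a first guess. adegenet also provides `a.score()` and `optim.a.score()`, which can be used to check whether a given number of retained PCs still overfits.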