gtonkinhill / panaroo

An updated pipeline for pangenome investigation
MIT License
269 stars 34 forks source link

Pangenome Association Study - pyseer significant results? #78

Closed RicardoGCCR closed 4 years ago

RicardoGCCR commented 4 years ago

Hi, I am running the Pangenome Association Study pipeline, following the tutorial guidelines: prokka > panaroo > IQ-tree > pyseer; to estimate a GWAS of presence and absence genes on 4 bacterial groups that present different virulence phenotypes (each group with 30 genomes). Interestingly, the output of pyseer identifies significance on some genes present only on 2-4 genomes of one of the groups (p<1,07E-05). HOWEVER, there are a few genes on my data that are present in 29/30 genomes of only one group BUT the p value is just 0.313. Any idea why is this happening? Thanks for your assistance. Ricardo

gtonkinhill commented 4 years ago

Hi Ricardo,

It is hard to be sure without taking a closer look but this could be the result of the population structure correction used in pyseer. If the 2-4 genomes are relatively spread out over the phylogeny this will provide more support than if the 29 genomes all belong to the same clade. To look at this closer you could plot the phenotypes alongside the phylogeny to get a better idea of how the population structure may be influencing the results.

RicardoGCCR commented 4 years ago

Thank you for your reply. I also think it is a problem on the run of the phylogeny_distance.py. The output matrix K looks strange, as it is not symmetrical.
Could we just used the iq-tree.mldist file instead? as input for pyseer?

Here are the files of that phylogeny_distance.py step : https://gist.github.com/RicardoGCCR/2fa9d35b8cabb9bda98723566fb1b726

If you could take a look and tell me if you see something wrong I will deeply appreciate it.

Ricardo

gtonkinhill commented 4 years ago

I'm not sure if I'm understanding correctly but the matrix in pylogeny_K.tsv looks symmetric to me. I ran the following in R

library(data.table)
ph <- fread("~/Downloads/phylogeny_K.tsv", data.table = FALSE)
d <- data.matrix(ph[,2:ncol(ph)])
all(d==t(d))
RicardoGCCR commented 4 years ago

Hi, thank you for checking! I was confused with the non-zero diagonal of the matrix; but according to pyseer developers that is expected. I will explore the phylogeny issue as you suggested. Thank you again, Ricardo