mgalardini / pyseer

SEER, reimplemented in python 🐍🔮
http://pyseer.readthedocs.io
Apache License 2.0

Inflated QQ plot #129

Closed apredeus closed 3 years ago

apredeus commented 3 years ago

Hello,

I was wondering if there's a way to deal with the situation I'm seeing in my GWAS analysis. I've run the analysis in several ways (e.g. with MDS or hierBAPS clusters as fixed effects, with several ways of generating kinship matrices for the LMM, and also a few ways of generating phylogenies). I get inflated p-values in all of them; some are very bad, some less so, but none look acceptable (that is, close to y = x in the 0-2 range of -log p-value):

For example, this is the fixed effects QQ plot with 2nd order RhierBAPS clusters and mash distances from 10^5 sketches (this run was done with unitigs, but the VCF-based analysis looks about the same):

image

LMM looks better, but not by a great deal:

image

Pretty much all fixed effects runs look like (1), and LMMs look like (2).

The phylogeny suggests there are clusters of highly clonal sequences in the collection, which is probably what makes this challenging. How would you recommend addressing this? Use --covariates of some sort? Get rid of some of the clonal sequences?

Any advice would be greatly appreciated, as always.

johnlees commented 3 years ago

It's a good idea to look at the QQ plots as you have done, as I think they are typically the most reliable diagnostic for the issue. I usually check that at least the first part of the QQ plot is along y = x (usually true for a correctly fitted LMM, often not true for fixed effects models) to show that there has been some success in control. I can see this is the case for your LMM so it is at least working correctly, but of course there's still a lot of inflation.
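The QQ check described above can be sketched numerically. This is a minimal illustration and not part of pyseer: the function names `qq_points` and `genomic_inflation` are hypothetical, and the genomic inflation factor (median observed chi-square statistic divided by the null median for 1 degree of freedom) is a commonly used summary that was not mentioned in this thread.

```python
import math
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364231


def qq_points(pvalues):
    """Return (expected, observed) -log10 p-value pairs, smallest first.

    Hypothetical helper for illustration; p-values must lie in (0, 1].
    """
    observed = sorted(-math.log10(p) for p in pvalues)
    n = len(observed)
    # Under the null, p-values are uniform, so the i-th largest p-value
    # is expected near (n - i - 0.5) / n.
    expected = [-math.log10((n - i - 0.5) / n) for i in range(n)]
    return list(zip(expected, observed))


def genomic_inflation(pvalues):
    """Median chi-square (1 df) statistic divided by its null median."""
    normal = NormalDist()
    # A two-sided p-value maps to a 1 df chi-square statistic via z^2.
    chi2 = [normal.inv_cdf(p / 2) ** 2 for p in pvalues]
    return median(chi2) / CHI2_1DF_MEDIAN


# Toy data: uniform p-values (well calibrated) and squared ones (inflated).
n = 1000
uniform_p = [(i + 0.5) / n for i in range(n)]
inflated_p = [p * p for p in uniform_p]
lambda_null = genomic_inflation(uniform_p)
lambda_inflated = genomic_inflation(inflated_p)
```

Points from `qq_points` that sit on y = x at the low end correspond to the "first part of the QQ plot" check; a value of `genomic_inflation` well above 1 would point to the kind of inflation discussed here.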

Unfortunately, to my knowledge, there's probably little you can do given these observed genotypes and phenotypes. When the phenotype is very strongly correlated with genetic background there's usually little that can be gleaned about associations independent of background. You could look at your top ranked LMM values to prioritise follow-up, which is something we have used in the past with similar data, but exact quantitative inference from this is limited. Specifically checking for homoplasy in candidates may also be helpful extra information.

One other association you could try would be to add covariates with your BAPS clusters and use --no-distances, though I don't know a priori whether that would be any better than the LMM.
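As a rough sketch of preparing cluster covariates for such a run, the snippet below formats per-sample BAPS cluster assignments as a tab-separated table. The layout (header row, sample name in the first column) is an assumption about what pyseer accepts, and the helper name, column name, and sample names are all hypothetical; check the pyseer documentation for the exact file format and for how covariate columns are selected.

```python
def format_covariates(clusters, column="baps_cluster"):
    """Format {sample: cluster} as a tab-separated covariate table.

    Hypothetical helper: the header and column layout are assumptions,
    not taken from the pyseer documentation.
    """
    lines = [f"sample\t{column}"]
    for sample in sorted(clusters):
        lines.append(f"{sample}\t{clusters[sample]}")
    return "\n".join(lines) + "\n"


# Made-up cluster assignments for illustration only.
example_clusters = {"strain_B": 2, "strain_A": 1, "strain_C": 1}
table = format_covariates(example_clusters)
print(table, end="")
```

The resulting text could be written to a file and passed alongside --no-distances, so that the clusters are the only population structure correction in the model.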

apredeus commented 3 years ago

Thank you.

ireneortega commented 1 year ago

The same happened in my case. I can't understand how it can be impossible to completely 'remove' the population structure when the phenotype is widely distributed across the phylogeny, which suggests that the phenotype is not correlated with genetic background. Is this possible, or is it common in bacterial GWAS? I know bacteria can be grouped into clonal complexes that form subpopulations, and these could perhaps be the cause of the problem... Could it be due to oversampling of certain bacterial subpopulations?

Anyway, if looking at the top-ranked significant genes is the best way to interpret the result, that's OK, despite all the above. But which method should be chosen: fixed effects or LMM?

artmisk13 commented 5 months ago

@johnlees Hi John, thanks for the valuable input. I came across the same problem. May I ask: when you suggest prioritising the "top ranked LMM values", does that mean prioritising the hit with the smallest p-value? Thank you very much!

johnlees commented 5 months ago

precisely, yes
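
Ranking hits by smallest p-value can be sketched as below. This assumes a tab-separated results table with a p-value column named `lrt-pvalue`; that column name, the helper `top_hits`, and the example rows are assumptions for illustration, so adjust them to match the actual output file.

```python
import csv
import io


def top_hits(tsv_text, n=10, pvalue_column="lrt-pvalue"):
    """Return the n rows with the smallest p-values.

    Hypothetical helper; the default column name is an assumption.
    """
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    rows.sort(key=lambda row: float(row[pvalue_column]))
    return rows[:n]


# Made-up results for illustration only.
example = (
    "variant\tlrt-pvalue\n"
    "unitig_A\t1e-3\n"
    "unitig_B\t1e-12\n"
    "unitig_C\t0.4\n"
)
best = top_hits(example, n=2)
print([row["variant"] for row in best])
```

With inflated test statistics, this ordering serves only to prioritise candidates for follow-up, in line with the advice earlier in the thread that exact quantitative inference from such results is limited.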