mgalardini / pyseer

SEER, reimplemented in python 🐍🔮
http://pyseer.readthedocs.io
Apache License 2.0
111 stars 27 forks source link

possibly erroneous odds ratio? #227

Closed alexlevylab closed 1 year ago

alexlevylab commented 1 year ago

Hi, thanks for this great software, I like it a lot! However I am seeing strange results:

I am running a regression of bacterial genomes with a trait (0 or 1), and presence and absence of protein domains (0 or 1):

total number of genomes: 1458 total number genomes with trait A: 493; 5 of those encode for protein domain X total number of genomes without trait A: 965; 425 of those encode for protein domain X

I thought this would be a good example as a sanity check because the raw numbers are speak to a severe depletion of the protein domain X in the genomes with trait A.

When I run the lmm version of the program, I get the expected result (an odds ratio smaller than 1):

pyseer --lmm --phenotypes trait.tsv --pres domain_presence_absence.tsv --similarity phylogeny_similarity_lmm.tsv --cpu 15 --max-af 0.97 --filter-pvalue 1 > lmm.assoc

variant af filter-pvalue lrt-pvalue beta beta-std-err variant_h2 notes PF01111 2.95E-01 3.85E-65 9.24E-01 -8.60E-02 9.00E-01 2.50E-03

convert beta to odds ratio

predicted odds-ratio = 0.917

However, when I run the regular pyseer command, I get an unexpectedly inflated odds ratio:

pyseer --phenotypes trait.tsv --pres domain_presence_absence.tsv --distances phylogeny_distances.tsv --cpu 15 --lineage --lineage-file lineage_effects.out --max-af 0.97 --filter-pvalue 1 > pyseer.assoc

variant af filter-pvalue lrt-pvalue beta beta-std-err intercept PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 lineage notes PF011111 2.95E-01 3.85E-65 1.99E-09 9.16E+00 1.70E+00 -4.53E+00 4.13E+00 -1.15E+01 8.90E+00 -7.73E+00 -1.30E+00 -1.21E-01 1.89E+00 4.92E+00 4.17E+00 1.85E+00 MDS5

result

predicted odds-ratio = 9509.057

This odds ratio speaks to enrichment, and seems to me strangely inflated as well.

Forgive my ignorance, but is perhaps a feature of the stochastic fitting of the sigmoid to the data? Maybe somehow it diverges? Is it a bug? How can I interpret this? I would like to fix this issue, or at least know how to justify/understand why I am getting these values. Perhaps it is totally sensical, and is a hole in my understanding.

Thank you in advance!

johnlees commented 1 year ago

My guess would be that this is due to population structure, which you are (correctly) including as a correction in both methods. If it would help with your intiuition, you could try running without this (see --no-distances with the fixed effect model). You may also want to view the trait on the phylogeny. Typically it's not as simple as noting an obvious depletion, because these could all be on a single branch, so power to map it to an individual locus would be low.

alexlevylab commented 1 year ago

Hi Professor Lees,

I ran the pyseer code again with the --no-distances flag as you recommended. I am now seeing the expected result based on the raw data (negative beta; odds ratio < 1).

variant af filter-pvalue lrt-pvalue beta beta-std-err intercept PF01111 2.95E-01 3.85E-65 1.62E-86 -4.34E+00 4.54E-01 -1.01E-01

Upon observing the tree, it is indeed terribly imbalanced (see image; Red and blue = trait presence or absence; grey and navy blue = pfam presence or absence). Although the power to map the trait to a locus should be low, I am still getting a p-value of 1.99E-09 with the regular pyseer run (from my previous post).

Screen Shot 2023-02-19 at 12 20 48 PM

I therefore need a way to remove any cases like these where the population structure changes the sign of the beta, yet is passing the p-value filter. I am wondering if I can somehow use the fact that the --no-distances and the regular run had two different signs for beta (negative vs. positive) as a sign to disqualify the result. Does that make sense to you? I am not an expert and wonder if you think it's reasonable.

Thanks!

Alex

johnlees commented 1 year ago

I wouldn't necessarily recommend coming up with your own adhoc filtering criteria, and would instead stick to the best practices advice and the methods in the second half of the GWAS tutorial

Use the linear mixed model (which seemed to work here), check the QQ-plot, and check your findings both against the tree and biologically.