mgalardini / pyseer

SEER, reimplemented in python 🐍🔮
http://pyseer.readthedocs.io
Apache License 2.0

deflated values #197

Closed mac137 closed 2 years ago

mac137 commented 2 years ago

Hi! I have been performing a GWAS using the command below. When I draw a Q-Q plot from {output.associations}, the p-values look deflated. R^2 is listed as 0.89. I am a bit confused as to whether I can trust this GWAS, as I do get lots of hits (5852 after filtering). I know that p-value inflation is an issue, but I am not sure how to think about deflated p-values.

    pyseer --print-samples \
        --phenotypes {input.pheno} --phenotype-column {params.col} \
        --covariates {input.metadata} --use-covariates 2 \
        --uncompressed --kmers {input.unitigs} \
        --max-dimensions {params.dimensions} \
        --lineage --lineage-file {output.lineage} \
        --output-patterns {output.patterns} \
        --distance {input.dist} \
        --lmm --similarity {input.sim} \
        --lineage-clusters {input.lineage} \
        > {output.associations} 2> {log}

[Image: Q-Q plot]

I would appreciate any help. Thanks!

johnlees commented 2 years ago

What are you using as the significance threshold? I would suggest, from this QQ-plot, that there are no significant hits.

mac137 commented 2 years ago

Thank you so much for your fast reply. I filtered the hits as recommended in the tutorial: I ran python scripts/count_patterns.py kmer_patterns.txt to get the significance threshold, then used the awk command to filter. Some of the hits have a very low p-value, around 10^-80, and a lot of unitigs remained significant after filtering. I am not sure how to proceed in troubleshooting this. I installed pyseer via conda in a new environment, so this is probably not an installation issue.
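For readers following along, the thresholding and filtering step described above can be sketched in Python. This is a minimal illustration, not the tutorial's own script: it assumes the patterns file holds one pattern per line, that the threshold is a Bonferroni correction (0.05 divided by the number of unique patterns), and that the association output is tab-separated with an `lrt-pvalue` column. The function names and the toy data are hypothetical.

```python
import csv
import io


def bonferroni_threshold(pattern_lines, alpha=0.05):
    """Return (number of unique patterns, alpha / that number)."""
    unique = {line.strip() for line in pattern_lines if line.strip()}
    if not unique:
        raise ValueError("no patterns found")
    return len(unique), alpha / len(unique)


def filter_hits(assoc_lines, threshold, pvalue_column="lrt-pvalue"):
    """Keep rows whose p-value is strictly below the threshold."""
    reader = csv.DictReader(assoc_lines, delimiter="\t")
    return [row for row in reader if float(row[pvalue_column]) < threshold]


# Toy data (made up): five pattern lines, four of them unique.
patterns = ["a1", "b2", "a1", "c3", "d4"]
n_patterns, threshold = bonferroni_threshold(patterns)

# Toy association table with an assumed header layout.
toy_assoc = io.StringIO(
    "variant\taf\tfilter-pvalue\tlrt-pvalue\n"
    "ACGT\t0.2\t1e-3\t1e-80\n"
    "TTGA\t0.4\t0.5\t0.2\n"
)
hits = filter_hits(toy_assoc, threshold)
hit_variants = [row["variant"] for row in hits]
print(n_patterns, threshold, hit_variants)
```

With real files, the same functions could be fed `open("kmer_patterns.txt")` and the open association file; the column name should be checked against the actual header first.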

mac137 commented 2 years ago

Also, are you suggesting that I should ignore the significant hits after filtering because of the shape of this Q-Q plot, or do you think there is an issue with how I am running things (e.g. it is impossible to get a Q-Q plot like this with so many significant hits, so I am likely doing something wrong)?

mgalardini commented 2 years ago

From the plot it seems like you wouldn't have any variant with -log10(p-value) above ~2. However, your QQ-plot might be truncated and not show the full dataset. If so, can you post the full one here? If you are using the script we provide, you can comment out this line and the next one to make sure all the points are shown.
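As an independent check that does not depend on any plotting script, the Q-Q points and a genomic inflation factor can be computed directly from the full list of p-values. The sketch below is an assumption-laden illustration rather than pyseer's own code: `qq_points` and `genomic_lambda` are hypothetical helper names, expected quantiles use the common `(i + 0.5) / n` convention, and lambda is taken as the median 1-df chi-squared statistic divided by its expected median. P-values of exactly 0 are dropped because their logarithm is undefined.

```python
import math
from statistics import NormalDist, median

# Median of a chi-squared distribution with 1 degree of freedom,
# obtained from the normal quantile: chi2 = z**2 with z = inv_cdf(p / 2).
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.25) ** 2


def qq_points(pvalues):
    """Return (expected, observed) -log10(p) pairs for every valid p-value."""
    observed = sorted(p for p in pvalues if 0.0 < p <= 1.0)
    n = len(observed)
    return [
        (-math.log10((i + 0.5) / n), -math.log10(p))
        for i, p in enumerate(observed)
    ]


def genomic_lambda(pvalues):
    """Median observed chi-squared (1 df) over its expected median.

    Values below 1 indicate deflation, values above 1 inflation.
    """
    med_p = median(p for p in pvalues if 0.0 < p <= 1.0)
    z = NormalDist().inv_cdf(med_p / 2.0)
    return z * z / CHI2_1DF_MEDIAN


# Toy data (made up): an evenly spaced "null" set and a deflated set
# whose p-values are pushed towards 1.
uniform = [(i + 0.5) / 1000 for i in range(1000)]
deflated = [math.sqrt(p) for p in uniform]

lambda_uniform = genomic_lambda(uniform)
lambda_deflated = genomic_lambda(deflated)
points = qq_points(uniform)
print(lambda_uniform, lambda_deflated, len(points))
```

Because no point is dropped or capped here, a handful of very small p-values (such as 10^-80) would appear far above the diagonal even when the bulk of the distribution is deflated, which is one way a truncated plot and a long list of hits can coexist.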