mgalardini / pyseer

SEER, reimplemented in python 🐍🔮
http://pyseer.readthedocs.io
Apache License 2.0
104 stars 25 forks source link

Suggestions needed for the whole-genome model output #243

Open AmayAgrawal opened 11 months ago

AmayAgrawal commented 11 months ago

Hi,

This is not an issue but more of a question from my side.

I used the LMM model implemented in Pyseer to perform my GWAS analysis and figured out some significant variants that plays a role in resistance. Next, I read few review papers where they benchmarked different GWAS methods and they showed that for moderate to high LD (which is the case with the bacteria I am working), whole-genome models implemented in Pyseer performs better as compared to LMM's. So I wanted to try those models also on my data. Now, I have two questions regarding this:

1) I saw that whole-genome models are usually used for prediction of phenotype, but is it okay to use them for performing GWAS and getting the significant variants?

2) Anyways, I tried to use whole-genome models to perform GWAS on my data and used --distances parameter to calculate an adjusted p-value. After performing the GWAS, a part of my output sorted on lrt-pvalue looks like below:

Variant af filter-pvalue lrt-pvalue beta notes
1_7570_C_T 0.161 3.67E-48 2.93E-47 4.24 matrix-inversion-error
1_7585_G_C 0.764 6.97E-31 4.94E-36 4.12 matrix-inversion-error
1_1473246_A_G 0.253 5.51E-33 3.47E-27 0.287 matrix-inversion-error
1_4269271_A_G 0.0792 3.01E-24 1.21E-23 0.871 bad-chisq
1_2155168_C_G 0.606 2.84E-20 1.32E-17 -0.463 matrix-inversion-error
1_761155_C_T 0.384 2.95E-07 2.14E-09 0.0511 matrix-inversion-error
1_2157481_C_T 0.0453 1.11E-06 2.61E-08 -0.64 bad-chisq
1_2157569_C_A 0.0453 1.11E-06 2.61E-08 -4.97E-15 bad-chisq

In this output, if I look at the 'notes' column, I can see 'matrix-inversion-error' and 'bad-chisq' for all printed variants. So I am not sure whether I should consider this as significant variant or not (Sorry if I this information is already present in documentation and I missed it)? To correct for multiple testing in the LMM model, I used the --output-patterns parameter and then used the count_patterns.py script to determine Bonferroni corrected p-value threshold. But when I use the whole-genome model, I cannot use the --output-patterns parameter here. So what strategy can I use here to correct for multiple testing?

Thanks a lot in advance

johnlees commented 10 months ago

When you get the p-values from the whole genome model, it is actually just running GWAS with the fixed effects model and appending these to the table. Here it looks like that model might not be working too well, given those error messages. Maybe try the LMM to get your p-values?