mgalardini / pyseer

SEER, reimplemented in python 🐍🔮
http://pyseer.readthedocs.io
Apache License 2.0

Suggestions for using elastic net whole-genome models for GWAS #92

Closed apredeus closed 4 years ago

apredeus commented 5 years ago

Hello,

thank you for a great tool - it's amazing to see all the different capabilities (e.g. burden testing) in one suite. Great work!

I've recently come across a preprint that tested various GWAS approaches using simulated data (https://www.biorxiv.org/content/10.1101/795492v1), and the authors seem to come to the conclusion that whole-genome elastic net models substantially outperform other approaches, especially for variants with weaker effects. I seem to have such a dataset, at least as far as I can tell. So I have a few questions about using the "wg enet" option.

Any other suggestions would be greatly appreciated.

johnlees commented 5 years ago

Thanks for the kind comments! In answer to your questions:

Hope that helps. Let me know if you have further questions

cizydorczyk commented 5 years ago

So to be clear,

The p-values reported from JUST the elastic net analysis: are they association p-values that are simply not explicitly "adjusted" for population structure? From my understanding, some degree of population structure is already accounted for since we're using all (or most) variants in an elastic net, but improvements may be seen with an explicit measure (e.g. clonal complex designation). And are sequence weighting and specifying lineages (e.g. clonal complexes) two methods of incorporating further structure into the model?

And finally, an alternative to sequence weighting or the --lineage-clusters option is to use --distances, which tells pyseer to use a fixed effects model to test variants identified by the elastic net?

Any clarification is much appreciated!!

Thank you! Conrad

johnlees commented 5 years ago

The elastic net, and related options such as --sequence-reweighting and --lineage-clusters, only relate to the beta values (effect sizes) and the selection of variants that appear in the output. The implicit accounting for population structure is seen in the fact that many variants have an effect size of zero (and are not printed in the output).
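
To illustrate how an elastic net penalty leaves some variants with an effect size of exactly zero, here is a minimal pure-Python sketch of coordinate descent on a toy presence/absence matrix. It is not pyseer's implementation; the function names, the `lam` and `alpha` parameters, the linear (rather than logistic) model, and the toy data are all assumptions made for illustration.

```python
def center(values):
    """Subtract the mean so no intercept term is needed."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def soft_threshold(z, t):
    """L1 shrinkage: values within [-t, t] become exactly zero."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def elastic_net(variant_columns, phenotype, lam=0.05, alpha=0.5, sweeps=50):
    """Toy linear elastic net fitted by cyclic coordinate descent.

    Hypothetical helper: minimises
    (1/2n)*||y - Xb||^2 + lam*(alpha*|b|_1 + (1 - alpha)/2*|b|_2^2).
    """
    n = len(phenotype)
    cols = [center(c) for c in variant_columns]
    y = center(phenotype)
    beta = [0.0] * len(cols)
    for _ in range(sweeps):
        for j, xj in enumerate(cols):
            # Residual with the contribution of every other variant removed
            resid = [
                y[i] - sum(beta[k] * cols[k][i]
                           for k in range(len(cols)) if k != j)
                for i in range(n)
            ]
            rho = sum(xj[i] * resid[i] for i in range(n)) / n
            denom = sum(v * v for v in xj) / n + lam * (1 - alpha)
            beta[j] = soft_threshold(rho, lam * alpha) / denom
    return beta


# Toy data: 8 samples, 3 variants; only the first tracks the phenotype.
variants = [
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 0, 1, 0],
    [1, 1, 0, 0, 1, 1, 0, 0],
]
phenotype = [1, 1, 1, 1, 0, 0, 0, 0]

beta = elastic_net(variants, phenotype)
# Only variants with a non-zero effect size would be reported
selected = [j for j, b in enumerate(beta) if b != 0.0]
```

In this toy case the two variants that carry no signal are shrunk to exactly zero, so only the first variant would appear in a report of selected variants.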

The p-value columns are entirely separate from the elastic net. Each variant selected by the elastic net undergoes a univariate test using the SEER model, giving an unadjusted association p-value and, if distances are provided, also the p-value from a fixed effect regression. (The LMM is not automated yet, but you can merge results if you wish to assign LMM p-values to these variants.)
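
As a rough sketch of the merge mentioned above, the snippet below attaches LMM p-values from a separate run to the variants kept by the elastic net, joining on the variant identifier. The column names `variant` and `lrt-pvalue`, the added `lmm-pvalue` column, the helper name, and the toy tables are assumptions for illustration; check them against the headers of your own output files.

```python
import csv
import io


def merge_lmm_pvalues(enet_file, lmm_file, key="variant", pcol="lrt-pvalue"):
    """Add an 'lmm-pvalue' field to each elastic net row (hypothetical helper).

    Both inputs are file-like objects holding tab-separated tables with a
    header row. Variants missing from the LMM table get 'NA'.
    """
    lmm_pvalues = {
        row[key]: row[pcol]
        for row in csv.DictReader(lmm_file, delimiter="\t")
    }
    merged = []
    for row in csv.DictReader(enet_file, delimiter="\t"):
        row = dict(row)
        row["lmm-pvalue"] = lmm_pvalues.get(row[key], "NA")
        merged.append(row)
    return merged


# Toy tables standing in for the two result files (values are made up)
enet_results = io.StringIO(
    "variant\tbeta\tlrt-pvalue\n"
    "v1\t0.8\t1e-5\n"
    "v3\t-0.2\t0.04\n"
)
lmm_results = io.StringIO(
    "variant\tlrt-pvalue\n"
    "v1\t1e-4\n"
    "v2\t0.5\n"
    "v3\t0.2\n"
)

merged = merge_lmm_pvalues(enet_results, lmm_results)
for row in merged:
    print(row["variant"], row["beta"], row["lrt-pvalue"], row["lmm-pvalue"])
```

Only variants present in the elastic net table are kept, so the merged table can then be ranked by either p-value column.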

The current idea is that the elastic net can be useful for selecting variants, forming predictors and calculating heritability. The univariate p-values, which do not currently come from the elastic net itself, can be useful for further ranking the results. We may add to or change this in the future, but if we do I promise clearer column headers in the results to make this clear.

I hope that clarifies this issue, but let me know if there are further questions.

cizydorczyk commented 5 years ago

That really helps, thank you!! Clearly I need to read up on these methods before using them!