sgkit-dev / sgkit

Scalable genetics toolkit
https://sgkit-dev.github.io/sgkit
Apache License 2.0

genee review: sgkit implementation limitations #1180

Open jdstamp opened 10 months ago

jdstamp commented 10 months ago

The current implementation of sgkit.genee only covers one special case of genee. The method does not perform the regularization of genee. In the current implementation, it is expected that regularization has happened before the method is called. Did I overlook an implementation of the regularization?

To my understanding, regularization in genee serves not only to shrink the effect estimates through penalties but also to learn the approximately "true" SNP effects. The regularization proposed in the paper deflates the effect size estimates that arise from LD tagging. That is, the marginal effects from linear regressions performed on each SNP individually only give beta_marginal = LD_matrix * beta_true, so a SNP can have a nonzero marginal effect merely because it tags a true causal SNP. The regularization is a crucial step in improving the mapping (e.g. https://www.cell.com/ajhg/fulltext/S0002-9297(15)00365-1). If the LD matrix is invertible and no regularization penalty is wanted, it would be possible to compute beta_true = (LD_matrix)^-1 * beta_marginal. Without regularization, the implementation looks more like SKAT (https://doi.org/10.1016/j.ajhg.2011.05.029), even though it is still something different.
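
To make the LD tagging point concrete, here is a minimal NumPy sketch. The 4-SNP LD matrix, the effect sizes, and the helper `ridge_deflate` are all made up for illustration and are not part of sgkit or the genee package. The ridge-style solve is only a stand-in for a penalized fit, not necessarily the penalty used in the paper.

```python
import numpy as np

# Hypothetical toy LD (correlation) matrix for 4 SNPs.
# SNPs 0 and 1 are in strong LD; SNPs 2 and 3 are nearly independent.
ld = np.array([
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.1],
    [0.0, 0.0, 0.1, 1.0],
])

# Assumed "true" effects: only SNP 0 is causal.
beta_true = np.array([0.5, 0.0, 0.0, 0.0])

# Marginal (one-SNP-at-a-time) effects pick up signal through LD:
# beta_marginal = LD_matrix * beta_true
beta_marginal = ld @ beta_true
print(beta_marginal)  # SNP 1 shows 0.45 although its true effect is 0

# Unpenalized case with an invertible LD matrix:
# beta_true = (LD_matrix)^-1 * beta_marginal
beta_recovered = np.linalg.solve(ld, beta_marginal)


def ridge_deflate(ld, beta_marginal, lam):
    """Ridge-style deflation: solve (LD + lam * I) beta = beta_marginal.

    Illustrative helper only (an assumption, not an sgkit or genee API).
    """
    n_snps = ld.shape[0]
    return np.linalg.solve(ld + lam * np.eye(n_snps), beta_marginal)


beta_ridge = ridge_deflate(ld, beta_marginal, lam=0.1)
print(beta_ridge)  # the tagging SNP's effect is shrunk well below 0.45
```

Passing `beta_marginal` directly into a gene-level test treats the tagged effect at SNP 1 as real signal, whereas either solve above attributes most of it to SNP 0.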

tomwhite commented 10 months ago

> In the current implementation, it is expected that regularization has happened before the method is called. Did I overlook an implementation of the regularization?

No, I skipped the regression step when implementing this for sgkit. There's some discussion in #692 (and #975).