projectglow / glow

An open-source toolkit for large-scale genomic analysis
https://projectglow.io
Apache License 2.0

Extending sample masking functionality in gwas linear regression #416

Closed bcajes closed 2 years ago

bcajes commented 2 years ago

What changes are proposed in this pull request?

Added the following options and functionality to gwas linreg

The alternative to these options is to perform the same pre-processing steps before calling linear_regression. Besides the additional boilerplate code that approach requires, transforming genotype_df in Spark can be very expensive compared with the masking operations implemented in this change.
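To illustrate why masking can be cheaper than filtering, here is a minimal pure-Python sketch of a masked simple regression. It is not Glow's implementation: the function name `masked_slope`, the use of NaN to mark a missing phenotype, and the absence of covariates are all assumptions made for illustration. The point is only that a 0/1 mask lets each phenotype drop its own missing samples inside the sums, so the genotype values are read as-is and never subset per phenotype.

```python
import math


def masked_slope(genotype, phenotype):
    """Slope of a simple regression of phenotype on genotype (with intercept).

    Hypothetical helper: a missing phenotype is encoded as NaN and receives
    mask 0, so that sample contributes nothing to any of the sums below.
    """
    n = sum_g = sum_y = sum_gy = sum_gg = 0.0
    for g, y in zip(genotype, phenotype):
        missing = math.isnan(y)
        mask = 0.0 if missing else 1.0   # 1 = use this sample, 0 = ignore it
        y0 = 0.0 if missing else y       # zero-fill so NaN never enters a sum
        n += mask
        sum_g += mask * g
        sum_y += y0
        sum_gy += g * y0
        sum_gg += mask * g * g
    if n < 2:
        return float("nan")              # not enough observed samples
    denom = sum_gg - sum_g * sum_g / n
    if denom == 0.0:
        return float("nan")              # genotype is constant among observed samples
    return (sum_gy - sum_g * sum_y / n) / denom


# One genotype vector shared by two phenotypes with different missingness.
nan = float("nan")
genotype = [0, 1, 2, 1, 0, 2]
phenotypes = {
    "trait_a": [1.0, 3.0, nan, 3.0, nan, 5.0],   # observed values follow y = 2g + 1
    "trait_b": [nan, 0.5, 1.0, 0.5, 0.0, nan],   # observed values follow y = 0.5g
}
slopes = {name: masked_slope(genotype, y) for name, y in phenotypes.items()}
```

In a vectorized setting the same sums become products between the genotype matrix and the mask or zero-filled phenotype matrix, which is what makes per-phenotype missingness cheap to handle without reshaping the genotype data.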

How is this patch tested?

The inflation issues can be replicated by adding missing values to phenotype_df and disabling intersect_samples. These changes were tested on real-world and simulated data, and the results were compared with the Hail and statsmodels implementations to confirm that p-values and betas were concordant.
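A check of this kind could be sketched as follows, assuming that "inflation" here means genomic inflation of the test statistics (lambda GC) and that concordance is judged by the largest absolute difference between two result vectors. Both helper names are hypothetical and the statistics are simulated; nothing below calls Glow, Hail, or statsmodels.

```python
import random
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(test_statistics):
    """Lambda GC: median squared z (or large-sample t) statistic over the null median."""
    chi2 = [t * t for t in test_statistics]
    return statistics.median(chi2) / CHI2_1DF_MEDIAN


def max_abs_difference(ours, reference):
    """Largest absolute gap between two equally ordered result vectors (e.g. betas)."""
    return max(abs(a - b) for a, b in zip(ours, reference))


# Simulated statistics: well calibrated versus artificially inflated by 20%.
random.seed(0)
null_z = [random.gauss(0.0, 1.0) for _ in range(20000)]
inflated_z = [1.2 * z for z in null_z]

lambda_null = genomic_inflation(null_z)          # close to 1 when calibrated
lambda_inflated = genomic_inflation(inflated_z)  # about 1.2**2 times lambda_null
```

A lambda GC well above 1 on null data would flag inflation, while `max_abs_difference` applied to betas or p-values from two implementations gives a single number to compare against a tolerance.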

codecov[bot] commented 2 years ago

Codecov Report

Merging #416 (691afbc) into master (509d9af) will not change coverage. The diff coverage is n/a.


@@           Coverage Diff           @@
##           master     #416   +/-   ##
=======================================
  Coverage   93.69%   93.69%           
=======================================
  Files          95       95           
  Lines        4824     4824           
  Branches      466      466           
=======================================
  Hits         4520     4520           
  Misses        304      304           
