Closed bcajes closed 2 years ago
Merging #416 (691afbc) into master (509d9af) will not change coverage. The diff coverage is n/a.
@@           Coverage Diff           @@
##           master     #416   +/-   ##
=======================================
  Coverage   93.69%   93.69%
=======================================
  Files          95       95
  Lines        4824     4824
  Branches      466      466
=======================================
  Hits         4520     4520
  Misses        304      304
What changes are proposed in this pull request?
Added the following options and functionality to the GWAS linear regression (gwas linreg):
intersect_samples: The current implementation of linear regression is optimized for speed but is not robust to missing phenotype values. If missingness is not handled appropriately, p-values may become inflated due to imputation. When intersect_samples is enabled, samples that do not exist in the phenotype DataFrame are dropped from genotypes, offsets, and covariates prior to regression analysis. Note that if phenotypes in phenotype_df contain missing values, those samples are not dropped automatically; the user is responsible for deciding on acceptable levels of missingness and imputation. To prevent any imputation, drop all rows with missing values from phenotype_df before calling linear_regression. If covariates are provided, covariate and phenotype samples are intersected automatically.
genotype_sample_ids: Sample IDs from genotype_df, obtained either by applying glow.wgr.functions.get_sample_ids(genotype_df) or, if include_sample_ids=False was used when generating genotype_df, from an externally managed list of sample IDs that corresponds to the array of genotype calls.
The alternative to using these options is to perform these pre-processing steps before calling linear_regression. Besides the additional boilerplate code this requires, transforming genotype_df in Spark (as opposed to the masking operations implemented in this change) can be very expensive.
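To make the masking idea concrete, here is a minimal pandas/NumPy sketch of intersecting samples and building a boolean mask over the genotype call array. It is not the code from this PR: the helper name intersect_sample_sets, the column names, and the toy data are assumptions made for illustration, and it assumes phenotype and covariate DataFrames are indexed by sample ID.

```python
import numpy as np
import pandas as pd


def intersect_sample_sets(phenotype_df, covariate_df, genotype_sample_ids):
    """Hypothetical helper (not part of Glow): keep only samples present in
    the phenotype, covariate, and genotype inputs."""
    # Samples present in both sample-indexed DataFrames
    shared = set(phenotype_df.index) & set(covariate_df.index)
    # Boolean mask over positions in the genotype call array
    mask = np.array([s in shared for s in genotype_sample_ids], dtype=bool)
    # Keep genotype order so rows line up with the masked call array
    kept = [s for s in genotype_sample_ids if s in shared]
    return phenotype_df.loc[kept], covariate_df.loc[kept], mask


# Toy inputs: sample s3 has no phenotype, so it should be masked out
genotype_sample_ids = ["s1", "s2", "s3", "s4"]
phenotype_df = pd.DataFrame({"trait": [0.1, 0.4, 0.9]}, index=["s4", "s2", "s1"])
covariate_df = pd.DataFrame({"age": [30, 41, 52, 63]}, index=["s1", "s2", "s3", "s4"])

pheno, covar, mask = intersect_sample_sets(phenotype_df, covariate_df, genotype_sample_ids)

# Masking one variant's call array is a cheap in-memory operation,
# in contrast to rewriting the whole genotype DataFrame in Spark
calls = np.array([0, 1, 2, 1])
masked_calls = calls[mask]
```

The point of the sketch is that the genotype data itself is never rewritten; only a per-sample mask is applied to each array of calls.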
How is this patch tested?
Inflation can be reproduced by adding missing values to phenotype_df and disabling intersect_samples. These changes were tested on real-world and simulated data, and results were compared to the Hail and statsmodels implementations to ensure p-values and betas were concordant.
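As a rough sketch of the setup for that replication (not the actual test code from this PR), missing values can be injected into a simulated phenotype DataFrame and then dropped before regression. The trait names, sample IDs, and random data below are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Simulated phenotypes indexed by sample ID (illustrative data only)
rng = np.random.default_rng(0)
sample_ids = [f"s{i}" for i in range(10)]
phenotype_df = pd.DataFrame(
    {"trait_a": rng.normal(size=10), "trait_b": rng.normal(size=10)},
    index=sample_ids,
)

# Inject missing values for two samples to mimic incomplete phenotypes
with_missing = phenotype_df.copy()
with_missing.loc[["s2", "s7"], "trait_a"] = np.nan

# Drop every row that has a missing value so no imputation is needed
complete = with_missing.dropna()
```

Per the description above, passing with_missing with intersect_samples disabled is the case expected to show inflation, while passing complete with intersect_samples enabled removes the dropped samples from the other inputs as well.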