Extending sample masking functionality in gwas linear regression

What changes are proposed in this pull request?

Added the following options and functionality to GWAS linear regression:

intersect_samples: The current implementation of linear regression is optimized for speed, but is not robust to missing phenotype values. If missingness is not handled appropriately, p-values may become inflated due to imputation. When intersect_samples is enabled, samples that do not exist in phenotype_df are dropped from the genotypes, offsets, and covariates prior to regression analysis. Note that samples whose phenotypes in phenotype_df contain missing values are not dropped automatically; the user is responsible for deciding on acceptable levels of missingness and imputation. To prevent any imputation, drop all rows with missing values from phenotype_df before calling linear_regression. If covariates are provided, covariate and phenotype samples are intersected automatically.
genotype_sample_ids: Sample IDs for genotype_df, obtained either by applying glow.wgr.functions.get_sample_ids(genotype_df) or, if include_sample_ids=False was used when generating genotype_df, from an externally managed list of sample IDs that corresponds to the array of genotype calls.
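
The masking idea behind these two options can be sketched with pandas and NumPy. This is an illustration only, not the code in this PR: the helper name `mask_to_phenotype_samples` and the toy data are hypothetical, and the sketch assumes that phenotype_df is a pandas DataFrame indexed by sample ID and that the genotype values form an array whose columns follow genotype_sample_ids.

```python
import numpy as np
import pandas as pd


def mask_to_phenotype_samples(genotype_values, genotype_sample_ids, phenotype_df):
    """Hypothetical helper: keep only genotype columns whose sample is in phenotype_df.

    genotype_values: array of shape (n_variants, n_genotype_samples)
    genotype_sample_ids: sample IDs, one per genotype column
    phenotype_df: DataFrame indexed by sample ID
    """
    ids = pd.Index(genotype_sample_ids)
    keep = ids.isin(phenotype_df.index)  # boolean mask over genotype columns
    # Reorder phenotype rows to match the surviving genotype columns.
    aligned_pheno = phenotype_df.loc[ids[keep]]
    return genotype_values[:, keep], aligned_pheno


# Toy data (assumed for illustration): s2 has no phenotype, s3 has a missing value.
genotype_sample_ids = ["s1", "s2", "s3", "s4"]
genotype_values = np.array([[0, 1, 2, 1],
                            [2, 0, 1, 0]], dtype=float)
phenotype_df = pd.DataFrame({"trait": [0.5, 1.5, np.nan]},
                            index=["s4", "s1", "s3"])

# Intersection alone drops s2 but keeps s3, whose phenotype is NaN.
masked_genotypes, aligned_pheno = mask_to_phenotype_samples(
    genotype_values, genotype_sample_ids, phenotype_df)

# Dropping missing phenotype rows first removes s3 as well.
complete_genotypes, complete_pheno = mask_to_phenotype_samples(
    genotype_values, genotype_sample_ids, phenotype_df.dropna())
```

Masking an in-memory array with a boolean vector avoids rewriting the genotype data itself, which is the trade-off the description contrasts with transforming genotype_df in Spark.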

The alternative to using these options is to perform these preprocessing steps before calling linear_regression. Besides the considerable boilerplate code that approach requires, transforming genotype_df in Spark (as opposed to the masking operations implemented in this change) can be very expensive.

How is this patch tested?

[x] Unit tests
[ ] Integration tests
[x] Manual tests

Inflation can be reproduced by adding missing values to phenotype_df and disabling intersect_samples. These changes were tested on real-world and simulated data, and the results were compared against the Hail and statsmodels implementations to confirm that p-values and betas were concordant.
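
A simplified, NumPy-only sketch of this kind of check is shown here. It is not the test suite from this PR and does not call Glow, Hail, or statsmodels: the helper `ols_beta_t`, the simulated data, and the 30% missingness rate are all assumptions made for illustration. It fits a single-variant regression on complete cases and on mean-imputed phenotypes, and cross-checks the complete-case slope against `np.polyfit` as an independent reference.

```python
import numpy as np


def ols_beta_t(x, y):
    """Slope and t-statistic for y ~ intercept + x via ordinary least squares."""
    X = np.column_stack([np.ones_like(x), x])
    coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    dof = len(y) - X.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return coef[1], coef[1] / np.sqrt(cov[1, 1])


# Simulated data (assumed): one variant, one phenotype, ~30% missing phenotypes.
rng = np.random.default_rng(0)
n = 200
genotype = rng.integers(0, 3, size=n).astype(float)
phenotype = 0.3 * genotype + rng.normal(size=n)
phenotype[rng.random(n) < 0.3] = np.nan

observed = ~np.isnan(phenotype)

# Complete-case analysis: mask out samples with missing phenotypes.
beta_cc, t_cc = ols_beta_t(genotype[observed], phenotype[observed])

# Mean imputation: fill the gaps and regress on every sample.
imputed = np.where(observed, phenotype, np.nanmean(phenotype))
beta_imp, t_imp = ols_beta_t(genotype, imputed)

# Independent reference for the complete-case slope.
beta_ref = np.polyfit(genotype[observed], phenotype[observed], 1)[0]

print(f"complete case: beta={beta_cc:.4f}, t={t_cc:.2f}")
print(f"mean imputed:  beta={beta_imp:.4f}, t={t_imp:.2f}")
print(f"reference:     beta={beta_ref:.4f}")
```

In this single-phenotype setting, mean imputation shrinks the slope toward zero, so the two fits are not interchangeable. The sketch does not attempt to reproduce the p-value inflation described above, which was observed with the actual linear_regression implementation.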

projectglow / glow

Extending sample masking functionality in gwas linear regression #416