Better handling of missing phenotypes in step 1 and step 2 of GloWGR

tl;dr: mask missing samples under the hood in both step 1 and step 2 of GloWGR, for quantitative and binary traits. Currently this is done only in step 2 for binary traits (approx-Firth logistic regression).

Currently, quantitative traits with missing phenotypes require mean imputation, because missing values are not supported. For binary traits, missing phenotypes are masked under the hood.
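To make the two treatments concrete, here is a minimal NumPy sketch of mean imputation versus masking on a small phenotype matrix. It is only an illustration: the helper names `mean_impute` and `observed_mask` are hypothetical and are not part of the Glow API.

```python
import numpy as np

def mean_impute(Y):
    """Replace each NaN with the observed mean of its phenotype column."""
    Y = np.asarray(Y, dtype=float)
    col_means = np.nanmean(Y, axis=0)          # per-phenotype mean over observed samples
    return np.where(np.isnan(Y), col_means, Y)

def observed_mask(Y):
    """Boolean matrix: True where a phenotype value is observed."""
    return ~np.isnan(np.asarray(Y, dtype=float))

# Rows are samples, columns are phenotypes (toy data).
Y = np.array([
    [1.0, np.nan],
    [2.0, 4.0],
    [np.nan, 6.0],
    [3.0, np.nan],
])

imputed = mean_impute(Y)      # every sample kept, gaps filled with the column mean
mask = observed_mask(Y)       # gaps stay gaps; only observed samples are used
n_observed = mask.sum(axis=0) # effective sample size per phenotype
```

With imputation every phenotype appears to have four samples, whereas the mask records that only three and two samples were actually observed.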

P values are massively inflated when phenotypes have missing values, at all scales (details and a reproduction of the issue are in https://github.com/projectglow/glow/pull/391). The only workaround is to filter the Delta table each time an analysis is run, which adds cost and makes it impossible to run multiple phenotypes simultaneously.
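The shape of that workaround can be sketched in pandas, standing in for the Delta table. This is an assumption-laden illustration, not the code from the linked PR: the function name `per_phenotype_inputs` and the toy column names are hypothetical.

```python
import numpy as np
import pandas as pd

def per_phenotype_inputs(phenotypes, genotypes):
    """Yield (name, y, X) with missing samples dropped separately for each phenotype."""
    for name in phenotypes.columns:
        y = phenotypes[name].dropna()      # filter samples for this phenotype only
        yield name, y, genotypes.loc[y.index]

# Toy data indexed by sample ID.
samples = ["s1", "s2", "s3"]
phenotypes = pd.DataFrame(
    {"height": [1.0, np.nan, 3.0], "bmi": [np.nan, 2.0, 4.0]}, index=samples
)
genotypes = pd.DataFrame({"v1": [0, 1, 2]}, index=samples)

# Each phenotype ends up with its own sample set, so each needs its own run.
results = {name: (y, X) for name, y, X in per_phenotype_inputs(phenotypes, genotypes)}
```

Because every phenotype is paired with a different subset of samples, the phenotypes can no longer share one genotype matrix in a single pass.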

To prevent inflation, missing phenotypes must also be masked during offset generation (step 1), for both quantitative and binary traits. Imputation becomes problematic for phenotypes with high missingness.

Clustering phenotypes with similar levels of missingness does not resolve the issue in biobank data, because few phenotypes cluster nicely together.

Masking during steps 1 and 2 would allow the Delta table to be used as a single source of truth and would dramatically improve end-user productivity (the workarounds take weeks of engineering effort; see https://github.com/projectglow/glow/pull/391).
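As a rough illustration of why masking permits a single pass over all phenotypes, the sketch below fits a simple per-phenotype linear regression of one variant against every phenotype column at once, using a mask instead of per-phenotype filtering. It is a generic NumPy example under simplifying assumptions (one variant, no covariates, no ridge or offset terms), not GloWGR's implementation; `masked_effects` is a hypothetical name.

```python
import numpy as np

def masked_effects(x, Y):
    """Slope of each phenotype column on x, using only samples observed for that column."""
    x = np.asarray(x, dtype=float)
    Y = np.asarray(Y, dtype=float)
    M = ~np.isnan(Y)                                   # (n, p) observed mask
    n_obs = M.sum(axis=0)                              # samples observed per phenotype
    x_mean = np.where(M, x[:, None], 0.0).sum(axis=0) / n_obs
    y_mean = np.where(M, Y, 0.0).sum(axis=0) / n_obs
    # Center within the observed samples; masked entries contribute zero.
    Xc = np.where(M, x[:, None] - x_mean, 0.0)
    Yc = np.where(M, Y - y_mean, 0.0)
    beta = (Xc * Yc).sum(axis=0) / (Xc ** 2).sum(axis=0)
    return beta, n_obs

# Toy data: 50 samples, 3 phenotypes, roughly 30% of values missing at random.
rng = np.random.default_rng(0)
x = rng.integers(0, 3, size=50).astype(float)          # genotype dosages 0/1/2
Y = rng.normal(size=(50, 3))
Y[rng.random(Y.shape) < 0.3] = np.nan

beta, n_obs = masked_effects(x, Y)

# Reference: filter each phenotype separately and fit it on its own.
reference = np.array([
    np.polyfit(x[~np.isnan(Y[:, j])], Y[~np.isnan(Y[:, j]), j], 1)[0]
    for j in range(Y.shape[1])
])
```

The masked estimates should agree with the per-phenotype filtered fits, while the phenotype matrix is read once and never split.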
