WGR small inconsistencies and possible improvements

eric-czech commented 4 years ago

Hi,

I was looking through the preprint and some of the code for the new WGR functionality and I love the work you guys have done on it -- thank you for sharing that all! The work that went into the documentation is much appreciated too, especially the plain-text descriptions (an all too uncommon bridge between methods of a paper and the code itself IMO).

I had a few thoughts/suggestions that I thought I would share in looking at the code:

Matrix Inversion
- Have you considered using something other than LU decomposition (via np.linalg.inv) for inverting (XtX + λI)? I worry that even though the whole genotype matrix would rarely be singular, I think it would be reasonable to expect that certain blocks of it may be
- I think one good solution could be to swap inv for pinv to use an SVD solver instead in: https://github.com/projectglow/glow/blob/f3edf5bb8fe9c2d2e1a374d4402032ba5ce08e29/python/glow/wgr/linear_model/functions.py#L154
- Some experiments on small datasets produced the exact same results this way when all blocks are non-singular
- Another could be to use the fact that (XtX + λI) is symmetric and switch to Cholesky decomposition
- I'm not very familiar with the performance characteristics of each, but I know the computational complexity of pseudo-inverses and cholesky can be lower than LU and even if they don't make for a big improvement in speed, I think the stability and robustness to a single block with perfectly correlated columns would be worth it
Genotype Scaling
- I saw that the demo notebook standardizes on 0 degrees of freedom for std calculations while the stddev calculated for the genotypes matrix itself does not: https://github.com/projectglow/glow/blob/75ea111e0ba86001ea213df103373125b8732778/core/src/main/scala/io/projectglow/sql/expressions/MomentAggState.scala#L68
- It's a very minor point, but perhaps this should be count instead of count - 1 to be consistent
Reduced Block Scaling
- It appears that the stage 1 predictions are standardized on a per-sample-block basis rather than across all samples as stated in the preprint and in the c++ code at: https://github.com/rgcgithub/regenie/blob/master/src/Step1_Models.cpp#L250
- Was this intentional? I can see how that would be a difficult thing to change given that it happens in apply_model under different contexts, but I thought I would point it out since it could make for pretty dramatic differences when small sample block sizes are used (and be less stable if stddev approaches 0 in a block)

Thanks again!

LelandBarnard commented 4 years ago

Dear Eric, thank you for your interest and comments! Regarding the matrix inversion, we are relying on the fact that the formulation of the ridge loss guarantees that XtX + λI is well-conditioned even if the matrix X is singular or nearly singular, and under these conditions the exact matrix inversion is reliable and generally faster than the pseudoinverse. We have also experimented with decomposition approaches that are faster per a matrix inverse, but the cost of the decomposition generally outweighs the gain in performance unless the number of regularization parameters (and hence the number of times we need to perform the inverse) is fairly high (much higher than the value of 5 suggested in the paper).

Great catch on the genotype scaling issue. I think we should make the change that you suggested for consistency.

Regarding the reduced block scaling, you are absolutely right about that difference between the Glow implementation and regenie (we should more clearly document this in the code). In regenie, the reduced matrix is standard scaled at once but in Glow, we standard scale within each sample block. This approximation was made mostly for the benefits of simplicity and performance, but it is analogous to the "batch norm" approach that is commonly employed in many other machine learning contexts, where it is quite robust in general and usually has only a slight regularization effect on the parameter estimates. In the testing we have done and benchmarking against the outputs of regenie this approximation seems to work quite well, but it is something we still need to test more thoroughly e.g. in smaller cohorts.

eric-czech commented 4 years ago

the ridge loss guarantees that XtX + λI is well-conditioned even if the matrix X is singular

Ahh ok, I was vaguely remembering the ridge penalty addition as making XtX better conditioned but forgot that the sum was guaranteed to be full rank. The only exception I can find to that now (after a little refreshing) is if the λ value is negative and equal to an eigenvalue of XtX, which is easy enough to prevent.

I think we should make the change that you suggested for consistency.

👍

In the testing we have done and benchmarking against the outputs of regenie this approximation seems to work quite well, but it is something we still need to test more thoroughly e.g. in smaller cohorts.

I certainly don't doubt it's similar enough to ignore the differences in the general case, but even for large sample sizes what do you do now for situations where you have say 100,002 samples and a sample block size of 10k? Doing the normalization for 2 samples in the last block is a problem no?

eric-czech commented 4 years ago

Doing the normalization for 2 samples in the last block is a problem no?

Nm, I see why not now with a sample_block_count parameterization rather than sample_block_size

karenfeng commented 4 years ago

Thanks for the comments @eric-czech! If you open a PR to resolve the genotype scaling inconsistency, I'm more than happy to review.

projectglow / glow

WGR small inconsistencies and possible improvements #255