projectglow / glow

An open-source toolkit for large-scale genomic analysis
https://projectglow.io
Apache License 2.0

WGR small inconsistencies and possible improvements #255

Closed · eric-czech closed this issue 4 years ago

eric-czech commented 4 years ago

Hi,

I was looking through the preprint and some of the code for the new WGR functionality, and I love the work you've done on it -- thank you for sharing it all! The work that went into the documentation is much appreciated too, especially the plain-text descriptions (an all-too-uncommon bridge between the methods of a paper and the code itself, IMO).

I had a few thoughts/suggestions that I thought I'd share after looking at the code:

- Matrix inversion: the ridge solve uses an exact inverse of XtX + λI; would a pseudoinverse (or a decomposition-based solve) be safer if the matrix is ill-conditioned?
- Genotype scaling: there seems to be a small inconsistency in how genotype values are standardized.
- Reduced block scaling: the reduced matrix appears to be standard scaled within each sample block, whereas regenie standard scales it all at once.

Thanks again!

LelandBarnard commented 4 years ago

Dear Eric, thank you for your interest and comments! Regarding the matrix inversion, we are relying on the fact that the formulation of the ridge loss guarantees that XtX + λI is well-conditioned even if the matrix X is singular or nearly singular, and under these conditions the exact matrix inversion is reliable and generally faster than the pseudoinverse. We have also experimented with decomposition approaches that are faster per matrix inversion, but the cost of the decomposition generally outweighs the gain in performance unless the number of regularization parameters (and hence the number of times we need to perform the inversion) is fairly high, much higher than the value of 5 suggested in the paper.
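For concreteness, here is a minimal NumPy sketch of the trade-off being described (illustrative only, not Glow's actual implementation; the function names and shapes are hypothetical):

```python
import numpy as np

def ridge_direct(X, y, lam):
    """One direct solve of the ridge normal equations.

    For lam > 0, X.T @ X + lam * I is symmetric positive definite even
    when X is rank-deficient, so an exact solve is reliable and avoids
    the extra cost of a pseudoinverse.
    """
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def ridge_eig_grid(X, y, lams):
    """Amortize one eigendecomposition of XtX across a grid of lambdas.

    The decomposition costs more up front, but each additional lambda
    then needs only O(p^2) work, so this wins only when the grid is
    much larger than the ~5 values mentioned above.
    """
    d, V = np.linalg.eigh(X.T @ X)      # XtX = V @ diag(d) @ V.T
    Vty = V.T @ (X.T @ y)
    return [V @ (Vty / (d + lam)) for lam in lams]
```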

Great catch on the genotype scaling issue. I think we should make the change that you suggested for consistency.

Regarding the reduced block scaling, you are absolutely right about that difference between the Glow implementation and regenie (we should document it more clearly in the code). In regenie, the reduced matrix is standard scaled all at once, but in Glow we standard scale within each sample block. This approximation was made mostly for simplicity and performance, but it is analogous to the "batch norm" approach commonly employed in other machine learning contexts, where it is quite robust in general and usually has only a slight regularization effect on the parameter estimates. In the testing and benchmarking we have done against the outputs of regenie, this approximation seems to work quite well, but it is something we still need to test more thoroughly, e.g. in smaller cohorts.
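To make the difference concrete, here is a toy NumPy sketch of the two scaling schemes (hypothetical helpers, not code from either project):

```python
import numpy as np

def scale_globally(X):
    """Standard scale the full reduced matrix at once (regenie-style)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def scale_within_blocks(X, boundaries):
    """Standard scale each sample block independently (the Glow-style
    approximation described above); boundaries includes 0 and len(X)."""
    out = np.empty_like(X, dtype=float)
    for start, stop in zip(boundaries[:-1], boundaries[1:]):
        block = X[start:stop]
        out[start:stop] = (block - block.mean(axis=0)) / block.std(axis=0)
    return out

# Toy comparison: 4 sample blocks of 2,500 samples each.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))
diff = scale_globally(X) - scale_within_blocks(X, [0, 2_500, 5_000, 7_500, 10_000])
print(np.abs(diff).max())   # small when blocks are large and homogeneous
```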

eric-czech commented 4 years ago

> the ridge loss guarantees that XtX + λI is well-conditioned even if the matrix X is singular

Ahh ok, I was vaguely remembering the ridge penalty as making XtX better conditioned, but forgot that the sum is guaranteed to be full rank. The only exception I can find now (after a little refreshing) is if λ is negative and -λ happens to be an eigenvalue of XtX, which is easy enough to prevent.
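A quick NumPy sanity check of that point (toy data, just illustrating the eigenvalue shift):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
X[:, 4] = X[:, 0]                      # force X to be rank-deficient
XtX = X.T @ X

lam = 0.1
print(np.linalg.eigvalsh(XtX).min())                     # ~0: singular
print(np.linalg.eigvalsh(XtX + lam * np.eye(5)).min())   # >= lam: invertible
# The eigenvalues of XtX + lam*I are eig(XtX) + lam, and XtX is PSD, so
# singularity can only reappear when lam < 0 and -lam is an eigenvalue.
```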

> I think we should make the change that you suggested for consistency.

👍

> In the testing and benchmarking we have done against the outputs of regenie, this approximation seems to work quite well, but it is something we still need to test more thoroughly, e.g. in smaller cohorts.

I certainly don't doubt it's similar enough to ignore the differences in the general case, but even at large sample sizes, what do you do in situations where you have, say, 100,002 samples and a sample block size of 10k? Standardizing within a final block of only 2 samples would be a problem, no?

eric-czech commented 4 years ago

> Standardizing within a final block of only 2 samples would be a problem, no?

Never mind, I see why not now: with a sample_block_count parameterization rather than sample_block_size, the samples are divided evenly across blocks, so there is no tiny remainder block.
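For illustration, a count-based split in NumPy (a sketch of the idea, not Glow's actual blocking logic):

```python
import numpy as np

n_samples = 100_002
sample_block_count = 10

# Splitting by block *count* spreads the remainder across blocks, so no
# block is left with only a couple of samples.
blocks = np.array_split(np.arange(n_samples), sample_block_count)
print([len(b) for b in blocks])
# [10001, 10001, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000]
```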

karenfeng commented 4 years ago

Thanks for the comments @eric-czech! If you open a PR to resolve the genotype scaling inconsistency, I'm more than happy to review.