projectglow / glow

An open-source toolkit for large-scale genomic analysis
https://projectglow.io
Apache License 2.0
263 stars 110 forks source link

Minor inconsistencies in score-logistic p-values #471

Closed alev000 closed 5 months ago

alev000 commented 2 years ago

I've been testing logistic regression with correction="approx-firth". I noticed that during the preliminary calculation of score-logistic p-values (ie, the non-firth-corrected p-values that are used to determine whether further refinement is necessary), the X values go through two "residualization" steps (step A and step B). However, when correction="none", only the second residualization step (step B) is performed.

This has the potential to yield different score-logistic p-values depending on whether or not approx-firth correction is enabled. IMO, a better place for step A to occur would be after the score-logistic p-values have been computed (somewhere around here, for example).

This seems like a potential bug, although I should mention that for all of the spot-checks that I've done, the intermediate score-logistic p-values (for correction = "none" vs "approx-firth") are quite close to each other but not exactly the same.

cc @henrydavidge, @karenfeng -- Would you be able to shed any light on this?

williambrandler commented 2 years ago

yeah this is expected that there are minor discrepancies in p-values, but only for variants that show no association with the phenotype (i.e. p > 0.05). This design feature was introduced to improve performance

alev000 commented 2 years ago

@williambrandler -- Thanks for the reply. Are you referring to the difference between the score-logistic p-values and firth-logistic p-values? If so, I'm aware that score-logistic is designed to be a rough estimate, and those discrepancies are expected.

My comment above is about the fact that, when correction == "approx-firth", there are 2 redundant residualizations that happen before computing score-logistic p-values. The first residualization (what I called step A above) does eventually need to happen, but as far as I can tell, placing step A above step B doesn't actually make anything faster; it only seems to introduce some noise into the calculation of the score-logistic p-values. It's probably a tiny amount of noise in most cases, but it caused me some confusion when I was looking for reproducible results. (In particular, note that step A modifies the matrix X in-place, which can affect the result of step B.)

As mentioned above, my proposed solution would be to move step A to here, where it would no longer interfere with the score-logistic calculations. But I'm not sure whether there's some subtlety I'm missing.

williambrandler commented 2 years ago

Thank you for discovering this issue.

Are you able to join the Glow Office Hours next Wednesday at 8am (PDT) / 11am (ET) to discuss this issue further? Here are the details, please also share your email with glow.contributors@gmail.com

Wednesday, February 23⋅8:00 – 8:45am Monthly on the fourth Wednesday Join Zoom Meeting https://databricks.zoom.us/j/88600359264?pwd=dTAwZnBWWE12Y3RROEIwb2lRMmFTZz09

Meeting ID: 886 0035 9264 Passcode: 334168

williambrandler commented 2 years ago

@alev000 can you make the meeting (Wed Feb 23 at 8am (PDT) / 11am (ET)) to discuss further?