josefin-werme / LAVA

run.multireg(param.lim) Warning: Estimates too far out of bounds #71

Closed EPTissink closed 7 months ago

EPTissink commented 7 months ago

Hi!

I'm running LAVA analyses for 3 traits. I have found an interesting locus with significant h2 and rg:

> run.bivar(locus, phenos=c("Trait1","Trait2","Trait3"))
#   phen1  phen2      rho rho.lower rho.upper       r2 r2.lower r2.upper           p
#1 Trait1 Trait2 0.744738   0.50447   1.00000 0.554635  0.25449  1.00000 8.38397e-07
#2 Trait1 Trait3 0.633734   0.35965   0.91428 0.401619  0.12935  0.83591 5.06478e-05
#3 Trait2 Trait3 0.972419   0.81815   1.00000 0.945599  0.66936  1.00000 2.78654e-08

However, when trying to tease apart the relations with pcor and multiple regression, I encounter this error:

> run.multireg(locus, target='Trait1', adap.thresh=c(1e-04, 1e-06))
#[1] "~ Running multiple regression for outcome 'Trait1', with predictors 'Trait2', 'Trait3'"
#Warning: Estimates too far out of bounds (+-1.5) for phenotype(s) Trait2 ~ Trait1 (2.362), Trait2 ~ Trait1 (-1.663) in locus 464. 
#Values will be set to NA. To change this threshold, modify the 'param.lim' argument
[[1]]
[[1]][[1]]
  predictors outcome gamma gamma.lower gamma.upper r2 r2.lower r2.upper  p
1  Trait2 Trait1    NA          NA          NA NA       NA       NA NA
2  Trait3 Trait1    NA          NA          NA NA       NA       NA NA

The manual says

The +- threshold at which estimated parameters are considered to be too far out of bounds. If the estimated parameter exceeds this threshold, it is considered unreliable and will be set to NA.

Could you perhaps elaborate on what is meant by "unreliable", and what the risk is in interpreting the estimates when param.lim is set to e.g. 2.5?

Thanks!

cadeleeuw commented 7 months ago

The answer depends somewhat on the specific parameter. For (partial) correlations it mainly comes down to instability in the estimate: values well beyond +/-1 generally occur because the signal-to-noise ratio is very low, so that division by noisy variance estimates produces extreme values. Estimates can also go past +/-1 simply because the true value being estimated is very close to +/-1, but if the signal is strong enough they won't go far over.
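
To illustrate the point about noisy variance terms in the denominator, here is a toy R sketch. It is not LAVA's estimator: `rho_hat`, the variance values, and the error size `err` are all made up for illustration. The same absolute error in the variance estimates barely moves the correlation when the variances are large, but pushes it far past 1 when they are small.

```r
# Toy illustration only (not LAVA's estimator): a correlation is a covariance
# divided by the square root of the product of two variances.
rho_hat <- function(cov12, var1, var2) cov12 / sqrt(var1 * var2)

true_rho <- 0.8
err <- 0.004  # hypothetical amount by which both variances are underestimated

# Strong signal: true variances of 0.05, so the error is small in relative terms.
v_strong <- 0.05
strong <- rho_hat(true_rho * v_strong, v_strong - err, v_strong - err)

# Weak signal: true variances of 0.005, so the same error removes most of the variance.
v_weak <- 0.005
weak <- rho_hat(true_rho * v_weak, v_weak - err, v_weak - err)

# strong is about 0.87, weak is about 4
print(round(c(strong = strong, weak = weak), 2))
```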

This problem tends to get worse with partial correlations, and more so with larger numbers of predictors, since each one introduces another source of noise and more estimated variance terms show up in the denominators of the underlying math.
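
The role of those denominators can be seen in the textbook formula for a first-order partial correlation, sketched here in R with the point estimates from the run.bivar output above. This is the standard formula applied to the reported rho values, not necessarily how LAVA computes partial correlations internally, and `pcor_xy_z` is a hypothetical helper name.

```r
# Standard first-order partial correlation of x and y, conditioning on z.
# Hypothetical helper, written for illustration only.
pcor_xy_z <- function(r_xy, r_xz, r_yz) {
  (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
}

# Point estimates (rho) copied from the run.bivar output in this thread.
r12 <- 0.744738  # Trait1, Trait2
r13 <- 0.633734  # Trait1, Trait3
r23 <- 0.972419  # Trait2, Trait3

# Trait1 and Trait2, conditioning on Trait3.
p12_3 <- pcor_xy_z(r12, r13, r23)

# The factor contributed by the Trait2-Trait3 correlation is close to zero,
# so small errors in the numerator are strongly amplified.
denom_factor <- 1 - r23^2

print(round(c(pcor = p12_3, denom_factor = denom_factor), 3))
```

Each additional conditioning variable adds further estimated terms of this kind, which is where the extra instability comes from.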

For multiple regression the situation is slightly different: standardized regression coefficients aren't technically constrained to the -1 to 1 interval like correlations are, so they can legitimately exceed it. However, standardized coefficients going far beyond +/-1 are indicative of strong collinearity, which is the case here as well: the estimated correlation between the two predictors is 0.97.

In general, strong collinearity already makes regression parameters risky to interpret, because the model adjusts the parameters to fit the effect of the small amount of non-shared information in the collinear predictors. In the LAVA context there is the additional complication of a potentially low signal-to-noise ratio, as with (partial) correlations. In that case, the small amount of non-genetic information, and/or the differences in the estimated correlations of the collinear predictors with the outcome, may be largely due to that noise.

Hence the danger in interpreting them. The estimated values seem to suggest (assuming they are individually significant in the model, which they may well not be) that the two predictor traits have strong opposing effects on the outcome trait. Yet the reality is probably that both have moderately positive joint associations with the outcome. Consider what happens if we change the correlations of the two predictor traits with trait 1 to 0.69 (i.e. the average of their current estimates, a change of only about 0.05 for each): their multiple regression parameter estimates will then both be 0.35, removing any suggestion of opposing effects.
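
This sensitivity can be checked with a few lines of R. The sketch assumes the point estimates behave like ordinary standardized regression coefficients computed from a correlation matrix (beta = R_xx^-1 r_xy); `std_betas` is a hypothetical helper, not a LAVA function. With the rho values from the run.bivar output above it reproduces the two values in the warning, and with both outcome correlations set to 0.69 it gives about 0.35 for each predictor.

```r
# Standardized regression coefficients from correlations: beta = solve(R_xx, r_xy).
# Hypothetical helper for illustration; not part of LAVA.
std_betas <- function(r_xy, R_xx) as.vector(solve(R_xx, r_xy))

# Correlation between the two predictors (Trait2, Trait3), from run.bivar.
R_xx <- matrix(c(1,        0.972419,
                 0.972419, 1), nrow = 2, byrow = TRUE)

# Correlations of Trait2 and Trait3 with the outcome Trait1, from run.bivar.
r_xy <- c(0.744738, 0.633734)

observed <- std_betas(r_xy, R_xx)          # the estimates as reported
averaged <- std_betas(c(0.69, 0.69), R_xx) # both correlations set to 0.69

print(round(observed, 3))  # 2.362 -1.663
print(round(averaged, 2))  # 0.35 0.35
```

A shift of roughly 0.05 in each outcome correlation is enough to move the coefficients from 2.36 and -1.66 to 0.35 each, which is why the opposing signs should not be read as opposing effects.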

So in this case, I would probably just report that traits 2 and 3 are both quite strongly correlated with trait 1, but that they are too strongly correlated with each other to reliably tease apart the multivariate relationship.

EPTissink commented 7 months ago

Thanks for your explanation Christiaan! Very clear now