pcarbo / varbvs

Large-scale Bayesian variable selection for R and MATLAB.
http://pcarbo.github.io/varbvs
GNU General Public License v3.0

Allow passing weights to varbvs() #18

Open timflutre opened 6 years ago

timflutre commented 6 years ago

Hi Peter,

In practice, it is common to perform a two-stage analysis:

  1. get one BLUP per individual via a somewhat sophisticated mixed model (e.g. to account for spatial and/or temporal covariance in errors);

  2. use them in a regression on marker genotypes to find QTLs.

However, some individuals may have more records than others (due to missing data or to a specific experimental design), and this is reflected in the variance of the BLUPs. One then often wants to use these variances as weights in stage 2 (see the weights option in stats::lm and lme4::lmer, for instance; it is also available in GEMMA).
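For concreteness, here is a minimal sketch of such a weighted stage-2 regression with stats::lm, using the precision (inverse variance) of each BLUP as the weight; the data here are simulated and all object names are illustrative, not part of varbvs:

```r
set.seed(1)
n        <- 100
geno     <- rbinom(n, 2, 0.3)                      # marker genotypes (0/1/2)
blup_var <- runif(n, 0.5, 2)                       # variance of each individual's BLUP
blup     <- 0.4 * geno + rnorm(n, sd = sqrt(blup_var))

## Stage 2: regress the BLUPs on the marker, weighting each individual by
## the precision of its BLUP (the stats::lm convention for "weights").
fit <- lm(blup ~ geno, weights = 1 / blup_var)
summary(fit)
```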

Could a weights option be implemented in varbvs?

Thanks in advance, Tim

pcarbo commented 6 years ago

@timflutre Yes, I think this is possible, although I would have to redo most of the derivations, so it isn't trivial. Is this motivated by a specific project you are working on?

timflutre commented 6 years ago

Basically everybody around me is doing such two-stage analyses, so the literature is quite large. Moreover, here is a recent preprint investigating this. It is easy to see why this scenario is so common: fitting a mixed model with "sophisticated" error covariances and a large (huge) number of genetic markers as predictors is impossible with current software, especially when one wants to perform variable selection on the markers, hence the analysis being done in two stages.

I thought it would be as simple as going from ordinary least squares to WLS (weighted least squares), given that the weights are known. But you are in a much better position than I am to assess the impact on varbvs!

pcarbo commented 6 years ago

@timflutre Thanks for the background. It isn't conceptually or technically difficult to do, but it would involve reworking a substantial portion of the code. I will keep this in mind and return to it when I have a chance.

timflutre commented 6 years ago

Ok, no worries, thanks!

aksarkar commented 6 years ago

@timflutre If the weights are w_1, ..., w_n, I think you can premultiply X (markers) and Y (BLUP) by diag(sqrt(w_1), ..., sqrt(w_n)) and apply varbvs to the transformed data (based on properties of GLS).

pcarbo commented 6 years ago

@aksarkar Thanks for this suggestion. So perhaps it is simpler to implement than I thought. Do you happen to know if the likelihood is correct after this transformation? I haven't worked through the math, but it seems like there may be an additional factor that needs to be included in the likelihood after the transformation.

aksarkar commented 6 years ago

The assumption in GLS is y ~ N(X beta, Sigma). Letting LL* = Sigma, L^-1 y ~ N(L^-1 X beta, I). So the likelihood should be correct.

The assumption in WLS is that Sigma is diag(w_1, ..., w_n), so L is diag(sqrt(w_1), ..., sqrt(w_n)). So my suggestion above is not quite correct (take the reciprocal of the diagonal entries).
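In R, this whitening might look like the following (a minimal sketch assuming w holds the per-individual BLUP variances, the usual varbvs(X, Z, y) interface, and no extra covariates; X, y and w are illustrative objects, not part of varbvs):

```r
library(varbvs)

## Whitening: with Sigma = diag(w), L = diag(sqrt(w)), so premultiply X and y
## by L^-1, i.e. scale row i of X and element i of y by 1 / sqrt(w[i]).
s      <- 1 / sqrt(w)                  # w = per-individual BLUP variances
X_star <- sweep(X, 1, s, "*")          # scale each row of X
y_star <- s * y                        # scale y the same way

## Fit varbvs on the whitened data.
fit <- varbvs(X_star, Z = NULL, y = y_star, family = "gaussian")
```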

pcarbo commented 6 years ago

@aksarkar Do you need this option as well?

pcarbo commented 6 years ago

@timflutre @aksarkar varbvs version 2.5-2 now has a weights option (for family = "gaussian" only), as well as a resid.vcov option to allow for residuals with a specified covariance matrix (e.g., a kinship matrix).
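For later readers, usage is along these lines (a sketch assuming the weights follow the stats::lm convention of inverse variances; see ?varbvs in version 2.5-2 or later for the exact semantics; X, y, blup_var and K are illustrative objects):

```r
library(varbvs)

## Weighted analysis (family = "gaussian" only): one weight per individual.
fit_w <- varbvs(X, Z = NULL, y = y, family = "gaussian",
                weights = 1 / blup_var)

## Or supply a full residual covariance matrix, e.g. a kinship matrix K
## (possibly scaled by a variance component).
fit_k <- varbvs(X, Z = NULL, y = y, family = "gaussian",
                resid.vcov = K)
```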

timflutre commented 6 years ago

Thanks @pcarbo !

However, when I run varbvs with weights, it looks like the summary function no longer reports the proportion of variance explained (PVE). Is this intentional?

pcarbo commented 6 years ago

@timflutre Yes, I haven't yet updated the PVE calculations to allow for weighted samples. Do you need the PVE estimates?

timflutre commented 6 years ago

That would be great! I would expect the PVE to increase when weights are used. The PVE is also interesting for getting a sense of whether the available SNPs are numerous enough, and well enough distributed, to tag all LD blocks. In my case I am not sure of that, because I only have 10k SNPs and the pairwise LD decays quickly even over short distances. I am currently genotyping more SNPs by sequencing, so I would expect the PVE to increase when using, say, 60k SNPs compared to only 10k.

pcarbo commented 6 years ago

@timflutre Okay, I've reopened this issue and will work out the PVE calculations for weighted samples.
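For reference, a rough back-of-the-envelope notion of PVE and one possible weighted analogue (this is only an illustration of the idea, not the calculation implemented in varbvs; w denotes precision weights, i.e. inverse BLUP variances):

```r
## Unweighted: proportion of phenotypic variance captured by the fitted
## genetic values g = X %*% beta.
pve_plain <- function (X, beta, y) {
  g <- drop(X %*% beta)
  var(g) / var(y)
}

## Weighted analogue: replace var() by precision-weighted variances.
pve_weighted <- function (X, beta, y, w) {
  g  <- drop(X %*% beta)
  wv <- function (v) {
    m <- sum(w * v) / sum(w)           # weighted mean
    sum(w * (v - m)^2) / sum(w)        # weighted variance
  }
  wv(g) / wv(y)
}
```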