statOmics / MSqRob

Robust statistical inference for quantitative LC-MS proteomics

General questions on MSqRob #51

Closed antortjim closed 6 years ago

antortjim commented 6 years ago

Hi there!

I have a couple of questions on how MSqRob works:

How does MSqRob minimise the ridge regression cost function?

I was just wondering what engine the package uses to find the betas that minimise the ridge regression cost function. I know the engine running the linear models is lme4, but I don't know much about it. Maybe you could provide a brief explanation. How similar is it to pymc3?

How is the protein-based variance defined? In the paper https://pubs.acs.org/doi/10.1021/pr501223t you explain that MSqRob "makes use of a moderated empirical Bayes variance estimator". This variance is used as part of the model explaining y (the measured MS1 intensity) as a function of the betas, entering through the last, random noise term. But I am not sure how it is computed in the first place. Is it the variance of the intensity measurements of all the protein's detected peptides across runs and treatments?

I understand that by taking the moderated empirical Bayes estimate, the protein variance $\sigma_i^2$ is shrunk toward a common variance, as stated in the paper. As I understood it, this is good because the variance will be poorly estimated in proteins with only a few peptides. But what is the impact of a moderated variance estimate? Does it make it easier for the fold change estimates to become significant?

Thanks in advance for your time, and for developing the tool!!!

Best regards

Antonio

ghost commented 6 years ago

Hi Antonio

You probably meant this paper? http://www.mcponline.org/content/15/2/657.long The paper you refer to is our comparison paper in which MSqRob was not yet mentioned.

In classical ridge regression, the penalty term is estimated via cross validation. However, for fitting our models, we make use of the link between ridge regression and mixed models. Estimating a "best linear unbiased predictor" (BLUP) for a random effect in a mixed model also boils down to a form of penalized regression, where the penalty term is equal to sigma^2/sigma_u^2, with sigma the residual standard deviation and sigma_u the standard deviation of the random effect.
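Schematically, for a model with fixed effects $X\beta$ and a single random effect $u$ with design matrix $Z$ (a sketch of the standard result, not MSqRob-specific notation), the BLUP solves

$$\hat{u} = \arg\min_{u} \; \lVert y - X\beta - Zu \rVert^2 + \frac{\sigma^2}{\sigma_u^2} \lVert u \rVert^2,$$

which has exactly the form of the ridge objective $\lVert y - X\beta \rVert^2 + \lambda \lVert \beta \rVert^2$ with $\lambda = \sigma^2/\sigma_u^2$, except that the penalty is estimated from the data (via REML) instead of by cross validation.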

The engine we use is indeed lme4. How it works is described in detail here: https://arxiv.org/pdf/1406.5823.pdf
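To make this concrete, here is a minimal toy example in R with simulated data (the variable names are hypothetical and this is not MSqRob's exact model formula): lme4 estimates the variance components by REML, and the BLUPs of the random effect are the ridge-like penalized estimates.

```r
library(lme4)

# Simulate log2 peptide intensities: a fixed treatment effect plus a
# random peptide effect (hypothetical names, for illustration only).
set.seed(1)
pepdata <- data.frame(
  peptide   = factor(rep(paste0("pep", 1:6), each = 4)),
  treatment = factor(rep(c("A", "B"), times = 12))
)
pepdata$log2_intensity <- 20 + (pepdata$treatment == "B") * 1 +
  rnorm(6, sd = 0.5)[pepdata$peptide] + rnorm(24, sd = 0.3)

# Fixed treatment effect, random (i.e. penalized) peptide effect.
fit <- lmer(log2_intensity ~ treatment + (1 | peptide), data = pepdata)
fixef(fit)  # fixed-effect estimates (here: the B-vs-A contrast)
ranef(fit)  # BLUPs of the random peptide effects
```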

Since I work in R, I have never used pymc3. I read on its web page that it can use Markov chain Monte Carlo and variational inference algorithms, so I think it is definitely suited to fitting mixed models.

Regarding the empirical Bayes estimator, we do exactly what is described in the limma paper (we even use the squeezeVar function, with a small modification, to perform this shrinkage): https://www.ncbi.nlm.nih.gov/pubmed/16646809 It is also explained in these slides: https://courses.washington.edu/b572/TALKS/2014/AaronBaraff-2.pdf

In brief, the underlying assumption is that the protein variances are not just unrelated quantities, but are somewhat correlated, so we can learn something about the variance of a single protein by looking at the variation across all proteins. This is the core idea of shrinkage estimation: if there is not enough data to reliably estimate a protein's residual variance, its empirical Bayes variance will be close to the common variance; if there are many data points, it will be close to the variance based on that protein's own data.
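As a toy illustration with made-up numbers (plain limma::squeezeVar, i.e. without our small modification):

```r
library(limma)

# Five per-protein residual variances with their residual degrees of
# freedom (invented numbers, for illustration only).
s2 <- c(0.02, 0.10, 0.50, 1.50, 4.00)  # sample variances
df <- c(2, 2, 10, 2, 50)               # residual degrees of freedom
out <- squeezeVar(s2, df)

out$var.prior  # estimated common (prior) variance
out$var.post   # moderated variances: the df = 2 estimates are pulled
               # strongly towards the prior, the df = 50 one barely moves
```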

This approach is especially useful for proteins (models) with very few data points. Instead of estimating the variance on those few data points alone, we stabilize it by shrinking it towards the common variance. The fewer observations a protein has, the more its residual variance will be shrunk towards the common variance. The overall effect is that models with very large residual variances get somewhat smaller variances, and very small variances become somewhat larger. A nice consequence of this approach is that proteins with very few data points, but a variance that is small just by random chance, will end up much lower in your result list (i.e. they will probably no longer be significant).
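Continuing the toy numbers above, and following limma's moderated t construction (the effect sizes and the unit scaling of the standard errors are invented for illustration):

```r
beta  <- c(1.0, 0.8, 0.6, 0.9, 0.3)   # hypothetical log2 fold changes
t_ord <- beta / sqrt(s2)              # ordinary t statistics
t_mod <- beta / sqrt(out$var.post)    # moderated t statistics
df_mod <- df + out$df.prior           # augmented degrees of freedom
p_mod  <- 2 * pt(abs(t_mod), df_mod, lower.tail = FALSE)

# The first protein had a suspiciously small variance on only df = 2:
# after shrinkage its variance grows, its t statistic drops, and it
# falls down the ranking.
```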

We also have a tutorial paper on MSqRob, although it is tailored more towards general statistical concepts and the practical use of MSqRob than towards its actual implementation: https://www.sciencedirect.com/science/article/pii/S1874391917301239

antortjim commented 6 years ago

Hi Ludger

I am sorry it took me 2 weeks to answer. Thank you so so much for such a comprehensive response!!

I have learned a lot :+1: :smile: I got answers to everything I needed to know! I hope the effort you put into your reply can help others too, and not just me.

Thousand thanks!!

Best regards

Antonio

ghost commented 6 years ago

Hi Antonio

You're most welcome! :) I will close this issue for now then.

Best regards

Ludger