im3sanger / dndscv

dN/dS methods to quantify selection in cancer and somatic evolution
GNU General Public License v3.0
Can you explain mrfold? #53

Closed njbernstein closed 4 years ago

njbernstein commented 4 years ago

Hi there,

I was reading your paper and code, and don't quite understand the following bit:

# a. Neutral model: wmis==1, wnon==1, wspl==1
mrfold = sum(y[1:4])/sum(y[5:8]) # Correction factor of "t" based on the obs/exp ratio of "neutral" mutations under the model
ll0 = sum(dpois(x=x$N, lambda=x$L*mutrates*mrfold*t(array(c(1,1,1,1),dim=c(4,numrates))), log=T)) # loglik null model

I don't think this mrfold correction is explained in the paper.

Why does mutrates need to be corrected? t is mutrates, correct? And why is mrfold different for each hypothesis? Is this why you need to test w_syn == 1 for each hypothesis? Otherwise, it seems redundant.

im3sanger commented 4 years ago

Hello,

Thank you for your interest in dndscv.

mutrates is a vector with the estimated substitution rates per available site for each of the 192 possible trinucleotide changes. They are fixed averages across genes. However, when calculating dN/dS ratios for a given gene, we need to adjust these rates according to the estimated background mutation rate of the gene, which is done using the "t" parameter described in the Suppl material of the paper.

In the dNdSloc model, "t" is calculated based on the observed number of "neutral" mutations (i.e. synonymous mutations when using the unconstrained model, or synonymous mutations plus those mutation types set to w==1 in the constrained models). This is why mrfold differs between hypotheses: each one treats a different set of mutation types as neutral. In the dNdScv model, t_opt takes into account both the observed number of "neutral" mutations in the gene (Poisson observations) and the local covariates (Gamma distribution from the negative binomial regression), as described in the Suppl material of the paper. The mrfold factor is simply a way to calculate the "t" parameter for each gene under both models. "mrfold * t" in the dndscv code is equivalent to the description of the "t" parameters in the Suppl material of the paper.
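
To make the dNdSloc case concrete, here is a minimal sketch in R of the neutral-model calculation from the quoted snippet. It is a toy illustration, not the package's code: every number is made up, 3 rate classes stand in for the 192, and the layout of `y` (observed counts for syn, mis, non, spl in positions 1:4, expected counts in 5:8) is an assumption inferred from the indexing in the question. Note that `t(...)` in the quoted line is R's matrix transpose, not the "t" parameter.

```r
# Toy example: all values are hypothetical.
numrates <- 3                        # stands in for the 192 trinucleotide changes
mutrates <- c(1e-3, 2e-3, 5e-4)      # made-up global rates per available site

# Made-up per-gene matrices: rows = rate classes, columns = syn, mis, non, spl
L <- matrix(c(300, 900, 40, 10,
              200, 700, 30,  8,
              250, 800, 35,  9), nrow = numrates, byrow = TRUE)  # available sites
N <- matrix(c(1, 2, 0, 0,
              0, 3, 0, 0,
              1, 1, 0, 0), nrow = numrates, byrow = TRUE)        # observed mutations

# Assumed layout of y: observed counts (1:4), then expected counts under the
# global rates (5:8). L * mutrates scales row i of L by mutrates[i].
y <- c(colSums(N), colSums(L * mutrates))

# Neutral model: all four classes count as neutral, so the gene-level
# correction is the overall observed/expected ratio.
w <- t(array(c(1, 1, 1, 1), dim = c(4, numrates)))  # transpose: numrates x 4 matrix of w values
mrfold <- sum(y[1:4]) / sum(y[5:8])
lambda <- L * mutrates * mrfold * w
ll0 <- sum(dpois(x = N, lambda = lambda, log = TRUE))

# A constrained model that fixes only wmis == 1 would treat synonymous and
# missense mutations as neutral, so the ratio uses just those two classes.
mrfold_mis <- sum(y[c(1, 2)]) / sum(y[c(5, 6)])
```

After the rescaling, the total expected count under the neutral model, `sum(lambda)`, equals the total observed count, `sum(N)`. The second ratio shows how the correction changes when a different set of classes is treated as neutral.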

I realise that this is not easy to explain here, but I hope this makes some sense.

Best, Inigo