im3sanger / dndscv

dN/dS methods to quantify selection in cancer and somatic evolution
GNU General Public License v3.0

Interpretation of per-gene vs global dN/dS values #37

Closed JesseDGibson closed 5 years ago

JesseDGibson commented 5 years ago

Hi Inigo, I had a quick question about interpreting the output of dndscv. After running dndscv on my data, plotting the distribution of per-gene dN/dS values from dndsout$sel_cv shows that the vast majority (about 75%) of genes have a w_mis value above 1, with a mean of about 3. However, the global w_mis MLE is slightly below 1, with the upper bound of its confidence interval extending just above it. It seems contradictory to have most genes well above 1 but a global value so close to it; does this look like a reasonable output for dndscv? For a bit more context, this is non-human data (I built the reference database myself), and only one gene in dndsout$sel_cv had a significant q-value, so maybe that has an influence. Any help would be much appreciated; please let me know if there is any other information about my use case that might be useful!

im3sanger commented 5 years ago

Hello.

It is difficult for me to provide a detailed response without seeing the dndsout object. However, it is certainly possible under some circumstances for the mean of per-gene dN/dS ratios to differ considerably from the global dN/dS ratios. There are several factors to take into account.

In sparse datasets, per-gene dN/dS ratios are noisy point estimates with large uncertainty. This is probably the case in your dataset: despite the per-gene dN/dS ratios being nominally different from 1, only one gene is statistically significant. The background mutation rate of a gene in the dNdScv model (and hence the denominator in dN/dS) is estimated from global information (the substitution model across genes) and local information (the number of synonymous mutations in the gene).

Imagine, for example, a neutral simulation with a very low number of mutations per gene (e.g. an average of 1 mutation per gene). Because missense sites are more common than synonymous sites, many neutrally evolving genes could have just one or two missense mutations and no synonymous mutations. In that case, many per-gene dN/dS values will be high (>1), but the method knows that these are only point estimates and that their deviation from 1 is not unexpected by chance. The global dN/dS ratios will be correctly ~1, since the dataset was simulated neutrally. This simply means that, in such sparse datasets, the per-gene dN/dS ratios are dominated by sampling noise. You can use the geneci function to calculate confidence intervals for gene-level dN/dS ratios, as sketched below.
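A minimal sketch of the geneci call, assuming dndsout is the object returned by dndscv on your mutation table (see the package documentation for the exact output columns):

```r
library(dndscv)

# dndsout = dndscv(mutations)  # run dNdScv first on your mutation table

# Per-gene confidence intervals for the dN/dS ratios; this can be slow
# on large gene sets, as intervals are computed gene by gene.
ci = geneci(dndsout)
head(ci)  # MLEs and CI bounds per gene; wide intervals flag noisy estimates
```

If the intervals for most of your genes comfortably include 1, the high point estimates you are seeing are consistent with sampling noise rather than selection.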

Another aspect to take into account is that the global dN/dS ratio is not simply the mean of the per-gene dN/dS ratios. Instead, the global dN/dS ratio uses the sum of all mutations across all genes, which makes it more robust to sparse data than the mean of gene-level dN/dS ratios. (Note that, as a result, longer genes contribute more to global dN/dS ratios than shorter ones, since the global ratios describe dN/dS across the whole coding sequence.) The toy calculation below illustrates how the two summaries can diverge.
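As a toy numeric sketch (hypothetical counts, ignoring the per-site normalisation that dNdScv actually performs), the mean of per-gene ratios and the pooled ratio can differ substantially:

```r
# Hypothetical missense (n) and synonymous (s) mutation counts for two genes:
# a short, sparsely mutated gene A and a long, heavily mutated gene B.
n = c(2, 90)  # missense mutations in genes A and B
s = c(1, 90)  # synonymous mutations in genes A and B

mean(n / s)      # mean of per-gene ratios: (2/1 + 90/90) / 2 = 1.5
sum(n) / sum(s)  # pooled, global-style ratio: 92 / 91 ~ 1.01
```

The sparse gene inflates the mean of per-gene ratios, while the pooled estimate is dominated by the gene with most mutations and stays near 1.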

I hope this makes sense.

Best wishes, Inigo