Closed JesseDGibson closed 5 years ago
Hello.
It is difficult for me to know to provide a detailed response without seeing the dndsout object. However, it is certainly possible under some circumstances that the mean of dN/dS ratios per gene is considerably different to the global dN/dS ratios. There are several factors to take into account.
In sparse datasets, dN/dS ratios per gene are noisy point estimates with large uncertainty. This is probably the case in your dataset, because despite dN/dS ratios per gene being nominally different from 1, only one gene is statistically significant. The background mutation rate of a gene in the dNdScv model (and so the denominator in dN/dS) is estimated based on global information (the substitution model across genes) and local information (the number of synonymous mutations in the gene). Imagine, for example, a neutral simulation with a very low number of mutations per gene (e.g. an average of 1 mutation per gene). Because missense sites are more common than synonymous sites, many neutrally evolving genes could have just one or two missense mutations and no synonymous mutations. In that case, many dN/dS values per gene will be high (>1), but the method knows that these are only point estimates and that their deviation from 1 is not unexpected by chance. The global dN/dS ratios will be correctly ~1, since the dataset was simulated neutrally. This simply means that, in such sparse datasets, the actual value of dN/dS ratios per gene are dominated by sampling noise. You can use the geneci function to calculate confidence intervals for gene-level dN/dS ratios.
Another aspect to take into account is that the global dN/dS ratio is not simply the mean of dN/dS ratios per gene. Instead, the global dN/dS ratio uses the sum of all mutations across all genes. This makes them more robust to sparse datasets than the mean of gene-level dN/dS ratios. (Note that, in doing so, longer genes will contribute more to global dN/dS ratios than shorter ones, as global ratios represent dN/dS ratios in the whole coding sequence).
I hope this makes sense.
Best wishes, Inigo
Hi Inigo, I had a quick question about interpreting the output of dndscv. I have some data where after running dndscv, plotting the distribution of per-gene dN/dS values obtained from
dndsout$sel_cv
shows the vast majority (about 75%) of genes are above a value of 1 inw_mis
for instance, with a mean of about 3 but when looking at the global value ofw_mis
, the mle is slightly below 1, with the upper confidence interval bound extending just above. To me this seems contradictory to have most genes well above 1 but a global value so close to it, does this seem like a reasonable output for dndscv? For a bit more information, this is non-human data that I am working with (built reference database myself), and indndsout$sel_cv
only one gene had a 'significant' q-value so maybe that has an influence on this. Any help would be much appreciated, please let me know if there's any other information on my specific use that might be useful!