Increasing the Realism of Our Simulations

One research question we want to address in this project is: Are true trees categorically different from MP trees in a predictable way? If the answer is yes, then it may suggest a way to modify our uniform distribution on MP trees and sample from a more realistic posterior distribution. However, since we do not have the true tree for real data, we can only answer this question for simulated data, which is notoriously unrealistic. We hope to bridge this gap by comparing summary statistics of real and simulated data to identify how our simulations are unrealistic, and then making directed changes to the simulations.

Background

In our simulations, we've observed that the simulated trees have globally similar structure to an MP tree and differ as a result of small local changes. In fact, more than 70% of the nodes in the closest MP tree that aren't in the simulated tree are the result of parallel child mutations (PCMs). PCMs occur when a mutation occurs independently on more than one branch. The MP criterion says that this should never happen, but when there is hypermutation it is likely that the simulated tree will have PCMs.
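As a rough illustration of the definition above, here is a minimal sketch that flags mutations appearing on more than one branch. The input format (a dict from branch label to a list of `(site, parent_base, child_base)` tuples) and the function name are hypothetical, and the sketch uses the broad definition given here without checking how the branches are related in the tree.

```python
from collections import Counter


def count_parallel_mutations(branch_mutations):
    """Return the mutations that occur on more than one branch.

    branch_mutations: hypothetical format, mapping a branch label to the
    list of mutations on that branch, each as (site, parent_base, child_base).
    """
    counts = Counter(
        mutation
        for mutations in branch_mutations.values()
        for mutation in set(mutations)  # count each branch at most once
    )
    return {mutation: n for mutation, n in counts.items() if n > 1}


example = {
    "root->a": [(10, "A", "G")],
    "root->b": [(10, "A", "G"), (42, "C", "T")],
    "b->c": [(7, "T", "C")],
}
parallel = count_parallel_mutations(example)
```

In this toy input, the A-to-G change at site 10 is the only mutation that arises on two branches.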

We would like to extend the claim about the similarity of MP trees and simulated trees to real data. However, real data is uncertain (i.e., the phylogenetic signal is unclear due to a flat posterior) and noisy (i.e., characters may be incorrect or ambiguous). These features of real sequence data make it challenging to perform phylogenetic inference. If we want our claims about the simulated data to generalize to the real data, we should ensure that the levels of uncertainty and noise in our simulations are realistic.

In Issue #12, we found that we can achieve realistic levels of uncertainty in our simulations by tuning hypermutability. We are also working on incorporating a more realistic version of rate variation, where multipliers are determined using a gamma distribution rather than discrete rate categories.
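For reference, continuous gamma rate variation can be sketched as drawing one multiplier per site from a gamma distribution with mean 1. The function name, the shape value `alpha=0.5`, and the choice of scale `1/alpha` are illustrative assumptions, not settings taken from our simulations.

```python
import random


def gamma_rate_multipliers(num_sites, alpha, seed=None):
    """Draw one rate multiplier per site from Gamma(shape=alpha, scale=1/alpha).

    With scale = 1/alpha the multipliers have mean 1, so they rescale
    per-site rates without changing the average rate. Smaller alpha gives
    more variation among sites.
    """
    rng = random.Random(seed)
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(num_sites)]


# alpha=0.5 is an arbitrary illustrative value.
rates = gamma_rate_multipliers(5000, alpha=0.5, seed=0)
average_rate = sum(rates) / len(rates)
```

Each site's base mutation rate would then be multiplied by its entry in `rates`.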

Incorporating Noise

In this issue, we hope to explore the effects of noise on the realism of our simulations and to see whether the similarity of the true tree to the MP tree still holds.

There are two primary ways that noise can enter an MSA:

Error: Bases are sequenced incorrectly (e.g., we sequence an A when the actual base was a T).
Ambiguity: Bases are sequenced ambiguously (e.g., when sequencing we are unsure of the base, so we write N, but the actual base was a T).

One important difference between these two forms of noise is that ambiguities are known noise while errors are not.

We are not aware of any software for introducing sequencing errors and/or ambiguity characters. However, Nicola offers some ideas for how to do this. Here is a quote from our email thread:

Simulating realistic sequence ambiguity I think is relatively easy, I just took the ambiguities from a real alignment and "implanted" them into the simulated alignment post-hoc.

Realistic error simulation is trickier, we don't know with certainty where the errors are in the real alignment, so I just randomly simulated them post-hoc, but at least I tried to use a simulated number of errors per genome close to that estimated from real data.

In the same thread, he provides an example script for how he simulated ambiguities and errors, but it looks unwieldy and hard to read. Perhaps we would be better off doing this from scratch.
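A from-scratch version of the two post-hoc steps Nicola describes might look like the sketch below. This is not his script: the function names are hypothetical, every non-ACGT character in the real alignment (including gaps) is treated as an ambiguity, the two alignments are assumed to have the same shape, and the number of errors per genome is a fixed integer rather than a value drawn from a distribution estimated from real data.

```python
import random

BASES = "ACGT"


def implant_ambiguities(simulated, real):
    """Copy every non-ACGT character of `real` into `simulated`, position by position."""
    if len(simulated) != len(real) or any(
        len(s) != len(r) for s, r in zip(simulated, real)
    ):
        raise ValueError("alignments must have the same shape")
    return [
        "".join(r if r.upper() not in BASES else s for s, r in zip(sim_seq, real_seq))
        for sim_seq, real_seq in zip(simulated, real)
    ]


def add_errors(sequences, errors_per_genome, seed=None):
    """Replace `errors_per_genome` randomly chosen unambiguous bases in each sequence."""
    rng = random.Random(seed)
    noisy = []
    for seq in sequences:
        chars = list(seq)
        # Only unambiguous positions can receive an error.
        candidates = [i for i, c in enumerate(chars) if c in BASES]
        for i in rng.sample(candidates, min(errors_per_genome, len(candidates))):
            chars[i] = rng.choice([b for b in BASES if b != chars[i]])
        noisy.append("".join(chars))
    return noisy


simulated = ["AAAAAAAA", "CCCCCCCC"]
real = ["ACNTACGT", "ACGTRCGA"]
noisy = implant_ambiguities(simulated, real)  # ["AANAAAAA", "CCCCRCCC"]
with_errors = add_errors(noisy, errors_per_genome=2, seed=1)
```

Implanting ambiguities before adding errors keeps the ambiguity pattern identical to the real alignment, since errors are only placed on unambiguous positions.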

Nicola also suggests other summary statistics to compare between the real and simulated data.

Mean and variance of the distribution of parsimony scores across genome positions. These could be useful proxies for describing the distribution of mutation rates across the genome.
Mean and variance of the distribution of the ratios of parsimony scores over the minor allele frequency, again across genome positions. These could help distinguish between recurrent mutations and recurrent errors.

We should use these statistics, along with Parsimony Diversity, to evaluate the realism of our simulated data.
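The two per-site statistics Nicola suggests could be computed along these lines. This is a sketch under several assumptions: the tree is a binary tree given as nested 2-tuples of leaf names, per-site parsimony scores come from the Fitch algorithm, ambiguity characters are not handled, and "minor allele frequency" is taken to be the number of sequences that do not carry the most common base at a site (sites where that count is zero are skipped for the ratio). Parsimony Diversity is not computed here.

```python
from collections import Counter
from statistics import mean, pvariance


def fitch_score(tree, column):
    """Fitch parsimony score of one site; `column` maps leaf name -> base."""
    def visit(node):
        if isinstance(node, str):  # leaf
            return {column[node]}, 0
        (left_set, left_cost), (right_set, right_cost) = visit(node[0]), visit(node[1])
        common = left_set & right_set
        if common:
            return common, left_cost + right_cost
        # Disjoint child sets: one extra mutation is needed.
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


def site_statistics(tree, alignment):
    """Return per-site parsimony scores and score / minor-allele-count ratios."""
    names = list(alignment)
    scores, ratios = [], []
    for i in range(len(alignment[names[0]])):
        column = {name: alignment[name][i] for name in names}
        score = fitch_score(tree, column)
        minor_count = len(names) - max(Counter(column.values()).values())
        scores.append(score)
        if minor_count > 0:
            ratios.append(score / minor_count)
    return scores, ratios


tree = (("a", "b"), ("c", "d"))
alignment = {"a": "AAG", "b": "AAT", "c": "ACG", "d": "ACT"}
scores, ratios = site_statistics(tree, alignment)  # [0, 1, 2] and [0.5, 1.0]
summary = (mean(scores), pvariance(scores), mean(ratios), pvariance(ratios))
```

In the toy alignment, the last site needs two mutations to explain two minor-allele sequences (ratio 1.0), while the middle site needs only one (ratio 0.5).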

matsengrp / hdag-benchmark