matsengrp / hdag-benchmark


Increasing the Realism of Our Simulations with Noise #13

Open williamhowardsnyder opened 1 year ago

williamhowardsnyder commented 1 year ago

Increasing the Realism of Our Simulations

One research question we want to address in this project is: Are true trees categorically different from maximum parsimony (MP) trees in a predictable way? If the answer is yes, it may suggest a way to modify our uniform distribution on MP trees so that we sample from a more realistic posterior distribution. However, since we do not have the true tree for real data, we can only answer this question for simulated data, which is notoriously unrealistic. We hope to bridge this gap by comparing summary statistics of real and simulated data, identifying how our simulations are unrealistic, and making directed changes to them.

Background

In our simulations, we've observed that the simulated trees have globally similar structure to an MP tree and differ as a result of small local changes. In fact, more than 70% of the nodes in the closest MP tree that aren't in the simulated tree are the result of parallel child mutations (PCMs). A PCM occurs when the same mutation arises independently on more than one branch. The MP criterion disfavors this, since each independent occurrence adds to the parsimony score, but when there is hypermutation it is likely that the simulated tree will have PCMs.
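To make the PCM idea concrete, here is a minimal sketch that flags mutations appearing on more than one child branch of the same parent. The tree representation (`children` and `edge_mutations` dictionaries) and the function name `count_pcms` are hypothetical and are not taken from our codebase. The sketch also assumes a PCM means "the same mutation on two or more sibling branches", which may be narrower than the definition we use elsewhere.

```python
from collections import Counter


def count_pcms(children, edge_mutations, root):
    """Return (parent, mutation) pairs where a mutation recurs on sibling branches.

    children: dict mapping a node to the list of its child nodes (assumed format).
    edge_mutations: dict mapping a node to the set of mutations on the branch
        leading to it from its parent (assumed format).
    """
    pcms = []
    stack = [root]
    while stack:
        node = stack.pop()
        kids = children.get(node, [])
        # Count how many child branches carry each mutation.
        counts = Counter(m for kid in kids for m in edge_mutations.get(kid, ()))
        pcms.extend((node, m) for m, c in counts.items() if c > 1)
        stack.extend(kids)
    return pcms
```

For example, if the branches to two children of the root both carry `A10T`, the function reports `("root", "A10T")`.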

We would like to extend the claim about the similarity of MP trees and simulated trees to real data. However, real data is uncertain (i.e., the phylogenetic signal is unclear due to a flat posterior) and noisy (i.e., characters may be incorrect or ambiguous). These features of real sequence data make it challenging to perform phylogenetic inference. If we want our claims about the simulated data to generalize to the real data, we should ensure that the levels of uncertainty and noise are realistic.

In Issue #12, we found that we can achieve realistic levels of uncertainty in our simulations by tuning hypermutability. We are also working on incorporating a more realistic version of rate variation, where multipliers are determined using a gamma distribution rather than discrete rate categories.
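As a rough illustration of continuous rate variation, the sketch below draws one rate multiplier per site from a gamma distribution with mean 1. The function name `sample_site_rates` and the choice of a single shape parameter `alpha` (with scale `1 / alpha`) are assumptions for illustration. This is the common mean-one parameterization, not necessarily the one we will end up using.

```python
import random


def sample_site_rates(num_sites, alpha, seed=None):
    """Draw one gamma-distributed rate multiplier per site.

    Uses shape=alpha and scale=1/alpha so the expected multiplier is 1.
    Smaller alpha gives more rate heterogeneity across sites.
    """
    rng = random.Random(seed)
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(num_sites)]
```

Each site's base mutation rate would then be scaled by its multiplier, rather than by one of a small number of discrete category rates.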

Incorporating Noise

In this issue, we hope to explore the effects of noise on the realism of our simulations and to see whether the similarity of the true tree to the MP tree still holds.

There are two primary ways that noise can enter an MSA:

  1. Error: Bases are sequenced incorrectly (e.g., we sequence an A when the actual base was a T).
  2. Ambiguity: Bases are sequenced ambiguously (e.g., when sequencing we are unsure of the base, so we write N, but the actual base was a T).

One important difference between these two forms of noise is that ambiguities are known noise while errors are not.

We are not aware of any software for introducing sequencing errors and/or ambiguity characters. However, Nicola offers some ideas for how to do this. Here is a quote from our email thread:

Simulating realistic sequence ambiguity I think is relatively easy, I just took the ambiguities from a real alignment and "implanted" them into the simulated alignment post-hoc.

Realistic error simulation is trickier, we don't know with certainty where the errors are in the real alignment, so I just randomly simulated them post-hoc, but at least I tried to use a simulated number of errors per genome close to that estimated from real data.

In the same thread, he provides an example script for how he simulated ambiguities and errors, but it looks unwieldy and hard to read. Perhaps we would be better off doing this from scratch.
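If we do write this from scratch, a minimal sketch of the two post-hoc steps Nicola describes might look like the following. This is not his script. The function names, the set of characters treated as ambiguous, the random pairing of real "donor" sequences with simulated ones, and the fixed integer number of errors per genome are all assumptions made for illustration. It also assumes the real and simulated alignments have the same length.

```python
import random

# IUPAC ambiguity codes treated as "ambiguous" here (an assumption; gaps are ignored).
AMBIGUITY_CODES = set("NRYKMSWBDHV")


def implant_ambiguities(simulated, real, rng):
    """Copy ambiguity characters from randomly chosen real sequences into simulated ones."""
    noisy = []
    for seq in simulated:
        donor = rng.choice(real)
        # Wherever the donor is ambiguous, overwrite the simulated base.
        noisy.append("".join(d if d in AMBIGUITY_CODES else s for s, d in zip(seq, donor)))
    return noisy


def add_errors(sequences, errors_per_genome, rng):
    """Replace a fixed number of unambiguous bases per sequence with a different base."""
    noisy = []
    for seq in sequences:
        chars = list(seq)
        candidates = [i for i, c in enumerate(chars) if c in "ACGT"]
        n = min(errors_per_genome, len(candidates))
        for i in rng.sample(candidates, n):
            chars[i] = rng.choice([b for b in "ACGT" if b != chars[i]])
        noisy.append("".join(chars))
    return noisy
```

A possible usage is `implant_ambiguities(add_errors(sim, 1, rng), real, rng)` with `rng = random.Random(seed)`. The `errors_per_genome` value would need to be chosen to match an estimate from real data, as Nicola suggests.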

Nicola also suggests other summary statistics to compare between the real and simulated data.

We should use these statistics, along with Parsimony Diversity, to evaluate the realism of our simulated data.

matsen commented 1 year ago

A few details from his email:

First, by the minor allele frequency, do you mean the frequency of anything other than the major allele? For example, if a given column is 65% A, 20% C, 10% T, and 5% N, would the MAF be 30%?

I was thinking the MAF would be (20+10)/(20+10+65) but I'm sure there are better ways to define such summary statistics!

Once we have that, I think the statistic you propose is (parsimony score for a given site) / (MAF for that site). Is that correct?

Yes, that's correct, the idea was to get an approximation of the number of descendants per mutation event, since apparent mutations caused by errors would typically have only one descendant (that is, sequence errors are typically interpreted as mutations on terminal branches of the tree). This will not be correct all the time, but it might be good enough. Alternatively one could consider the numbers of terminal branch mutations vs internal branch mutations.
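The per-site statistic discussed above can be sketched as follows, using the MAF definition from the email (minor allele counts divided by all unambiguous counts, so `N` is excluded from the denominator). The function names are hypothetical, and the per-site parsimony score is assumed to be computed elsewhere (e.g., on an inferred tree) and passed in.

```python
from collections import Counter


def minor_allele_frequency(column):
    """Fraction of unambiguous bases in an alignment column that are not the major allele."""
    counts = Counter(c for c in column.upper() if c in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (total - max(counts.values())) / total


def parsimony_per_maf(parsimony_score, column):
    """Per-site ratio (parsimony score) / (MAF); None when the site has no minor allele."""
    maf = minor_allele_frequency(column)
    return parsimony_score / maf if maf > 0 else None
```

For the column in the example above (65% A, 20% C, 10% T, 5% N), this gives an MAF of (20+10)/(20+10+65) = 30/95, or about 31.6%, rather than 30%.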