matsengrp / hdag-benchmark

0 stars 0 forks source link

Parsimony Diversity Exploration #12

Open williamhowardsnyder opened 11 months ago

williamhowardsnyder commented 11 months ago

Parsimony Diversity Exploration

We've been exploring the parsimony diversity of our simulated data, and trying to understand how various parameters of the simulation affect this summary statistic. We find that overall, parsimony diversity increases with the amount of hypermutation. And, to achieve realistic levels of parsimony diversity in simulated data, it is sufficient to simulate with 1% of sites hypermutable at a rate betwen of 150-200.

This post summarizes these results and describes the experiment we used to determine it.

Parsimony Diversity

Parsimony diversity (PD) is a measure of the number of maximally parsimonious topologies (MP) and how different those topologies are on a given set of sequences. Although we will likely never be able to find all the MP trees on a large dataset, Larch provides an efficient way of exploring the set of MP trees and compactly representing them. So, when we refer to PD in the following sections, we mean PD as found by Larch, which will almost always underestimate the true PD. However, when comparing the PD for different simulations, we would not expect Larch to underestimate more for one than another.

The hDAG cannot efficiently represent MP topologies, but rather represents MP histories, which are topologies that are internally labelled. To actually measure PD, we will instead look at clades among MP trees. The number of such clades should give us a good sense of the PD. Throughout the rest of this post, "PD for a given dataset" will refer to the number of clades among MP trees.

TL;DR: We measure Parsimony Diversity by looking at the number of clades among MP trees.

Experimental Description

We simulate sequences using the following procedure:

  1. Extract a topology on a given clade from the USHER tree.
  2. Use phastSim to simulate mutations along the branches of that topology.

This procedure was used in the MAPLE paper.

Our first attempt at simulation used the phastSim settings from the MAPLE paper which were:

We noticed that when simulating without hypermutation, the resulting sequences give rise to much lower PD than what we see for the real data. Additionally, by increasing the amount of hypermutation in phastSim we can increase parsimony diversity to more realistic levels.

To evaluate this effect more thoroughly, we simulate sequence datasets using 8 different UShER seed topologies (varying from 200-600 leaves) and 6 different hypermutation settings: each site has a 1% probability of being hypermutable at a rate of 0, 50, 100, 150, 200, 250. For each clade-rate pair, we conduct 5 trials.

Results

The Effect of Hypermutation

We can see the effect of hypermutation on PD in the box plot below. The x-axis is hypermutation rate and the y-axis is the increase in PD from real to simulated data (i.e., PD in simulation divided by PD in real data).

We can see that without hypermutation, the average PD of simulations is almost 50% of what we see in the real data. As we increase the hypermutation rate, the PD of simulated data compared to real data generally increases. At a hypermutation rate of 200 the PD of the simulated data is on average about the same as the PD on the real data.

A Potential Confounder: Number of Tips

We also compared the number of tips in the simulated data versus the real data. In the boxplot below, the x-axis is the same as before and the y-axis is the proportion increase in number of tips from simulated to real data.

We can see that on average the simulated datasets have about 80% of the sequences as the real dataset regardless of hypermutation level. This is because there are more duplicate sequences in the simulated data, and we only use unique sequences for our dataset. Since PD generally increases for larger clades, we would expect the simulated data to have less PD than the corresponding real data, just because it is smaller. To account for this, we can divide the number of nodes in the simulated dataset by the number of simulated tips and similarly for the real data before comparing the two, like so

$$ \Large{ y = \frac{\text{PD of simulation}}{\text{PD of real data}} \cdot \frac{\text{Number of tips in real data}}{\text{Number of tips in simulation}} } $$

Judging by the plot above, the correction factor will be between 1 divided by 0.65 and 0.9, which is in the range [1.1, 1.5]. With this correction, we're measuring something like the difference in PD per leaf. The boxplot from earlier with this correction can be seen below:

The trend is similar to before, although now the PD per leaf with 0 hypermutation is about 70% on average rather than 50%.

Admittedly, it is not totally clear whether a linear correction is the right thing to do here. We'd like something that scales the number of clades in MP trees appropriately with the difference in the number of tips. Since the number of clades scales exponentially with the number of tips (i.e., number of clades is the number of powersets of tips) perhaps an exponential correction would be better. We will stick with this correction for now, and are happy to field suggestions for a better one.

Per Clade Analysis

We also noticed that the effect of hypermutation varies significantly across different UShER clades. In the box plot below, rather than aggregating all of y-values into a single boxplot, we group by clade and hypermutation, average the PD results across the 5 trials (this uses the tips correction from the previous part), and produce a scatterplot.

For almost all clades, the trend is that hypermutation increases PD, although the difference in effect is extreme for different clades. For example, in clade AY.34.2, when we simulate with 0 hypermutation the PD in our simulations is less than 25% of what we see in the real data. It peaks at about 50% at a hypermutation rate of 200. Then there's clade AY.108 which has a PD that goes from 70% at a rate of 0 to 800% at a rate of 200. The rest of the clades see an increase somewhere between these two extremes.

At hypermutation of 150 and 200 about half the clades are above the 1 line, which indicates that this is a reasonable amount of hypermutation for generating data that has a realistic amount of parsimony diversity (for the clades that we've examined at least).

matsen commented 11 months ago

Great work, @williamhowardsnyder !

I agree that the linear correction might not be exactly what's desired here. On the other hand the conclusion is that we need a significant amount of hypermutation and I imagine that conclusion is clear irrespective of the details.

Perhaps an alternative would be to downsample the real data to the same count as you get out of the simulation and do your comparison between those two? The alternative would be to try to simulate more sequences, but IIUC that would require adding to the simulated tree.