matsengrp / gctree

GCtree: phylogenetic inference of genotype-collapsed trees
https://matsengrp.github.io/gctree
GNU General Public License v3.0

scons and nestly for pipeline and validation #12

Closed wsdewitt closed 7 years ago

wsdewitt commented 7 years ago

away with history.bash

wsdewitt commented 7 years ago

Ideas for simulation and validation

gctree.py currently has subprograms to perform inference from data (likelihood ranking of parsimony trees from phylip's dnapars) and simulation using a branching model along with context dependent mutation (S5F). The simulation produces simulated data and a true tree structure. Inference uses only the data, and we might ask how close (in terms of RF or SPR moves) the inferred tree (highest rank parsimony tree) is to the true tree.

A single simulation ➡️ inference run results in one true tree and a set of likelihood ranked parsimony trees.

A set of simulation ➡️ inference runs with fixed parameters results in a set of true trees (each a realization from the same underlying process) and a ranked list of parsimony trees corresponding to each. The number of parsimony trees will vary in general.
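For concreteness, here's what an RF comparison amounts to. This is a pure-Python toy over split sets with made-up splits, not gctree's implementation (which works on dnapars output): each unrooted tree is summarized by its nontrivial bipartitions, and RF counts the splits present in one tree but not the other.

```python
# Toy Robinson-Foulds distance: each tree is a set of nontrivial bipartitions
# (splits), and RF is the size of the symmetric difference of the two sets.
# Splits are assumed recorded from a consistent side of each edge.

def rf_distance(splits_a, splits_b):
    """Robinson-Foulds distance between two trees given as sets of splits."""
    return len(splits_a ^ splits_b)

# Each frozenset names the taxa on one side of an internal edge (hypothetical).
true_tree = {frozenset({"A", "B"}), frozenset({"A", "B", "C"})}
inferred = {frozenset({"A", "B"}), frozenset({"C", "D"})}

print(rf_distance(true_tree, inferred))  # 2: one split unique to each tree
```

In practice one would use a tree library (e.g. ete3's `robinson_foulds`) rather than building split sets by hand.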

Questions

matsen commented 7 years ago

This all sounds great.

How do we integrate the likelihood and distance results across simulations? Can we just plop them on the same axis and do correlation analysis? Maybe we only deal with ranks of distance and likelihood?

I think by correlation analysis here you mean regression? Recall that these distances are already integral, so ranking perhaps isn't necessary?

wsdewitt commented 7 years ago

Regression's good; then explained variance in distance to the true tree is the metric? With correlation I had imagined a significant negative correlation of distance with log likelihood as evidence that our likelihood ranking is doing a thing.

For each simulated true tree, we have a set of parsimony trees, each of which has some distance to the true tree, so we could think about ranking them by proximity to the true tree. We also have the likelihood ranks, so then rank correlation comes to mind. In any case, it's not immediately obvious to me how to combine results of many simulations in a single regression/correlation analysis, but I think we want to do that somehow. Maybe there's an argument that simply aggregating them renders correlation/regression conservative?
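One naive way to "plop them on the same axis": pool the (log likelihood, distance) pairs from every simulation and fit a single regression line. A minimal sketch with made-up numbers (in practice one would worry about per-simulation baselines before pooling):

```python
def ols_slope(x, y):
    """Slope of a least-squares fit of y on x, over the pooled points."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    return cov / var

# Hypothetical pooled points: per-tree log likelihoods and distances to truth.
loglik = [-15.0, -14.0, -13.0, -12.0]
dist = [7, 5, 3, 1]
print(ols_slope(loglik, dist))  # -2.0: distance falls as likelihood rises
```

A significantly negative slope would be the regression analogue of the negative correlation described above.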

matsen commented 7 years ago

This is great, and the kind of discussion I was hoping to have. Let's chat IRL at the next opportunity. Thanks!

matsen commented 7 years ago

After doing the base validation, I think that comparing to BEAST might be an interesting next step. Here I guess we would take a posterior distribution and then collapse each tree, to get a posterior on collapsed trees, and compare those trees to the simulated trees?
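A sketch of what collapsing means here, using a toy node-to-parent encoding (not BEAST's or gctree's actual data structures): merge each node whose genotype matches its parent's into that parent, summing abundances. Applied to every tree in a posterior sample, this would give the posterior over collapsed trees.

```python
# Genotype-collapse a single tree: merge any node whose sequence equals its
# parent's into that parent, accumulating observed abundance. Toy encoding:
# node -> parent map, plus per-node sequence and abundance dicts.

def collapse(parent, seq, abundance):
    def rep(node):
        # Walk up to the oldest ancestor sharing this node's genotype.
        while parent.get(node) is not None and seq[parent[node]] == seq[node]:
            node = parent[node]
        return node

    new_abund, new_parent = {}, {}
    for node in parent:
        r = rep(node)
        new_abund[r] = new_abund.get(r, 0) + abundance[node]
        new_parent[r] = rep(parent[r]) if parent[r] is not None else None
    return new_parent, new_abund

# Hypothetical tree: x and z share the root genotype, y is a mutant.
parent = {"root": None, "x": "root", "y": "x", "z": "x"}
seq = {"root": "AAA", "x": "AAA", "y": "AAT", "z": "AAA"}
abundance = {"root": 1, "x": 2, "y": 1, "z": 3}
p, a = collapse(parent, seq, abundance)
print(a)  # {'root': 6, 'y': 1}
```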

wsdewitt commented 7 years ago

Here is a basic validation plot. I did a simulation using the branching model and S5F, then dnapars gave 38 maximally parsimonious trees. The plot shows RF distance to the true tree vs GCtree likelihood. Observations:

wsdewitt commented 7 years ago

I've added some functionality to gctree --validate that uses @matsen's idea of a matrix of hamming distances of the true and inferred MRCA of the taxa. In the attached plot each point is one of the equally parsimonious trees from a single simulation. The top panel shows how the sums of these matrices compare to the GCtree likelihood for the corresponding tree. The bottom panel uses Robinson-Foulds distance.
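Roughly, the metric works like this (a toy sketch assuming trees encoded as node-to-parent maps with per-node sequences; the actual `gctree --validate` code may differ): for each pair of taxa, compare the MRCA's sequence in the true tree against the MRCA's sequence in the inferred tree by hamming distance, and sum over pairs.

```python
from itertools import combinations

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def mrca(parent, u, v):
    """Most recent common ancestor in a tree given as a node -> parent map."""
    ancestors = set()
    while u is not None:
        ancestors.add(u)
        u = parent[u]
    while v not in ancestors:
        v = parent[v]
    return v

def mrca_score(true_parent, true_seq, inf_parent, inf_seq, taxa):
    """Sum over taxon pairs of hamming(true MRCA seq, inferred MRCA seq)."""
    return sum(
        hamming(true_seq[mrca(true_parent, u, v)],
                inf_seq[mrca(inf_parent, u, v)])
        for u, v in combinations(taxa, 2)
    )

# Hypothetical true and inferred trees over taxa A, B, C.
true_parent = {"A": "n1", "B": "n1", "C": "r", "n1": "r", "r": None}
true_seq = {"A": "AAT", "B": "ACT", "C": "AAA", "n1": "AAT", "r": "AAA"}
inf_parent = {"A": "m1", "C": "m1", "B": "r", "m1": "r", "r": None}
inf_seq = {"A": "AAT", "B": "ACT", "C": "AAA", "m1": "AAC", "r": "AAA"}
print(mrca_score(true_parent, true_seq, inf_parent, inf_seq, ["A", "B", "C"]))  # 2
```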

[figure: gctree validation]

wsdewitt commented 7 years ago

And here's a plot aggregating results from 100 simulations (each point in a panel is a simulation). [figure: validaggreg] The x axis is the size of the parsimony forest that phylip returned. The y axes are as follows:

GCtree is doing a thing to the extent that the delta rank plots tend to be positive. Admittedly, this is a bit goofy and crude, since I'm only using the best and worst ranked trees from each parsimony forest (there's more information in the complete rank data).

matsen commented 7 years ago

This is great, Will!

matsen commented 7 years ago

@WSDeWitt You know, you could make this less "goofy" by comparing the score of the best one to the average of the others. This basically answers the question "how much better do we do by incorporating the model rather than picking one of the MP trees at random?"

Right?

wsdewitt commented 7 years ago

I guess I expect that to be less powerful, since the average score will in general be higher than the min likelihood's score—it mixes in trees across the whole likelihood range. To use all the data, I was thinking of trying a spearman rho for correlation of ranks between RF (or MRCA) and likelihood, and I could color the points by p value (which I expect will show higher significance for larger forest sizes).
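For reference, Spearman's rho is just Pearson correlation computed on ranks; in practice `scipy.stats.spearmanr` gives both the statistic and the p value, but the computation (ignoring ties) looks like this, with made-up numbers:

```python
def spearman_rho(x, y):
    """Spearman rank correlation, assuming no ties (use scipy.stats.spearmanr
    in practice; this just shows the computation)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n**2 - 1))

# Hypothetical forest: per-tree log likelihoods and RF distances to truth.
loglik = [-10.1, -12.3, -11.0, -15.2]
rf = [2, 6, 4, 8]
print(spearman_rho(loglik, rf))  # -1.0: perfectly anticorrelated
```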

matsen commented 7 years ago

When I glanced at the plots you were making, it sure seemed to me that there was no signal in the likelihood in its lower quantile for estimating distance. I wonder if this would noisify your rho.

matsen commented 7 years ago

Have you updated your simulation regime? Again staring at the plots and coming up with what seems like a good stats model seems like the way forward. Perhaps just cooking up a PDF flip-book of plots?

wsdewitt commented 7 years ago

Good point about noise, so maybe Pearson on the raw values (not ranks)?

I did fiddle with params to target a number of alleles (49) and total cells (70) similar to the Tas data, and am now getting more parsimony trees. Here's a zip of the 100 simulated trees.

wsdewitt commented 7 years ago

I played around with aggregation based on the quantile of the MLE tree, but this was clunky because of the quantile boundary when there are only a few trees (say two). I found that a more direct visualization is to simply count the total number of trees that are at least as good (by RF or MRCA) as the MLE tree (inclusive). Here's a new aggregation plot, with the line indicating the null expectation. [figure: validaggreg] It's still not the clearest figure, but I think it's getting closer.
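The statistic itself is simple; here is a sketch with a hypothetical forest. If the MLE tree were a random draw from a forest of N trees with distinct distances, the null expectation of this count would be (N + 1) / 2.

```python
def n_at_least_as_good(distances, mle_index):
    """Number of trees in the forest (including the MLE tree itself) whose
    distance to the true tree is <= the MLE tree's distance."""
    return sum(d <= distances[mle_index] for d in distances)

# Hypothetical forest of 5 parsimony trees; the MLE tree is index 0.
rf = [2, 4, 6, 2, 8]
print(n_at_least_as_good(rf, 0))  # 2: the MLE tree itself plus one tied tree
# Null expectation if the MLE tree were a random pick (distinct distances):
print((len(rf) + 1) / 2)  # 3.0
```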

wsdewitt commented 7 years ago

Err, that y axis was screwy; this one's right, I think. [figure: validaggreg]

matsen commented 7 years ago

Nice work!

You might consider a hexbin plot because it looks like you're getting a lot of overplotting on the x axis.

Also, I'd suggest ax.set_aspect('equal') even if you prefer a scatter.
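A minimal sketch of both suggestions, with synthetic data standing in for the forest-size scatter:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(1, 20, size=500)       # discrete forest sizes -> overplotting
y = x + rng.normal(0, 2, size=500)

fig, ax = plt.subplots()
# hexbin shows bin counts instead of overplotted points; mincnt=1 hides
# empty bins so the background stays white.
hb = ax.hexbin(x, y, gridsize=15, mincnt=1)
ax.set_aspect("equal")                  # same scale on both axes
fig.colorbar(hb, ax=ax, label="count")
```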

wsdewitt commented 7 years ago

I've set up some outer aggregation in nestly that makes summary stats of simulation/inference results at various simulation parameter values. As a prototype, here's a plot showing how various quantities vary with different baseline mutation rates (lambda_0). [figure: aggregation] At each parameter level I did just 10 simulations, so I don't think this is telling us anything yet. I'm going to rerun with 100 sims at each point and a denser grid on lambda_0.
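For readers unfamiliar with nestly: its job here is to build a nested directory of control files over the parameter grid, which scons then runs. A rough pure-Python analogue of that step (parameter names and values below are made up, not the actual pipeline's):

```python
import json
import os
import tempfile
from itertools import product

# One directory per parameter combination, each holding a control.json that
# the simulation/inference scripts would read (hypothetical parameters).
params = {
    "lambda0": [0.1, 0.25, 0.5],  # baseline mutation rate
    "N": [100],                   # total cells
}
outdir = tempfile.mkdtemp()
rundirs = []
for combo in product(*params.values()):
    ctl = dict(zip(params, combo))
    d = os.path.join(outdir, *(f"{k}_{v}" for k, v in ctl.items()))
    os.makedirs(d, exist_ok=True)
    with open(os.path.join(d, "control.json"), "w") as f:
        json.dump(ctl, f)
    rundirs.append(d)
print(len(rundirs))  # 3 parameter combinations
```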

wsdewitt commented 7 years ago

Here's an aggregation figure from a more exhaustive run. All these simulations were done with the same progeny distribution (Poisson with lambda=1) and 100 total cells (using Erick's idea to downsample at the first generation at which the population exceeds 100). For each of 10 mutation rates (x axis), 100 simulations were done. It took a surprisingly long time to get the split axis on the top panel to look right. The other three panels currently reference the inferred GCtree tree, but I think I want to make a version of this figure using the true tree instead; that would characterize just the simulation, rather than any inference. [figure: aggregation]
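The population-size part of this regime can be sketched as follows (my own toy version, not the gctree simulator): a critical branching process with Poisson(1) offspring, stopped and downsampled at the first generation exceeding the target size. Critical processes often die out, so many seeds return nothing.

```python
import math
import random

def simulate_population(lam=1.0, target=100, seed=1):
    """Branching process: each cell leaves Poisson(lam) offspring per
    generation. Downsample to `target` at the first generation exceeding it;
    return None if the lineage dies out first."""
    rng = random.Random(seed)

    def poisson(lam):
        # Knuth's method; the stdlib random module has no Poisson sampler.
        L, k, p = math.exp(-lam), 0, 1.0
        while p > L:
            p *= rng.random()
            k += 1
        return k - 1

    pop = 1
    while 0 < pop <= target:
        pop = sum(poisson(lam) for _ in range(pop))
    return target if pop > target else None

# Most seeds die out before reaching 100 cells; survivors are downsampled.
sizes = [simulate_population(seed=s) for s in range(20)]
print([s for s in sizes if s is not None])
```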

krdav commented 7 years ago

Some plotting of GCtree vs. IgPhyML head to head: aggregation.pdf aggregation1.pdf aggregation2.pdf

GCtree performs a little better than IgPhyML on the MRCA measure of ASR correctness. The MRCA measure can be somewhat inflated by the fact that it scales with Ntaxa^2, and each hamming-distance error is further multiplied by the allele frequency. I still need to implement the new ASR metric, which I hope will be a bit more descriptive. That aside, both methods are impressively precise when it comes to getting both the topology and the ancestral states right.

wsdewitt commented 7 years ago

So I've been playing with summary stats on the simulations that use the raw (observable) data (no tree). One metric is the distribution of hamming distances of simulated sequences to the naive sequence. This plot shows the (kernel-density-smoothed) CDF of this for 100 simulations at fixed (reasonable) parameter values (colors) and for the Tas et al. data (black).

[figure: gctree simulation stats]

Clearly there are some differences (the real data tends to be farther away).
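The first summary stat is easy to state precisely. A toy sketch with hypothetical sequences; the plot above uses a kernel-density-smoothed version of this empirical CDF:

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def ecdf(values):
    """Empirical CDF as sorted (value, fraction <= value) pairs."""
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

naive = "AAAAA"  # hypothetical naive (germline) sequence
observed = ["AAAAT", "AATAT", "AAAAA", "ACTAT"]  # hypothetical sequences
dists = [hamming(naive, s) for s in observed]
print(sorted(dists))  # [0, 1, 2, 3]
print(ecdf(dists))    # [(0, 0.25), (1, 0.5), (2, 0.75), (3, 1.0)]
```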

Another thing I wanted to get at was allele frequency, and also the intuition that higher-frequency nodes have more mutant offspring. But remember: I don't want to have to use a tree. A poor-man's version of this is to build a genotype network for each data set and compute the degree distribution over observed sequences. This captures the intuition because higher-frequency alleles should have higher degree (ostensibly because they have more mutant offspring at one substitution distance). Here's the 2D distribution, again with 100 simulations (colors) and Tas et al. (black). There's definitely some overplotting going on in the plane, but it's tough to fix with hexbin since there are multiple scatters (maybe I'll try sizing the markers).

[figure: gctree simulation stats2]
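A sketch of the genotype-network construction, in pure Python with hypothetical genotypes: nodes are observed sequences, edges join pairs at hamming distance 1, and each node carries its abundance and degree.

```python
from itertools import combinations

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def degree_distribution(seq_counts):
    """Genotype network: nodes are observed sequences, edges connect pairs at
    hamming distance 1. Returns {sequence: (abundance, degree)}."""
    degree = {s: 0 for s in seq_counts}
    for a, b in combinations(seq_counts, 2):
        if hamming(a, b) == 1:
            degree[a] += 1
            degree[b] += 1
    return {s: (seq_counts[s], degree[s]) for s in seq_counts}

# Hypothetical observed genotypes with abundances.
counts = {"AAA": 5, "AAT": 2, "ATA": 1, "TTT": 1}
print(degree_distribution(counts))
# {'AAA': (5, 2), 'AAT': (2, 1), 'ATA': (1, 1), 'TTT': (1, 0)}
# The high-abundance genotype 'AAA' has the highest degree; 'TTT' is isolated.
```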

Hopefully something like this will allow us to make a case that simulations with some (other) reasonable parameter values produce realistic data according to all the summary stats we can think of (I'm open to suggestions).

wsdewitt commented 7 years ago

Closing this, since we have scons/nestly doing this now.