matsengrp / ecgtheow

Ancestral lineage reconstruction using BEAST or RevBayes
2 stars 2 forks source link

validate! #6

Open matsen opened 7 years ago

matsen commented 7 years ago

Of course, the next step after checking out the previous inferences is to validate on simulations.

This would be using @krdav 's framework.

I hope to see some plot relating inferred posterior probability of a sequence to how likely it is to be correct (i.e. in the simulated tree).

matsen commented 6 years ago

I also think that it would be nice to resolve whether the "Amrit correction" helps or not.

I had originally thought that it would be nice to do this as part of Kristian's work, but the scope for his is to apply existing ML and MP methods. Besides, there aren't any implementations of the Amrit correction in that sort of software.

We can also compare MrBayes unrooted with BEAST + clock. Will be quite enlightening, IMHO!

matsen commented 6 years ago

Kristian's code thus far is designed to mimic the Victora-type experiments in which we have cells extracted from one mouse germinal center.

We're interested in the human case. This are longer and less balanced trees. These will come from intermediate sampling timepoints as described here. We'll continue to discuss, but for the time being:

  1. Can we drive this way of using Kristian's code using our SCons?
  2. Can we use summary statistics such as the distribution of root-to-tip distances to make sure that we're hitting the right target simulation parameter regime?

Eventually we'd even be able to simulate from the extreme of broadly-neutralizing antibodies in which there is a very long period of affinity maturation. That's a bigger deal, as it may require us to change the target antibody as the simulation progresses.

dunleavy005 commented 6 years ago

This plot shows the "typical" root-to-tip distances from RevBayes trees for the BF520.1-igh seed (naive correction made little difference). image

This plot shows the root-to-tip distances from Kristian's simulator by running scons --simulate-data --T=10,60,130 --n=10,100,5 --lambda=2 --lambda0=0.365 --target-dist=10 --target-count=10 --carry-cap=1000 --skip-update=100. image

Clearly, by choosing spaced-out enough intermediate times and appropriate downsampling counts, we can match distributions pretty well.

dunleavy005 commented 6 years ago

image ^This is the simulated tree image.

matsen commented 6 years ago

Wonderful, @dunleavy005 ! I'm a little confused about the difference between the x axis labels though.

dunleavy005 commented 6 years ago

Branch lengths in revbayes trees are the expected number of DNA substitutions / site. Kristian's branch lengths are the number of DNA substitutions (for the whole 355-long sequence). They aren't comparable but I imagine I could get that scale by running it longer (dividing by 355 goes into 0.02-0.04 range, which isn't near 0.8-1).

matsen commented 6 years ago

OK, with that final tweak it seems like we're in good shape. 👍

dunleavy005 commented 6 years ago

This is the best FastTree fit to the BF520.1-igh cluster data (see the units, more reasonable): image

Not sure what revbayes is doing (asking Andy about it), but maybe we just accept that scalings are a bit different and not comparable. Note that I also ran the simulated data through FastTree and got distance magnitudes similar to the true simulated root-to-tip-distances shown above and this new ^ histogram, BUT ran the same data thru revbayes and got path distance magnitudes close to 1.

matsen commented 6 years ago

Super! Time to see how the programs fare in this parameter regime!

dunleavy005 commented 6 years ago

After speaking with Andy and doing some experimental testing, think we realized the problem with revbayes was over-diffuse exponential branch length priors.

Exp(10) ----> Exp(100) made a difference.

dunleavy005 commented 6 years ago

OK, after some fixing, here is the revbayes root-to-tip distance distribution: image

Here, is a simulated root-to-tip distance distribution using scons --simulate-data --T=50,105,150 --n=5,90,5 --lambda=2 --lambda0=0.365 --target-dist=15 --target-count=30 --carry-cap=1000 --skip-update=100 (15 AA's from naive-BF520.1, FYI). image

Good matching for this parameter setting!

P.S. here is the simulated tree SVG: image