matsengrp / cft

Clonal family tree
5 stars 3 forks source link

Ancestral sequence reconstruction #7

Closed lauradoepker closed 5 years ago

lauradoepker commented 7 years ago

David and I just created trees based on Duncan's analysis of the sequences I gained recently. It would be really great if there was a way to reconstruct the sequences at internal nodes of the trees. Ie. ancestral sequences that we didn't directly observe. Any thoughts? 🤔

dawahs commented 7 years ago

I sent this via email, but I'm not sure if it went through. Will reply directly here from now on. Also I believe @WSDeWitt has a way of reconstructing ancestral states through phylip. Correct me if I'm wrong?


The methods for this that I am familiar with are bppancestor and this modified version of nextflu ancestral sequence reconstruction: https://github.com/matsengrp/abphylo/blob/nextflu_validation/python/asr_nextflu.py

Though I think bppancestor by default does not output internal nodes.

wsdewitt commented 7 years ago

Yeah phylip can print the unobserved ancestral seqs, using either max parsimony (dnapars) or max likelihood (dnaml) programs. I ran your fasta through dnaml. The attached outtree.txt contains the newick tree, and the associated outfile.txt file contains a bunch of data including: an ascii tree with internal nodes numerically labelled; a section beginning "Probable sequences at interior nodes" that shows the sequences of all nodes in a (somewhat annoyingly phylipesque) wrapped format. Edit: deleted attachments, emailing instead.

dawahs commented 7 years ago

The methods for this that I am familiar with are bppancestor and this modified version of nextflu ancestral sequence reconstruction: https://github.com/matsengrp/abphylo/blob/nextflu_validation/python/asr_nextflu.py

Though I think bppancestor by default does not output internal nodes.

On Fri, Oct 21, 2016 at 1:32 PM, Laura Noges notifications@github.com wrote:

David and I just created trees based on Duncan's analysis of the sequences I gained recently. It would be really great if there was a way to reconstruct the sequences at internal nodes of the trees. Ie. ancestral sequences that we didn't directly observe. Any thoughts? 🤔

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/matsengrp/cft/issues/7, or mute the thread https://github.com/notifications/unsubscribe-auth/AOVwi62XJhy7ZNzkX2CASLAn2mMIbVWVks5q2SFPgaJpZM4KdjtE .

matsen commented 7 years ago

Just to point out some alternate means of doing ASR that are almost certainly faster than dnaml:

trvrb commented 7 years ago

RAxML will work. However, I didn't like the single pass ancestral reconstruction it does. Its algorithm is to walk up the tree computing partial likelihoods and then take the state with the maximum partial for each node.

When BEAST samples an ancestral state reconstruction, it does a forward/backward pass to compute partials, and then sample from the root downwards.

In TreeTime, Richard does a similar forward/backward to select the most likely root state and then update partials and on down the tree.

matsen commented 7 years ago

Thanks a bunch for the thoughts, @trvrb . Like you say in that issue, the bottom-up means of doing ASR is not computing the true marginal probabilities for bases being at various sites. So, I guess bye-bye RAxML. I'm not completely sold on progressively selecting the ML state either in this case with potentially long branch lengths. Given that we're trimming the tree now, I think we should be fine computationally using dnaml.

I suppose I'm implicitly believing that it's calculating things based on marginal probabilities, but I'll just ask Joe.

Another alternative would be to use bppancestor.

matsen commented 7 years ago

From Joe F himself:

We are using dnaml over here for ancestral sequence reconstruction. I just wanted to double check something with you that wasn't obvious to me from reading the documentation.

I wanted to make sure that dnaml does ASR using true marginal probabilities of states at internal nodes, which requires likelihood vectors coming up from the tips and down from the root. In constrast, it turns out that RAxML does ancestral sequence reconstruction only with a bottom-up pass.

Yes, Dnaml does ancestral sequence reconstruction (when you turn on the proper option) by pruning likelihood inwards toward that node from all directions (upward and downward) so as to have the result depend on the states at all tips, properly. I am startled to hear that RAxML does not do this, but perhaps RAxML had the partial results from the top-down pass already lying around in the nodes so it can take both into account. In that case, when using the same nucleotide substitution model, both programs should always get the same result, aside from rounding error.

matsen commented 7 years ago

Joe in a subsequent email apologized for the lack of models in phylip. What are we using now, @WSDeWitt ?

@vnminin suggested that we sample ancestral sequences.

wsdewitt commented 7 years ago

I just used the default: one category of site, constant rate. This is something we could play with. dnaml lets us specify the number of categories, and the model for rate variation: invariant, gamma, mixed gamma and invariant, general HMM (with patch size for autocorrelation of rates).

matsen commented 7 years ago

Read the manual some and o_O. Things have progressed a little since the last dnaml update. Let's just use 4 gamma categories with alpha = 0.5.

lauradoepker commented 7 years ago

@matsen Update: I've taken a look at the indels for seed 043 (Vh, late LN4 timepoint) and it looks like the indel sequences that were pulled out are largely the same V and J, while the CDR3s (ie. the NTI and D genes) are very different. This case study makes me think that the indels are indeed not worth pursuing since they are so different from the naive and seed sequences.

matsen commented 7 years ago

4 gamma categories with alpha = 0.5 <- @WSDeWitt , is this implemented? If so, close this issue.

eharkins commented 5 years ago

Recently with @lauradoepker 's qa013 data, there arose a question about branch length normalization because @psathyrella 's local branching metrics script was saying dnaml trees had max branch depth greater than 1 even though we thought branch lengths were being normalized by evolutionary distance by dnaml. @afmagee 's and @psathyrella 's theory for why these trees were so long was basically:

Without a rates-across-sites model, sites that evolve very quickly will drag the distances up way more than makes sense if we use the model to predict the number of differences. A very screwy transition-transversion ratio can also screw this up a bit, in that the branch lengths have to go way up to account for the observed number of transitions when really they're just way more common than transversions. I don't know how these shm models fold down into standard 4x4 CTMC rate matrices, but I'd be surprised if they didn't have pretty large differences (and it looks like dnaml assumes a ratio of 2, which is I think a bit lower than I'd expect, but it's parameterized differently from the parameter that I know how to interpret)

This promtped a comparison of dnaml using raxml as control to see if raxml would optimize and arrive at a different answer for this transition-transversion ratio, which we were setting statically with dnaml at 2.

Andy suggested we compared total lengths of the dnaml, raxml trees; they look very different:

(cft) eharkins@quokka:/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013$ fd comparison -x cat
{"dnaml_tree_length": 13.811449999999992, "raxml_tree_length": 9.562892999999999}
{"dnaml_tree_length": 14.394819999999996, "raxml_tree_length": 8.269457000000015}
{"dnaml_tree_length": 13.471299999999996, "raxml_tree_length": 10.034935999999993}
{"dnaml_tree_length": 23.558949999999996, "raxml_tree_length": 8.17013000000002}
{"dnaml_tree_length": 20.535890000000013, "raxml_tree_length": 5.976603000000015}
{"dnaml_tree_length": 28.949550000000016, "raxml_tree_length": 8.328276000000008}
{"dnaml_tree_length": 21.095760000000016, "raxml_tree_length": 6.7246570000000085}
{"dnaml_tree_length": 21.917960000000004, "raxml_tree_length": 7.420377000000014}
{"dnaml_tree_length": 18.510790000000014, "raxml_tree_length": 8.226055000000013}
{"dnaml_tree_length": 18.932889999999986, "raxml_tree_length": 7.850547000000018}
{"dnaml_tree_length": 16.686429999999998, "raxml_tree_length": 10.392825999999985}
{"dnaml_tree_length": 18.452249999999996, "raxml_tree_length": 7.9741440000000186}
{"dnaml_tree_length": 9.660539999999994, "raxml_tree_length": 9.005452999999994}
{"dnaml_tree_length": 24.717340000000043, "raxml_tree_length": 7.476995000000013}
{"dnaml_tree_length": 16.01378999999999, "raxml_tree_length": 5.005726000000007}
{"dnaml_tree_length": 15.734530000000005, "raxml_tree_length": 6.572259000000018}
{"dnaml_tree_length": 22.471440000000005, "raxml_tree_length": 6.76704500000002}
{"dnaml_tree_length": 13.330339999999977, "raxml_tree_length": 4.382172000000002}

I'm posting this here because @psathyrella and I are not comfortable making an executive decision on tree building programs or parameters, and this seems like where the discussion left off re: choosing a method of ASR. cc @matsen @metasoarous

psathyrella commented 5 years ago

Great, nice summary. So it seems like we know that dnaml is giving us pathologically long trees in some significant fraction of cases. I think what we're not sure of is whether that means we should fiddle with dnaml parameters, use a different program, or not worry about it. Laura didn't seem terribly fussed about the issue when we told her, presumably since she's not using the tree lengths at this point. Of course even if that's the case, that doesn't mean six months from now she won't start using them, having at that point forgotten that they're terrible.

The issue is also a problem for my lb metric stuff, since it effectively changes the tau value by a factor of 2 to 4. But I'm not really worried about it just now, in practice it just means my previous assumption that I should use the cft tree since it's more accurate isn't correct.

afmagee commented 5 years ago

We also might worry about the topology itself. It's possible that the dnaml trees are just longer (by a factor of 2 or more in many cases). But if the branch lengths can be that far off, it seems worth worrying about the topology in general.

metasoarous commented 5 years ago

Thanks for putting this information together @eharkins.

Dnaml may give us wonky branch lengths, but unfortunately we have yet to find software that does as a good a job at ancestral sequence reconstruction. And this is not for lack of trying (#130, #131, #149, #170). To summarize, we tested out about a half dozen different programs, and they more or less all sucked for one reason or another. In some cases the results they were outputting were clearly mangled. In other's they just didn't stack up to the simulation-based benchmarks that @krdav put together. So while I'd love nothing more than to use a different piece of software, we're likely best off sticking with dnaml for the time being.

If we can get more reasonable branch lengths from dnaml by adjusting the model, I'm all for that.

afmagee commented 5 years ago

How difficult is it going to be to add steps to the pipeline? There would need to be a step before running dnaml that estimates these parameters, and getting better parameter values is not trivial without a program that does tree inference.

One could do some counting-based approximations to get in the ballpark for the transition-transversion ratio, but in all my exploration of hacky cheap ways to estimate ASRV parameters from the sequences without trees, I have never been particularly successful.

If it is possible to insert something like RAxML into the pipeline, then decent parameter estimates are possible. In this case, dnaml does not need to do as much work to optimize the tree if the RAxML tree is fed in as a starting tree, or indeed any work if the RAxML tree is fed in and tree optimization is turned off, leaving dnaml only to evaluate ancestral states.

metasoarous commented 5 years ago

Adding steps to the pipeline is not difficult. We could easily add a step to run RAxML for parameter estimates and/or starter/guide trees. I think if we were to get too different in what we're doing here, it would be worth running @krdav's tests again (can you please point us to code for these Kristian?).

It might also be worth looking back through the software we tested in the past and seeing if any of the more egregious bugs have been fixed with the software we thought would be most promising, but ended up having issues in the output, like Prank.

psathyrella commented 5 years ago

While it's certainly not doing phylogenetics, the partis parameters have an awful lot of information about the mutation in the sample, and I'd find it hard to believe that we couldn't get a good enough estimate to at least avoid the kind of pathology we're seeing. And they'd already be sitting there, so it'd just be a matter of a few lines of python to read in what you need.

eharkins commented 5 years ago

Here are the comparisons between original dnaml runs and raxml:

[
{"dnaml_tree_length": 14.394819999999996, "rf_dist": 322, "raxml_tree_length": 8.269457000000015, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-l-d385-m8/unseeded--50k-isub-1-part-0/clust-21/min_adcl-dnaml"}, 
{"dnaml_tree_length": 13.811449999999992, "rf_dist": 308, "raxml_tree_length": 9.562892999999999, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-l-d385-m8/unseeded--50k-isub-2-part-0/clust-20/min_adcl-dnaml"}, 
{"dnaml_tree_length": 13.471299999999996, "rf_dist": 310, "raxml_tree_length": 10.034935999999993, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-l-d385-m8/unseeded--50k-isub-0-part-0/clust-20/min_adcl-dnaml"}, 
{"dnaml_tree_length": 28.949550000000016, "rf_dist": 238, "raxml_tree_length": 8.328276000000008, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-l-d385/unseeded--50k-isub-2-part-0/clust-0/min_adcl-dnaml"}, 
{"dnaml_tree_length": 20.535890000000013, "rf_dist": 254, "raxml_tree_length": 5.976603000000015, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-l-d385/unseeded--50k-isub-2-part-0/clust-1/min_adcl-dnaml"}, 
{"dnaml_tree_length": 21.095760000000016, "rf_dist": 240, "raxml_tree_length": 6.7246570000000085, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-l-d385/unseeded--50k-isub-0-part-0/clust-0/min_adcl-dnaml"}, 
{"dnaml_tree_length": 15.734530000000005, "rf_dist": 252, "raxml_tree_length": 6.572259000000018, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-g-d765/unseeded--50k-isub-1-part-0/clust-0/min_adcl-dnaml"}, 
{"dnaml_tree_length": 23.558949999999996, "rf_dist": 282, "raxml_tree_length": 8.17013000000002, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-g-dm351/unseeded--50k-isub-0-part-0/clust-1/min_adcl-dnaml"}, 
{"dnaml_tree_length": 21.917960000000004, "rf_dist": 270, "raxml_tree_length": 7.420377000000014, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-l-d765/unseeded--50k-isub-0-part-0/clust-2/min_adcl-dnaml"}, 
{"dnaml_tree_length": 24.717340000000043, "rf_dist": 276, "raxml_tree_length": 7.476995000000013, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-l-dm351/unseeded--50k-isub-1-part-0/clust-3/min_adcl-dnaml"}, 
{"dnaml_tree_length": 16.01378999999999, "rf_dist": 222, "raxml_tree_length": 5.005726000000007, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-l-dm351/unseeded--50k-isub-2-part-0/clust-0/min_adcl-dnaml"}, 
{"dnaml_tree_length": 22.471440000000005, "rf_dist": 258, "raxml_tree_length": 6.76704500000002, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-l-dm351/unseeded--50k-isub-0-part-0/clust-3/min_adcl-dnaml"}, 
{"dnaml_tree_length": 18.510790000000014, "rf_dist": 312, "raxml_tree_length": 8.226055000000013, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-l-d765-m8/unseeded--50k-isub-1-part-0/clust-7/min_adcl-dnaml"}, 
{"dnaml_tree_length": 18.932889999999986, "rf_dist": 308, "raxml_tree_length": 7.850547000000018, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-l-d765-m8/unseeded--50k-isub-2-part-0/clust-7/min_adcl-dnaml"}, 
{"dnaml_tree_length": 16.686429999999998, "rf_dist": 328, "raxml_tree_length": 10.392825999999985, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-l-d765-m8/unseeded--50k-isub-2-part-0/clust-9/min_adcl-dnaml"}, 
{"dnaml_tree_length": 18.452249999999996, "rf_dist": 304, "raxml_tree_length": 7.9741440000000186, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-l-d765-m8/unseeded--50k-isub-0-part-0/clust-7/min_adcl-dnaml"}, 
{"dnaml_tree_length": 9.660539999999994, "rf_dist": 312, "raxml_tree_length": 9.005452999999994, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-l-d765-m8/unseeded--50k-isub-0-part-0/clust-18/min_adcl-dnaml"}, 
{"dnaml_tree_length": 13.330339999999977, "rf_dist": 238, "raxml_tree_length": 4.382172000000002, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-l-d385/QA013.2-igl/seed-part-0/seed-cluster/min_adcl-dnaml"}
]

Here are the comparisons between raxml-optimized parameters (t-ratio, alpha) dnaml runs and raxml:

[
{"dnaml_tree_length": 6.892220000000002, "rf_dist": 332, "raxml_tree_length": 8.269457000000015, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-l-d385-m8/unseeded--50k-isub-1-part-0/clust-21/min_adcl-dnaml"}, 
{"dnaml_tree_length": 8.114159999999995, "rf_dist": 320, "raxml_tree_length": 9.562892999999999, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-l-d385-m8/unseeded--50k-isub-2-part-0/clust-20/min_adcl-dnaml"}, 
{"dnaml_tree_length": 7.609410000000001, "rf_dist": 328, "raxml_tree_length": 10.034935999999993, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-l-d385-m8/unseeded--50k-isub-0-part-0/clust-20/min_adcl-dnaml"}, 
{"dnaml_tree_length": 8.491939999999992, "rf_dist": 256, "raxml_tree_length": 8.328276000000008, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-l-d385/unseeded--50k-isub-2-part-0/clust-0/min_adcl-dnaml"}, 
{"dnaml_tree_length": 6.16966, "rf_dist": 274, "raxml_tree_length": 5.976603000000015, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-l-d385/unseeded--50k-isub-2-part-0/clust-1/min_adcl-dnaml"}, 
{"dnaml_tree_length": 6.754399999999997, "rf_dist": 256, "raxml_tree_length": 6.7246570000000085, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-l-d385/unseeded--50k-isub-0-part-0/clust-0/min_adcl-dnaml"}, 
{"dnaml_tree_length": 5.85395, "rf_dist": 284, "raxml_tree_length": 6.572259000000018, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-g-d765/unseeded--50k-isub-1-part-0/clust-0/min_adcl-dnaml"}, 
{"dnaml_tree_length": 7.385840000000001, "rf_dist": 304, "raxml_tree_length": 8.17013000000002, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-g-dm351/unseeded--50k-isub-0-part-0/clust-1/min_adcl-dnaml"}, 
{"dnaml_tree_length": 7.252199999999998, "rf_dist": 298, "raxml_tree_length": 7.420377000000014, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-l-d765/unseeded--50k-isub-0-part-0/clust-2/min_adcl-dnaml"}, 
{"dnaml_tree_length": 7.635629999999993, "rf_dist": 294, "raxml_tree_length": 7.476995000000013, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-l-dm351/unseeded--50k-isub-1-part-0/clust-3/min_adcl-dnaml"}, 
{"dnaml_tree_length": 5.119679999999998, "rf_dist": 230, "raxml_tree_length": 5.005726000000007, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-l-dm351/unseeded--50k-isub-2-part-0/clust-0/min_adcl-dnaml"}, 
{"dnaml_tree_length": 7.177330000000001, "rf_dist": 272, "raxml_tree_length": 6.76704500000002, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-l-dm351/unseeded--50k-isub-0-part-0/clust-3/min_adcl-dnaml"}, 
{"dnaml_tree_length": 6.57643, "rf_dist": 324, "raxml_tree_length": 8.226055000000013, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-l-d765-m8/unseeded--50k-isub-1-part-0/clust-7/min_adcl-dnaml"}, 
{"dnaml_tree_length": 6.807979999999998, "rf_dist": 316, "raxml_tree_length": 7.850547000000018, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-l-d765-m8/unseeded--50k-isub-2-part-0/clust-7/min_adcl-dnaml"}, 
{"dnaml_tree_length": 6.2418, "rf_dist": 340, "raxml_tree_length": 10.392825999999985, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-l-d765-m8/unseeded--50k-isub-2-part-0/clust-9/min_adcl-dnaml"}, 
{"dnaml_tree_length": 6.66457, "rf_dist": 310, "raxml_tree_length": 7.9741440000000186, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-l-d765-m8/unseeded--50k-isub-0-part-0/clust-7/min_adcl-dnaml"}, 
{"dnaml_tree_length": 7.6156000000000015, "rf_dist": 322, "raxml_tree_length": 9.005452999999994, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-l-d765-m8/unseeded--50k-isub-0-part-0/clust-18/min_adcl-dnaml"}, 
{"dnaml_tree_length": 4.48452, "rf_dist": 246, "raxml_tree_length": 4.382172000000002, "outdir": "/home/matsengrp/working/eharkins/cft/output/qa013-synth-v19/QA013/QA013-l-d385/QA013.2-igl/seed-part-0/seed-cluster/min_adcl-dnaml"}
]

In summary, the RF distance isn't much improved it seems like, but the tree lengths are way closer to one another having optimized parameters.

matsen commented 5 years ago

I'm going to close this issue in favor of #170 , which has a more entertaining title.