Closed metasoarous closed 4 years ago
Thanks for thinking more broadly, @metasoarous .
I agree that things are a little outta hand. I had also thought about the PAUP/PRANK combination. I think that would be great, but I'd suggest doing some quick benchmarking first to make sure that PRANK is 🎯 . PRANK is primarily a multiple sequence alignment program, and has been rigorously benchmarked for that, but I'd guess less so for ASR.
My suggestion is that you just try out this combo and see if it basically works. If so, can you make it easy for @krdav to run some benchmarking? It doesn't have to be super extensive but I want to make sure it's not totally busted.
As mentioned elsewhere, we're putting this on ice till we resolve the issues with prank mentioned in #131.
The hot new ML tree program is called IQ-TREE, and apparently it does ASR: https://github.com/Cibiv/IQ-TREE/issues/9 . This would require testing first, given our experience with PRANK.
Good find! That would be great. We could also try the other program he mentions there if we have issues.
No parsimony, which I think used to be a criterion, but perhaps not now.
I was struck by how much extra code complexity phylip brings in. 😬
Yeah... it is major pain I wouldn't miss.
IQ-TREE's fast bootstrap approximation stuff seems nice as well.
I will be testing IQ-TREE, like I did with PRANK and other tools. Hopefully there is no funny results this time.
Update on IQ-TREE testing: I think we might have a better alternative to dnaml. Most importantly: there is no shuffling of ancestral nodes like with PRANK.
From the few tests I have by now it looks solid. The model option -m 000000
(JC) is a bit worse than dnaml default and -m 010010
(HKY) is a bit better. I actually thought that dnaml itself is running an HKY model, but I guess tree sampling is different.
The best thing about IQ-TREE is the speed and ease of running it. It is much faster and it is easier to run and parse output from IQ-TREE compared to dnaml.
And just to elaborate on the performance of IQ-TREE: A 500 taxa tree only took 7 mins.
You might want to skip the FastTree step altogether?
Great work @krdav! Thanks for your report.
I think we'll still want to do a trimming step to keep the viz sane. Also, I'd be a little worried that throwing all the sequences into the reconstruction would yield a lot of noise given the error rates.
Sure, trimming is necessary either way, but in regards to the ancestral reconstruction I don't see any advantage of doing this after trimming vs. before. The error rate should be the same, though the number of errors might be less.
Hmm, well maybe ^ is wrong in these case where you are trimming away leafs there are obviously running wild.
Perhaps @lauranoges and the ecgtheow group could weigh in here? They've been looking into the effect of sampling on the reconstructions.
From all our experiments in ecgtheow, it is unclear the right amount of trimming needed (if at all). It had originally been for speed purposes on your end, I believe, and BEAST can't handle the large taxa families in reasonable amounts of time as well. However, in a few cases, we have seen that the tree (and ASR) can be particularly sensitive to rogue/spurious taxa that show up from pruning steps. I imagine that with more spurious taxa left un-pruned, the rogue taxa problem may have a higher chance of popping up for a given CF. Even then, CFT is doing ML point estimation, which accounts for less uncertainty than does BEAST so pruning seems like a safe step to keep, although I will mention that @lauranoges and I have not properly investigated ASR invariance w.r.t. pruning vs. no pruning in ecgtheow.
It sounds like it would be reasonable to use IQ-TREE instead of fast tree and then use it again after pruning? Unless we don’t care enough pre-pruning to spend the time on it? (Remember we are creating hundreds of trees since we have so many seeds, not to mention the unseeded clusters for each patient sample). Hopefully I’m understanding correctly.
Either way, I’m eager to take a look at the IQ-TREE work but won’t be able to until back in town next week. Vienna is amazing....
Sent from my iPhone
On Nov 18, 2017, at 12:09 AM, Amrit Dhar notifications@github.com wrote:
From all our experiments in ecgtheow, it is unclear the right amount of trimming needed (if at all). It had originally been for speed purposes on your end, I believe, and BEAST can't handle the large taxa families in reasonable amounts of time as well. However, in a few cases, we have seen that the tree (and ASR) can be particularly sensitive to rogue/spurious taxa that show up from pruning steps. I imagine that with more spurious taxa left un-pruned, the rogue taxa problem may have a higher chance of popping up for a given CF. Even then, CFT is doing ML point estimation, which accounts for less uncertainty than does BEAST so pruning seems like a safe step to keep, although I will mention that @lauranoges and I have not investigated ASR invariance w.r.t. pruning vs. no pruning in ecgtheow.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.
@lauranoges I guess if it only took a little longer, it might be worth it, but I'd expect FastTree to be quite a bit faster.
Currently we're looking at RAxML for topology and model parameters, then DNAML for ASR.
I wish I had a :frankensteins-monster: emoji...
@matsen in comparing raxml-ng ASR to dnaml, trying to control all the variables we can like input tree, model parameters, etc, for a particular test case there were 7 ancestors (among 98) that dnaml reconstructed that raxml-ng did not reconstruct exactly. Comparing those 7 from dnaml to the closest (pairwise distance) raxml-ng reconstructed ancestor (hopefully this would correspond to the same ancestor in the tree) the maximum distance was 23 base pairs out of 336.
@afmagee says
I think at this point it's worth chatting with Erick, since (1) I'm out of ideas on how to confirm that this is exactly the right number but more importantly (2) if this is correct, then one-stop shopping in RAxML-ng may be possible and that would be nice for the pipeline
Thoughts on what should come next? Are there better ways to measure raxml-ng reconstructed ancestors in comparison to dnaml counterparts or should we go with this / repeat for other cases?
It is worth noting that we're doing this comparison because RAxML-ng claims to do the correct ASR approach, hence we could in theory just use RAxML-ng for everything we want to do*. Thus we set out to test if it was comparable by doing what should be the exact same ASR in both RAxML-ng and dnaml (same tree, no optimizing anything so fixed substitution model and branch lengths).
Of course this comparison has proved a total pain. RAxML-ng and dnaml number internal nodes differently so we're having a hard time ensuring that they're doing the same thing. What @eharkins is presenting is the best-case scenario, and the only comparison I was able to think of without pouring over source code in both programs. We take all unique ancestral sequences from dnaml and RAxML-ng and assume that if we find an exactly matched pair, they're the same reconstruction of the same node. This leaves 7 mismatches, which we then paired off into the 7 most closely matched pairs, which had no more than 23 mismatches out of 336 sites (and as few as 1).
I can't explain why there are any differences. RAxML-ng and dnaml use slightly different measures to compute the 4 rate categories under +G, which could drive the differences (as they are relatively few). Other than that I'm stumped.
*In fact, this would be very nice. These dataset have a peculiar lack of transitions that dnaml cannot account for at all, as it assumes transitions happen at least as often as they should under F81. So the ASR under RAxML-ng with GTR could be better if it's doing real ASR and not shortcut-ASR.
Yes, this would be hugely helpful. Thank you for helping out with this @afmagee .
The ideal would be to run https://github.com/matsengrp/bcr-phylo-benchmark to see if the new RAx is as good as dnaml (IQTree wasn't).
The main thing this requires is to write a file like this to parse the ASR output. I don't know how the RAx ASR output comes out and if it's close to any of the formats that were tested in the paper.
@matsen sounds good, I can do that. I think the output of raxml-ng may be unique for asr, but I will double check before I begin.
Regardless, the alternative we are testing against here is dnaml with optimized parameters being specified (transition / transversion ratio, the gamma rate parameter called alpha); doing so in bcr-phylo-benchmark would require extending code like https://github.com/matsengrp/bcr-phylo-benchmark/blob/master/bin/phylip_config.py to allow using the same tree, model, etc when comparing raxml-ng against dnaml for ASR only.
Well, hm, dnaml with optimized parameters being specified is yet another option, isn't it.
I would be happy to drop that option if we could show that RAx is equal to dnaml in Kristian's benchmarks.
Does that make sense?
Yes, we were trying to control other variables to just compare ASR methods, but if at the end of the day raxml-ng proves better than dnaml in bcr-phylo-benchmark terms, then we should go with that.
If not, we have alleviated some branch length issues (see #7 for a discussion of the issue) by modifying the dnaml model with optimized parameters from raxml. So if raxml-ng is ranked lower than dnaml by benchmark tests, the modified dnaml method may be worth benchmarking against default dnaml.
@eharkins this morning I've been thinking about how for any real work we're doing we are using linearham, and that this work is really just for visual display. So does it merit such a substantial amount of work?
I'd like to know that it's not-crazy, and that's just about it. I was thinking that we could check in with the authors and ask about what sorts of benchmarking/validation they have done. Does that seem reasonable, @afmagee @lauradoepker ?
Sorry for the whiplash.
@matsen The problem was identified by @psathyrella, who I believe uses CFT to get trees for local branching index calculations? So my understanding is that these trees are getting used and thus the pipeline needs to be fixed to get better branch lengths on that basis. I don't know beyond that how important it is to have good ASRs, but I do like the idea of not forcing Eli into doing more work than he's already put into this if we don't have to.
Yeah, the real ASR that @lauradoepker uses is done with linearham.
Kristian's repo is really testing ASR.
I'd trust the raxml folks to do a reasonable job with topology, even if their previous implementation of ASR was pretty wonky. Right?
If everyone agrees with these goals, what other straightforward checks should we use (if any) to make sure it's not fundamentally broken? (Shall we just email their listserv for evidence?)
Thanks, @afmagee !
That was indeed how I found the problem, but I was already convinced like a month ago to never trust the branch lengths any of these things give me. So, yeah, I think the yak is looking pretty scanty.
The plan after discussion IRL with @matsen and @afmagee I think can be summarized by 🔪 ~dnaml~, 🌱 raxml-ng, barring any surprises. I will update as necessary.
testing and cleaning up after making raxml-ng the default, it looks like 4 is the minimum number of sequences allowed by raxml-ng, not 3 (our current minimum in CFT). Ok to enforce 4 sequences min from here on out?
All my tree stuff has 10 as the default min, although I sometimes reduce that to 5. I really don't think there's very many scenarios we're likely to encounter where you can do something interesting with a 5 sequence tree.
Sounds good, I will require 4 sequences to process a clonal family in CFT. cc @lauradoepker @mmshipley
This means such clonal families wont get run through CFT at all; if we want to change it so that it processes the cluster annotation into a fasta, does everything else up until tree-building, I can do that.
Do we still use .svg
plots of maximum likelihood trees made by CFT?
e.g.
This seems like something made to be displayed by CFT web, and now that we have Olmsted I'd like to remove it if we are not using it.
If this is an important plot to have from CFT directly, then I will add this as part of the raxml-ng analysis since right now these svg plots only get created if dnaml is run.
After checking in with @mmshipley, the above PR #302 introduces raxml-ng as the default and:
Can we do a few spot-checks to compare results? For example, on a few lineages of interest for @mmshipley and on the one very strange tree we were getting because of model-misspecification?
On Fri, Jan 10, 2020 at 4:00 PM Elias Harkins notifications@github.com wrote:
After checking in with @mmshipley https://github.com/mmshipley, the above PR #302 https://github.com/matsengrp/cft/pull/302 introduces raxml-ng as the default and:
- enforces 4 seqs min/ clonal family
- does not make svg plots
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/matsengrp/cft/issues/170?email_source=notifications&email_token=AAA3QRBQOUHTO56XPXUKKRLQ5ED2VA5CNFSM4DJF3R7KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEIVSBHI#issuecomment-573251741, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAA3QRCGRA7QY6DOOQXK73LQ5ED2VANCNFSM4DJF3R7A .
-- Frederick "Erick" Matsen, Associate Member Fred Hutchinson Cancer Research Center http://matsen.fredhutch.org/
@matsen if I recall correctly the
very strange tree we were getting because of model-misspecification
were strange because they had really long branches. Running one of those same clonal families with raxml-ng gives a tree with less than half the total branch length of the dnaml tree, so it seems like it addresses that issue.
I ran also on one of @mmshipley's lineages of interest which can be seen on this local server. The first two clonal families in the table there should be the dnaml version of that family, then the raxml-ng. The latter looks reasonable compared to the former, but @matsen and @mmshipley should have a look to verify that.
Great!
@psathyrella would you want to look at the new trees?
@mmshipley the trees look fine to me, but perhaps you could take a look?
Well the original problem was that I had a check in my tree metric code that tries to determine if the trees are scaled correctly, since the lbi calculation is garbage if they aren't (since the decay length will be wrong). And, yes, it was getting tripped because it was getting trees where the annotation had like 10% mutation (depth 0.1) but the trees had depth two or three times that, and in some cases even greater than 1 (which was clearly spurious).
So the only way I can check is running my tree metric code on cft output, which I should do anyway to get mackenzie the tree metrics. Unfortunately my detailed check comparing shm to tree depth is too slow to turn on by default, but if the egregiuosly-too-long branches are still there, they'll still get caught by the hackey check. I'll also try to do some running with the better/slow check turned on.
To me the trees look fairly similar to one another. One thing I noticed is that in the RAxML tree, the seed sequence has a greater evolutionary distance from the inferred naive compared to the dnaml tree. I'm not sure whether this is expected based on differences in the two methods or not, but figured I would point it out.
Thank you Eli for running this comparison
@psathyrella let me know if I can help by providing cft output (dnaml and raxml-ng) in both the cases of:
hackey check.
That check is run by default in CFT and did not seem to get triggered by raxml-ng trees in the same way that it did for the dnaml trees for case 2.
ok, looks like it fixed it. I ran the detailed check ^ on the old and new trees from the family that originally caused the problem:
./bin/partis get-tree-metrics --outfname /fh/fast/matsen_e/processed-data/partis/qa013-synth/v19/partitions/QA013-g-d765-50k-isub-1/partition.yaml --treefname _output/cft-raxml-test/output/control-qa013-synth-v19/QA013/QA013-g-d765/unseeded--50k-isub-1-part-9/clust-0/min_adcl-dnaml/asr.nwk --cluster-indices 0 --tree-metric-fname lb.yaml
./bin/partis get-tree-metrics --outfname /fh/fast/matsen_e/processed-data/partis/qa013-synth/v19/partitions/QA013-g-d765-50k-isub-1/partition.yaml --treefname _output/cft-raxml-test/output/raxmlng-qa013-synth-v19/QA013/QA013-g-d765/unseeded--50k-isub-1-part-9/clust-0/min_adcl-raxml_ng/ASR.nwk --cluster-indices 0 --tree-metric-fname lb.yaml
old dnaml:
warning tree depth and mfreq differ by more than 50% for 99/99 nodes for inf tree
tree depth mfreq frac diff
1.2060 0.2833 0.7651 52788-1
0.3377 0.0944 0.7203 163401-1
0.6154 0.1722 0.7201 57651-1
0.7134 0.2028 0.7158 8076-1
0.6922 0.2056 0.7030 105629-1
0.6913 0.2056 0.7026 27879-1
0.6720 0.2000 0.7024 78954-1
0.5035 0.1528 0.6966 166797-1
0.5756 0.1750 0.6960 131714-1
0.4976 0.1528 0.6930 120292-1
0.5157 0.1583 0.6930 67228-1
0.5062 0.1556 0.6927 14414-1
0.5419 0.1667 0.6925 41527-1
0.5409 0.1667 0.6919 74261-1
0.6216 0.1917 0.6917 131753-1
0.5397 0.1667 0.6912 149545-1
0.6164 0.1917 0.6891 65479-1
[...]
warning pairwise distance from tree and sequence differ by more than 50% for 4851/4851 node pairs for inf tree
pairwise
tree dist seq dist frac diff
0.0192 0.0028 0.8558 2205-2 132-21
0.0190 0.0028 0.8540 58447-1 89435-1
0.0378 0.0055 0.8536 79905-1 85299-1
1.3986 0.3000 0.7855 74261-1 52788-1
1.4288 0.3083 0.7842 151770-1 52788-1
0.0254 0.0055 0.7823 1415-3 145702-1
0.0254 0.0055 0.7821 2205-2 99073-1
0.0254 0.0055 0.7820 58447-1 76047-1
0.0254 0.0055 0.7815 58447-1 59-37
0.0254 0.0055 0.7815 58447-1 132-21
0.0253 0.0055 0.7815 58447-1 58-38
0.0253 0.0055 0.7815 58447-1 154-18
0.0253 0.0055 0.7811 58447-1 8231-1
0.0253 0.0055 0.7809 58447-1 53-40
0.0253 0.0055 0.7809 58447-1 116429-1
[...]
vs for raxml-ng:
tree depth and mfreq differ by more than 50% for 0/99 nodes for inf tree
pairwise distance from tree and sequence differ by more than 50% for 6/4851 node pairs for inf tree
pairwise
tree dist seq dist frac diff
0.0088 0.0028 0.6862 515-7 143827-1
0.0118 0.0055 0.5299 145702-1 1415-3
0.0118 0.0055 0.5299 1415-3 2205-2
0.0118 0.0055 0.5299 89435-1 131517-1
0.0118 0.0055 0.5294 2216-2 79259-1
0.0118 0.0055 0.5291 85299-1 79905-1
and for the QA013.2 seed family doing the same thing
dnaml:
tree depth and mfreq differ by more than 50% for 0/99 nodes for inf tree
warning pairwise distance from tree and sequence differ by more than 50% for 58/4851 node pairs for inf tree
pairwise
tree dist seq dist frac diff
0.0106 0.0175 0.6503 QA013-g-d765-m8-MIG3496-12 QA013-g-dm379-m8-MIG6963-1
0.0142 0.0233 0.6471 QA013-g-d765-m8-MIG3496-12 QA013-g-dm379-m8-MIG13283-1
0.0124 0.0204 0.6458 QA013-g-dm379-m8-MIG8176-1 QA013-g-d765-m8-MIG3496-12
0.0196 0.0321 0.6396 QA013-g-d765-m8-MIG3496-12 QA013-g-dm351-m8-MIG3367-12
0.0161 0.0262 0.6318 QA013-g-d765-m8-MIG3496-12 QA013-g-dm379-m8-MIG5264-2
0.0254 0.0414 0.6314 QA013-g-d385-m8-MIG3651-11 QA013-g-dm351-m8-MIG3892-9
0.0235 0.0380 0.6196 QA013-g-dm379-m8-MIG12594-1 QA013-g-dm351-m8-MIG3367-12
0.0181 0.0292 0.6181 QA013-g-dm379-m8-MIG12594-1 QA013-g-dm379-m8-MIG13283-1
0.0289 0.0466 0.6141 QA013-g-d765-m8-MIG3496-12 QA013-g-d385-m8-MIG8713-1
0.0163 0.0263 0.6135 QA013-g-dm379-m8-MIG8176-1 QA013-g-dm379-m8-MIG12594-1
0.0145 0.0234 0.6121 QA013-g-dm379-m8-MIG12594-1 QA013-g-dm379-m8-MIG6963-1
0.0272 0.0439 0.6095 QA013-g-dm379-m8-MIG14032-1 QA013-g-dm351-m8-MIG3367-12
0.0200 0.0322 0.6090 QA013-g-dm379-m8-MIG12594-1 QA013-g-dm379-m8-MIG5264-2
0.0218 0.0351 0.6058 QA013-g-dm379-m8-MIG14032-1 QA013-g-dm379-m8-MIG13283-1
0.0055 0.0089 0.6050 QA013-g-dm379-m8-MIG2100-151 QA013-g-d385-m8-MIG3651-11
0.0328 0.0526 0.6041 QA013-g-dm379-m8-MIG12594-1 QA013-g-d385-m8-MIG8713-1
0.0201 0.0322 0.6010 QA013-g-dm379-m8-MIG8176-1 QA013-g-dm379-m8-MIG14032-1
0.0238 0.0380 0.5991 QA013-g-dm379-m8-MIG14032-1 QA013-g-dm379-m8-MIG5264-2
0.0183 0.0292 0.5987 QA013-g-dm379-m8-MIG14032-1 QA013-g-dm379-m8-MIG6963-1
0.0256 0.0408 0.5963 QA013-g-dm379-m8-MIG13263-1 QA013-g-d765-m8-MIG3496-12
0.0295 0.0468 0.5870 QA013-g-dm379-m8-MIG13263-1 QA013-g-dm379-m8-MIG12594-1
0.0037 0.0058 0.5802 QA013-g-d765-m8-MIG3496-12 QA013-g-dm379-m8-MIG10581-1
[...]
vs for raxml-ng:
tree depth and mfreq differ by more than 50% for 0/99 nodes for inf tree
pairwise distance from tree and sequence differ by more than 50% for 4/4851 node pairs for inf tree
pairwise
tree dist seq dist frac diff
0.2616 0.1281 0.5103 QA013-g-dm379-m8-MIG14146-1 QA013-g-dm379-m8-MIG14518-1
0.2484 0.1226 0.5065 QA013-g-dm379-m8-MIG14146-1 QA013-g-d765-m8-MIG4085-4
0.1878 0.0937 0.5013 QA013-g-dm379-m8-MIG1787-235 QA013-g-d765-m8-MIG4085-4
0.2399 0.1198 0.5007 QA013-g-dm351-m8-MIG2816-17 QA013-g-d765-m8-MIG4085-4
summary: using raxml drastically reduces the number of distances that are very different in mutation freq/hamming distance vs branch length, in both pairwise between nodes, and depth. So, success! thanks a lot.
And yeah, we never saw any indication that this problem was either giving crappy topologies, or that anything mackenzie or laura was doing was overly affected by the crappy branch lengths.
It recently came up in #155 and #146 that @lauranoges et al would be okay sticking with just a single tree building program, as long as the x-axis was "interpretable" for them. Erick pointed out that since any program we use is going to have substantial uncertainty, it's important we make progress on #130 (computing branch support). But the catch with #130 is that for dnapars, it means further developing around our already insane Phylip processing in order to parse out the extra trees and run them through a program for computing support. And the idea in #131 of using PRANK to do Ancestral State Reconstruction (ASR) opens up the possibility of picking up #149 and dropping dnaml/dnapars for PAUP. The only reason we put it down is that it didn't look like PAUP had a sane way of spitting out it's ASR, but if PRANK is doing this job, we could start building around its saner input/output model. Also, I have to tinker around with it a bit more, but it seems that it has options for doing support computation built in. And it seems we'd still be able to do both ML and parsimony. So that's looking like an attractive option to me.
While I feel #155 was definitely worth while, and want to maintain at least the general ability to run with different tree building tools, it would be nice if we're at least for our own uses steering towards a more defined direction, to minimize grunt work. So all I'm really asking here is: which tool do we really want to focus on? Pending investigation of the pertinent details around PAUP, I'd recommend we head in that general direction, but would like to open this up for discussion.