matsengrp / cft

Clonal family tree
5 stars 3 forks source link

replace default tree building program (dnaml) with raxml-ng #170

Closed metasoarous closed 4 years ago

metasoarous commented 7 years ago

It recently came up in #155 and #146 that @lauranoges et al would be okay sticking with just a single tree building program, as long as the x-axis was "interpretable" for them. Erick pointed out that since any program we use is going to have substantial uncertainty, it's important we make progress on #130 (computing branch support). But the catch with #130 is that for dnapars, it means further developing around our already insane Phylip processing in order to parse out the extra trees and run them through a program for computing support. And the idea in #131 of using PRANK to do Ancestral State Reconstruction (ASR) opens up the possibility of picking up #149 and dropping dnaml/dnapars for PAUP. The only reason we put it down is that it didn't look like PAUP had a sane way of spitting out it's ASR, but if PRANK is doing this job, we could start building around its saner input/output model. Also, I have to tinker around with it a bit more, but it seems that it has options for doing support computation built in. And it seems we'd still be able to do both ML and parsimony. So that's looking like an attractive option to me.

While I feel #155 was definitely worth while, and want to maintain at least the general ability to run with different tree building tools, it would be nice if we're at least for our own uses steering towards a more defined direction, to minimize grunt work. So all I'm really asking here is: which tool do we really want to focus on? Pending investigation of the pertinent details around PAUP, I'd recommend we head in that general direction, but would like to open this up for discussion.

matsen commented 7 years ago

Thanks for thinking more broadly, @metasoarous .

I agree that things are a little outta hand. I had also thought about the PAUP/PRANK combination. I think that would be great, but I'd suggest doing some quick benchmarking first to make sure that PRANK is 🎯 . PRANK is primarily a multiple sequence alignment program, and has been rigorously benchmarked for that, but I'd guess less so for ASR.

My suggestion is that you just try out this combo and see if it basically works. If so, can you make it easy for @krdav to run some benchmarking? It doesn't have to be super extensive but I want to make sure it's not totally busted.

metasoarous commented 7 years ago

As mentioned elsewhere, we're putting this on ice till we resolve the issues with prank mentioned in #131.

matsen commented 7 years ago

The hot new ML tree program is called IQ-TREE, and apparently it does ASR: https://github.com/Cibiv/IQ-TREE/issues/9 . This would require testing first, given our experience with PRANK.

metasoarous commented 7 years ago

Good find! That would be great. We could also try the other program he mentions there if we have issues.

matsen commented 7 years ago

No parsimony, which I think used to be a criterion, but perhaps not now.

I was struck by how much extra code complexity phylip brings in. 😬

metasoarous commented 7 years ago

Yeah... it is major pain I wouldn't miss.

IQ-TREE's fast bootstrap approximation stuff seems nice as well.

krdav commented 7 years ago

I will be testing IQ-TREE, like I did with PRANK and other tools. Hopefully there is no funny results this time.

krdav commented 6 years ago

Update on IQ-TREE testing: I think we might have a better alternative to dnaml. Most importantly: there is no shuffling of ancestral nodes like with PRANK.

From the few tests I have by now it looks solid. The model option -m 000000 (JC) is a bit worse than dnaml default and -m 010010 (HKY) is a bit better. I actually thought that dnaml itself is running an HKY model, but I guess tree sampling is different. The best thing about IQ-TREE is the speed and ease of running it. It is much faster and it is easier to run and parse output from IQ-TREE compared to dnaml.

validaggreg_box.pdf

krdav commented 6 years ago

And just to elaborate on the performance of IQ-TREE: A 500 taxa tree only took 7 mins.

You might want to skip the FastTree step altogether?

metasoarous commented 6 years ago

Great work @krdav! Thanks for your report.

I think we'll still want to do a trimming step to keep the viz sane. Also, I'd be a little worried that throwing all the sequences into the reconstruction would yield a lot of noise given the error rates.

krdav commented 6 years ago

Sure, trimming is necessary either way, but in regards to the ancestral reconstruction I don't see any advantage of doing this after trimming vs. before. The error rate should be the same, though the number of errors might be less.

Hmm, well maybe ^ is wrong in these case where you are trimming away leafs there are obviously running wild.

metasoarous commented 6 years ago

Perhaps @lauranoges and the ecgtheow group could weigh in here? They've been looking into the effect of sampling on the reconstructions.

dunleavy005 commented 6 years ago

From all our experiments in ecgtheow, it is unclear the right amount of trimming needed (if at all). It had originally been for speed purposes on your end, I believe, and BEAST can't handle the large taxa families in reasonable amounts of time as well. However, in a few cases, we have seen that the tree (and ASR) can be particularly sensitive to rogue/spurious taxa that show up from pruning steps. I imagine that with more spurious taxa left un-pruned, the rogue taxa problem may have a higher chance of popping up for a given CF. Even then, CFT is doing ML point estimation, which accounts for less uncertainty than does BEAST so pruning seems like a safe step to keep, although I will mention that @lauranoges and I have not properly investigated ASR invariance w.r.t. pruning vs. no pruning in ecgtheow.

lauradoepker commented 6 years ago

It sounds like it would be reasonable to use IQ-TREE instead of fast tree and then use it again after pruning? Unless we don’t care enough pre-pruning to spend the time on it? (Remember we are creating hundreds of trees since we have so many seeds, not to mention the unseeded clusters for each patient sample). Hopefully I’m understanding correctly.

Either way, I’m eager to take a look at the IQ-TREE work but won’t be able to until back in town next week. Vienna is amazing....

Sent from my iPhone

On Nov 18, 2017, at 12:09 AM, Amrit Dhar notifications@github.com wrote:

From all our experiments in ecgtheow, it is unclear the right amount of trimming needed (if at all). It had originally been for speed purposes on your end, I believe, and BEAST can't handle the large taxa families in reasonable amounts of time as well. However, in a few cases, we have seen that the tree (and ASR) can be particularly sensitive to rogue/spurious taxa that show up from pruning steps. I imagine that with more spurious taxa left un-pruned, the rogue taxa problem may have a higher chance of popping up for a given CF. Even then, CFT is doing ML point estimation, which accounts for less uncertainty than does BEAST so pruning seems like a safe step to keep, although I will mention that @lauranoges and I have not investigated ASR invariance w.r.t. pruning vs. no pruning in ecgtheow.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

metasoarous commented 6 years ago

@lauranoges I guess if it only took a little longer, it might be worth it, but I'd expect FastTree to be quite a bit faster.

matsen commented 5 years ago

Currently we're looking at RAxML for topology and model parameters, then DNAML for ASR.

metasoarous commented 5 years ago

I wish I had a :frankensteins-monster: emoji...

eharkins commented 5 years ago

@matsen in comparing raxml-ng ASR to dnaml, trying to control all the variables we can like input tree, model parameters, etc, for a particular test case there were 7 ancestors (among 98) that dnaml reconstructed that raxml-ng did not reconstruct exactly. Comparing those 7 from dnaml to the closest (pairwise distance) raxml-ng reconstructed ancestor (hopefully this would correspond to the same ancestor in the tree) the maximum distance was 23 base pairs out of 336.

@afmagee says

I think at this point it's worth chatting with Erick, since (1) I'm out of ideas on how to confirm that this is exactly the right number but more importantly (2) if this is correct, then one-stop shopping in RAxML-ng may be possible and that would be nice for the pipeline

Thoughts on what should come next? Are there better ways to measure raxml-ng reconstructed ancestors in comparison to dnaml counterparts or should we go with this / repeat for other cases?

afmagee commented 5 years ago

It is worth noting that we're doing this comparison because RAxML-ng claims to do the correct ASR approach, hence we could in theory just use RAxML-ng for everything we want to do*. Thus we set out to test if it was comparable by doing what should be the exact same ASR in both RAxML-ng and dnaml (same tree, no optimizing anything so fixed substitution model and branch lengths).

Of course this comparison has proved a total pain. RAxML-ng and dnaml number internal nodes differently so we're having a hard time ensuring that they're doing the same thing. What @eharkins is presenting is the best-case scenario, and the only comparison I was able to think of without pouring over source code in both programs. We take all unique ancestral sequences from dnaml and RAxML-ng and assume that if we find an exactly matched pair, they're the same reconstruction of the same node. This leaves 7 mismatches, which we then paired off into the 7 most closely matched pairs, which had no more than 23 mismatches out of 336 sites (and as few as 1).

I can't explain why there are any differences. RAxML-ng and dnaml use slightly different measures to compute the 4 rate categories under +G, which could drive the differences (as they are relatively few). Other than that I'm stumped.

*In fact, this would be very nice. These dataset have a peculiar lack of transitions that dnaml cannot account for at all, as it assumes transitions happen at least as often as they should under F81. So the ASR under RAxML-ng with GTR could be better if it's doing real ASR and not shortcut-ASR.

matsen commented 5 years ago

Yes, this would be hugely helpful. Thank you for helping out with this @afmagee .

The ideal would be to run https://github.com/matsengrp/bcr-phylo-benchmark to see if the new RAx is as good as dnaml (IQTree wasn't).

The main thing this requires is to write a file like this to parse the ASR output. I don't know how the RAx ASR output comes out and if it's close to any of the formats that were tested in the paper.

eharkins commented 5 years ago

@matsen sounds good, I can do that. I think the output of raxml-ng may be unique for asr, but I will double check before I begin.

Regardless, the alternative we are testing against here is dnaml with optimized parameters being specified (transition / transversion ratio, the gamma rate parameter called alpha); doing so in bcr-phylo-benchmark would require extending code like https://github.com/matsengrp/bcr-phylo-benchmark/blob/master/bin/phylip_config.py to allow using the same tree, model, etc when comparing raxml-ng against dnaml for ASR only.

matsen commented 5 years ago

Well, hm, dnaml with optimized parameters being specified is yet another option, isn't it.

I would be happy to drop that option if we could show that RAx is equal to dnaml in Kristian's benchmarks.

Does that make sense?

eharkins commented 5 years ago

Yes, we were trying to control other variables to just compare ASR methods, but if at the end of the day raxml-ng proves better than dnaml in bcr-phylo-benchmark terms, then we should go with that.

If not, we have alleviated some branch length issues (see #7 for a discussion of the issue) by modifying the dnaml model with optimized parameters from raxml. So if raxml-ng is ranked lower than dnaml by benchmark tests, the modified dnaml method may be worth benchmarking against default dnaml.

matsen commented 5 years ago

@eharkins this morning I've been thinking about how for any real work we're doing we are using linearham, and that this work is really just for visual display. So does it merit such a substantial amount of work?

I'd like to know that it's not-crazy, and that's just about it. I was thinking that we could check in with the authors and ask about what sorts of benchmarking/validation they have done. Does that seem reasonable, @afmagee @lauradoepker ?

Sorry for the whiplash.

afmagee commented 5 years ago

@matsen The problem was identified by @psathyrella, who I believe uses CFT to get trees for local branching index calculations? So my understanding is that these trees are getting used and thus the pipeline needs to be fixed to get better branch lengths on that basis. I don't know beyond that how important it is to have good ASRs, but I do like the idea of not forcing Eli into doing more work than he's already put into this if we don't have to.

matsen commented 5 years ago

Yeah, the real ASR that @lauradoepker uses is done with linearham.

Kristian's repo is really testing ASR.

I'd trust the raxml folks to do a reasonable job with topology, even if their previous implementation of ASR was pretty wonky. Right?

If everyone agrees with these goals, what other straightforward checks should we use (if any) to make sure it's not fundamentally broken? (Shall we just email their listserv for evidence?)

Thanks, @afmagee !

psathyrella commented 5 years ago

That was indeed how I found the problem, but I was already convinced like a month ago to never trust the branch lengths any of these things give me. So, yeah, I think the yak is looking pretty scanty.

eharkins commented 5 years ago

The plan after discussion IRL with @matsen and @afmagee I think can be summarized by 🔪 ~dnaml~, 🌱 raxml-ng, barring any surprises. I will update as necessary.

eharkins commented 4 years ago

testing and cleaning up after making raxml-ng the default, it looks like 4 is the minimum number of sequences allowed by raxml-ng, not 3 (our current minimum in CFT). Ok to enforce 4 sequences min from here on out?

psathyrella commented 4 years ago

All my tree stuff has 10 as the default min, although I sometimes reduce that to 5. I really don't think there's very many scenarios we're likely to encounter where you can do something interesting with a 5 sequence tree.

eharkins commented 4 years ago

Sounds good, I will require 4 sequences to process a clonal family in CFT. cc @lauradoepker @mmshipley

This means such clonal families wont get run through CFT at all; if we want to change it so that it processes the cluster annotation into a fasta, does everything else up until tree-building, I can do that.

eharkins commented 4 years ago

Do we still use .svg plots of maximum likelihood trees made by CFT? e.g. Screen Shot 2020-01-08 at 5 45 33 PM

This seems like something made to be displayed by CFT web, and now that we have Olmsted I'd like to remove it if we are not using it.

If this is an important plot to have from CFT directly, then I will add this as part of the raxml-ng analysis since right now these svg plots only get created if dnaml is run.

eharkins commented 4 years ago

After checking in with @mmshipley, the above PR #302 introduces raxml-ng as the default and:

matsen commented 4 years ago

Can we do a few spot-checks to compare results? For example, on a few lineages of interest for @mmshipley and on the one very strange tree we were getting because of model-misspecification?

On Fri, Jan 10, 2020 at 4:00 PM Elias Harkins notifications@github.com wrote:

After checking in with @mmshipley https://github.com/mmshipley, the above PR #302 https://github.com/matsengrp/cft/pull/302 introduces raxml-ng as the default and:

  • enforces 4 seqs min/ clonal family
  • does not make svg plots

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/matsengrp/cft/issues/170?email_source=notifications&email_token=AAA3QRBQOUHTO56XPXUKKRLQ5ED2VA5CNFSM4DJF3R7KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEIVSBHI#issuecomment-573251741, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAA3QRCGRA7QY6DOOQXK73LQ5ED2VANCNFSM4DJF3R7A .

-- Frederick "Erick" Matsen, Associate Member Fred Hutchinson Cancer Research Center http://matsen.fredhutch.org/

eharkins commented 4 years ago

@matsen if I recall correctly the

very strange tree we were getting because of model-misspecification

were strange because they had really long branches. Running one of those same clonal families with raxml-ng gives a tree with less than half the total branch length of the dnaml tree, so it seems like it addresses that issue.

I ran also on one of @mmshipley's lineages of interest which can be seen on this local server. The first two clonal families in the table there should be the dnaml version of that family, then the raxml-ng. The latter looks reasonable compared to the former, but @matsen and @mmshipley should have a look to verify that.

matsen commented 4 years ago

Great!

@psathyrella would you want to look at the new trees?

@mmshipley the trees look fine to me, but perhaps you could take a look?

psathyrella commented 4 years ago

Well the original problem was that I had a check in my tree metric code that tries to determine if the trees are scaled correctly, since the lbi calculation is garbage if they aren't (since the decay length will be wrong). And, yes, it was getting tripped because it was getting trees where the annotation had like 10% mutation (depth 0.1) but the trees had depth two or three times that, and in some cases even greater than 1 (which was clearly spurious).

So the only way I can check is running my tree metric code on cft output, which I should do anyway to get mackenzie the tree metrics. Unfortunately my detailed check comparing shm to tree depth is too slow to turn on by default, but if the egregiuosly-too-long branches are still there, they'll still get caught by the hackey check. I'll also try to do some running with the better/slow check turned on.

mmshipley commented 4 years ago

To me the trees look fairly similar to one another. One thing I noticed is that in the RAxML tree, the seed sequence has a greater evolutionary distance from the inferred naive compared to the dnaml tree. I'm not sure whether this is expected based on differences in the two methods or not, but figured I would point it out.

Thank you Eli for running this comparison

eharkins commented 4 years ago

@psathyrella let me know if I can help by providing cft output (dnaml and raxml-ng) in both the cases of:

  1. the clonal family of interest for Mackenzie being discussed above
  2. the clonal families that originally triggered the

    hackey check.

That check is run by default in CFT and did not seem to get triggered by raxml-ng trees in the same way that it did for the dnaml trees for case 2.

psathyrella commented 4 years ago

ok, looks like it fixed it. I ran the detailed check ^ on the old and new trees from the family that originally caused the problem:

./bin/partis get-tree-metrics --outfname /fh/fast/matsen_e/processed-data/partis/qa013-synth/v19/partitions/QA013-g-d765-50k-isub-1/partition.yaml --treefname _output/cft-raxml-test/output/control-qa013-synth-v19/QA013/QA013-g-d765/unseeded--50k-isub-1-part-9/clust-0/min_adcl-dnaml/asr.nwk --cluster-indices 0 --tree-metric-fname lb.yaml
./bin/partis get-tree-metrics --outfname /fh/fast/matsen_e/processed-data/partis/qa013-synth/v19/partitions/QA013-g-d765-50k-isub-1/partition.yaml --treefname _output/cft-raxml-test/output/raxmlng-qa013-synth-v19/QA013/QA013-g-d765/unseeded--50k-isub-1-part-9/clust-0/min_adcl-raxml_ng/ASR.nwk --cluster-indices 0 --tree-metric-fname lb.yaml

old dnaml:

        warning tree depth and mfreq differ by more than 50% for 99/99 nodes for inf tree
    tree depth   mfreq    frac diff
      1.2060    0.2833     0.7651     52788-1
      0.3377    0.0944     0.7203     163401-1
      0.6154    0.1722     0.7201     57651-1
      0.7134    0.2028     0.7158     8076-1
      0.6922    0.2056     0.7030     105629-1
      0.6913    0.2056     0.7026     27879-1
      0.6720    0.2000     0.7024     78954-1
      0.5035    0.1528     0.6966     166797-1
      0.5756    0.1750     0.6960     131714-1
      0.4976    0.1528     0.6930     120292-1
      0.5157    0.1583     0.6930     67228-1
      0.5062    0.1556     0.6927     14414-1
      0.5419    0.1667     0.6925     41527-1
      0.5409    0.1667     0.6919     74261-1
      0.6216    0.1917     0.6917     131753-1
      0.5397    0.1667     0.6912     149545-1
      0.6164    0.1917     0.6891     65479-1
[...]
        warning pairwise distance from tree and sequence differ by more than 50% for 4851/4851 node pairs for inf tree
          pairwise
     tree dist  seq dist  frac diff
      0.0192     0.0028    0.8558    2205-2  132-21
      0.0190     0.0028    0.8540    58447-1  89435-1
      0.0378     0.0055    0.8536    79905-1  85299-1
      1.3986     0.3000    0.7855    74261-1  52788-1
      1.4288     0.3083    0.7842    151770-1  52788-1
      0.0254     0.0055    0.7823    1415-3  145702-1
      0.0254     0.0055    0.7821    2205-2  99073-1
      0.0254     0.0055    0.7820    58447-1  76047-1
      0.0254     0.0055    0.7815    58447-1  59-37
      0.0254     0.0055    0.7815    58447-1  132-21
      0.0253     0.0055    0.7815    58447-1  58-38
      0.0253     0.0055    0.7815    58447-1  154-18
      0.0253     0.0055    0.7811    58447-1  8231-1
      0.0253     0.0055    0.7809    58447-1  53-40
      0.0253     0.0055    0.7809    58447-1  116429-1
[...]

vs for raxml-ng:

        tree depth and mfreq differ by more than 50% for 0/99 nodes for inf tree
        pairwise distance from tree and sequence differ by more than 50% for 6/4851 node pairs for inf tree
          pairwise
     tree dist  seq dist  frac diff
      0.0088     0.0028    0.6862    515-7  143827-1
      0.0118     0.0055    0.5299    145702-1  1415-3
      0.0118     0.0055    0.5299    1415-3  2205-2
      0.0118     0.0055    0.5299    89435-1  131517-1
      0.0118     0.0055    0.5294    2216-2  79259-1
      0.0118     0.0055    0.5291    85299-1  79905-1

and for the QA013.2 seed family doing the same thing

dnaml:

        tree depth and mfreq differ by more than 50% for 0/99 nodes for inf tree
        warning pairwise distance from tree and sequence differ by more than 50% for 58/4851 node pairs for inf tree
          pairwise
     tree dist  seq dist  frac diff
      0.0106     0.0175    0.6503    QA013-g-d765-m8-MIG3496-12  QA013-g-dm379-m8-MIG6963-1
      0.0142     0.0233    0.6471    QA013-g-d765-m8-MIG3496-12  QA013-g-dm379-m8-MIG13283-1
      0.0124     0.0204    0.6458    QA013-g-dm379-m8-MIG8176-1  QA013-g-d765-m8-MIG3496-12
      0.0196     0.0321    0.6396    QA013-g-d765-m8-MIG3496-12  QA013-g-dm351-m8-MIG3367-12
      0.0161     0.0262    0.6318    QA013-g-d765-m8-MIG3496-12  QA013-g-dm379-m8-MIG5264-2
      0.0254     0.0414    0.6314    QA013-g-d385-m8-MIG3651-11  QA013-g-dm351-m8-MIG3892-9
      0.0235     0.0380    0.6196    QA013-g-dm379-m8-MIG12594-1  QA013-g-dm351-m8-MIG3367-12
      0.0181     0.0292    0.6181    QA013-g-dm379-m8-MIG12594-1  QA013-g-dm379-m8-MIG13283-1
      0.0289     0.0466    0.6141    QA013-g-d765-m8-MIG3496-12  QA013-g-d385-m8-MIG8713-1
      0.0163     0.0263    0.6135    QA013-g-dm379-m8-MIG8176-1  QA013-g-dm379-m8-MIG12594-1
      0.0145     0.0234    0.6121    QA013-g-dm379-m8-MIG12594-1  QA013-g-dm379-m8-MIG6963-1
      0.0272     0.0439    0.6095    QA013-g-dm379-m8-MIG14032-1  QA013-g-dm351-m8-MIG3367-12
      0.0200     0.0322    0.6090    QA013-g-dm379-m8-MIG12594-1  QA013-g-dm379-m8-MIG5264-2
      0.0218     0.0351    0.6058    QA013-g-dm379-m8-MIG14032-1  QA013-g-dm379-m8-MIG13283-1
      0.0055     0.0089    0.6050    QA013-g-dm379-m8-MIG2100-151  QA013-g-d385-m8-MIG3651-11
      0.0328     0.0526    0.6041    QA013-g-dm379-m8-MIG12594-1  QA013-g-d385-m8-MIG8713-1
      0.0201     0.0322    0.6010    QA013-g-dm379-m8-MIG8176-1  QA013-g-dm379-m8-MIG14032-1
      0.0238     0.0380    0.5991    QA013-g-dm379-m8-MIG14032-1  QA013-g-dm379-m8-MIG5264-2
      0.0183     0.0292    0.5987    QA013-g-dm379-m8-MIG14032-1  QA013-g-dm379-m8-MIG6963-1
      0.0256     0.0408    0.5963    QA013-g-dm379-m8-MIG13263-1  QA013-g-d765-m8-MIG3496-12
      0.0295     0.0468    0.5870    QA013-g-dm379-m8-MIG13263-1  QA013-g-dm379-m8-MIG12594-1
      0.0037     0.0058    0.5802    QA013-g-d765-m8-MIG3496-12  QA013-g-dm379-m8-MIG10581-1
[...]

vs for raxml-ng:

        tree depth and mfreq differ by more than 50% for 0/99 nodes for inf tree
        pairwise distance from tree and sequence differ by more than 50% for 4/4851 node pairs for inf tree
          pairwise
     tree dist  seq dist  frac diff
      0.2616     0.1281    0.5103    QA013-g-dm379-m8-MIG14146-1  QA013-g-dm379-m8-MIG14518-1
      0.2484     0.1226    0.5065    QA013-g-dm379-m8-MIG14146-1  QA013-g-d765-m8-MIG4085-4
      0.1878     0.0937    0.5013    QA013-g-dm379-m8-MIG1787-235  QA013-g-d765-m8-MIG4085-4
      0.2399     0.1198    0.5007    QA013-g-dm351-m8-MIG2816-17  QA013-g-d765-m8-MIG4085-4

summary: using raxml drastically reduces the number of distances that are very different in mutation freq/hamming distance vs branch length, in both pairwise between nodes, and depth. So, success! thanks a lot.

And yeah, we never saw any indication that this problem was either giving crappy topologies, or that anything mackenzie or laura was doing was overly affected by the crappy branch lengths.

eharkins commented 4 years ago

302 closes this.