veg / hyphy-analyses

HyPhy standalone analyses
MIT License
37 stars 17 forks source link

Option to disable rooting in ancestral reconstruction? #7

Open sdwfrost opened 4 years ago

sdwfrost commented 4 years ago

Hi @spond @stevenweaver et al

In the old batch files, there used to be an option to turn off rerooting of trees:

_DO_TREE_REBALANCE_ = 0;

I've tried adding this to the FitMG94.bf, but to no effect. What's the new way of taking a .seq file with FASTA sequences plus a rooted tree in order to get an ancestral reconstruction with the desired rooting?

stevenweaver commented 4 years ago

Dear @sdwfrost,

In the past, _DO_TREE_REBALANCE_ was typically used in conjunction with #include "queryTree.bf";.

I believe we should implement a method in ./res/TemplateBatchFiles/libv3/tasks/trees.bf for the sake of consistency, but in the meantime, you should be able to do the following to achieve the same effect within your batch file.

RerootTree(<your_tree>, 0);

where is either the Newick string or Tree object. Please let me know if this helps at all so that if there are further issues I can bump this issue up in scope and priority.

Best, Steven

sdwfrost commented 4 years ago

I'm just using this followed by this, and as I'm not familiar with the v3 lib, it's hard for me to see exactly where to put this in these files.

spond commented 4 years ago

Dear @sdwfrost,

You should be able to add the following line near the top of FitMG94.bf

utility.SetEnvVariable ("ACCEPT_ROOTED_TREES", TRUE);

Whether or not trees are automatically unrooted is still controlled by an environment variable.

Best, Sergei

sdwfrost commented 4 years ago

I tried adding this to FitMG94.bf as well as AncestralSequences.bf, but HyPhy is still rerooting the tree.

spond commented 4 years ago

Dear @sdwfrost,

This is not the behavior I am getting -- with utility.SetEnvVariable ("ACCEPT_ROOTED_TREES", TRUE); added to FitMG94.bf (it should not be necessary to also add it to AncestralSequences.bf), the rooted tree is carried through unchanged.

I added a command line flag to FitMG94.bf (--rooted) and also modified the test dataset (CD2.nex) to include a rooted tree that you can try it with commit 82b007d

The expected output is as follows

  1. --rooted No (default); note that the tree is not rooted
    
    ...
    ### Fitting Standard MG94
    * Log(L) = -3467.01, AIC-c =  6997.09 (31 estimated parameters)
    * non-synonymous/synonymous rate ratio =   0.9948 (95% profile CI   0.9071-  1.0884)

Synonymous tree

((((PIG:0.04981081805736993,COW:0.06412968281089403)Node5:0.02627408648833549,HORSE:0.05458276130527141,CAT:0.07072016608431618)Node4:0.01661169018634101,((RHMONKEY:0.0009609829254468343,BABOON:0.0004582378033667111)Node11:0.006703016455936003,(HUMAN:0,CHIMP:0.0004726577232010514)Node14:0.004623201761130269)Node10:0.02832759920079268)Node3:0.07339165061631374,RAT:0.01731388486140225,MOUSE:0.03103486689682247) ...


2. `--rooted Yes` (note the tree remains rooted and there's an extra estimated parameter)

...

Fitting Standard MG94

Synonymous tree

(MOUSE:0.0220160835262573,((((PIG:0.04976604141460345,COW:0.06403427583065359)Node5:0.0262653108223532,HORSE:0.05456900203613716,CAT:0.07075096071237615)Node4:0.01659722465552642,((RHMONKEY:0.0009594038325006256,BABOON:0.000457710608597962)Node11:0.006693042753276679,(HUMAN:0,CHIMP:0.0004720897835599658)Node14:0.004618809359029695)Node10:0.02830601267287156)Node3:0.07330581631788434,RAT:0.01729800416579532)Node2:0.009015110540486183) ...



Best,
Sergei 
gwct commented 2 years ago

Hi again, I am also trying to infer ancestral sequences with a rooted tree, however am finding that the --rooted Yes option does not do anything. In my simplest test, I give as input the tree ((a,b,)c);, and run the following command:

hyphy FitMG94.bf --alignment test.fa --tree test.tre --rooted Yes --save-fit test.fit --output test.json

I would expect the tree reported in the output json file to be rooted the same as the input, but it is unrooted:

"input":{
   "file name":"/home/gt156213e/bin/core/hyphy-interface/test/test.fa",
   "number of sequences":3,
   "number of sites":141,
   "partition count":1,
   "trees":{
     "0":"(c,a,b)"
    }
  },

I just updated hyphy to version 2.5.33 via bioconda (though hyphy --version still reports 2.5.32?), and I just pulled updates to the hyphy-analyses repo.

Separately, but with the same goal in mind, I tried adding ACCEPT_ROOTED_TREES = 1; to SLAC.bf, since we have to run SLAC anyways and it would be nice not to have to re-estimate ancestral sequences just to get them rooted. But I also found it had no effect on the trees (they were still unrooted by the program), though this admittedly has more room for error since SLAC is a bigger program.

Any help is appreciated! Thanks! -Gregg

spond commented 2 years ago

Dear @gwct,

The --rooted option works with an example 3-taxon file for me (see attached). Keep in mind that the relative lengths of the two branches coming off the "root" with --rooted Yes option are not going to be meaningful (they can't be estimated separately, only their sum can be under reversible models).

One thing to look out for is that if your alignment file has a tree in it, it will override the --tree argument.

hyphy FitMG94.bf --alignment CD2-3.txt --rooted Yes

....

### **Synonymous tree** 
(BABOON:0.007445763957431558,(HUMAN:0,CHIMP:0.000846008582569629)Node2:0.01374983065273461)

### **Non-synonymous tree** 
(BABOON:0.00843052183844512,(HUMAN:0,CHIMP:0.0009578995347746111)Node2:0.0155683484267728)
hyphy FitMG94.bf --alignment CD2-3.txt 

....

### **Synonymous tree** 
(HUMAN:0,CHIMP:0.0008459875213650508,BABOON:0.02119579279140764)

### **Non-synonymous tree** 
(HUMAN:0,CHIMP:0.0009569921406650095,BABOON:0.02397695782062061)

With SLAC, you should be able to do something like

hyphy slac --alignment CD2-3.txt --samples 0 ENV="ACCEPT_ROOTED_TREES=1;"
....

### Branches to include in the SLAC analysis
Selected 4 branches to include in SLAC calculations: `BABOON, HUMAN, CHIMP, Node2`

vs

hyphy slac --alignment CD2-3.txt --samples 0 
....

### Branches to include in the SLAC analysis
Selected 3 branches to include in SLAC calculations: `HUMAN, CHIMP, BABOON`

Best, Sergei

CD2-3.txt

gwct commented 2 years ago

Thanks, @spond. I can get the example you sent to run as a rooted tree. However, messing around a bit more with the 3-taxon trees, I noticed that it doesn't work if the input tree has no branch lengths and the outgroup is on the right side of the string.

For example, these four cases. In each case, the command I'm running is hyphy FitMG94.bf --alignment CD2-3.fa --tree CD2-3.tre --rooted Yes. I separated the tree and the alignment files, but the results are the same even if I use the Nexus file you sent.

  1. The tree you sent exactly (Baboon:0.003108,(Human:0.004349,Chimp:0.000799):0.011873);:
### **Synonymous tree**
(Baboon:0.007445763887323557,(Human:0,Chimp:0.0008460087618389288)Node2:0.01374983063201012)

### **Non-synonymous tree**
(Baboon:0.008430521830658774,(Human:0,Chimp:0.0009578997458883244)Node2:0.01556834853551739)

Rooted!

  1. The tree you sent without branch lengths (Baboon,(Human,Chimp));:
### **Synonymous tree**
(Baboon:0.007445763758481928,(Human:0,Chimp:0.0008460085168767602)Node2:0.01374983051084585)

### **Non-synonymous tree**
(Baboon:0.00843052165801737,(Human:0,Chimp:0.0009578994654875843)Node2:0.01556834834891241)

Same result, rooted.

  1. The tree you sent with branch lengths, but the outgroup on the right side of the tree string ((Human:0.004349,Chimp:0.000799):0.011873,Baboon:0.003108);
### **Synonymous tree**
((Human:0,Chimp:0.0008461983287669051)Node1:0.007331649715138724,Baboon:0.0138716222453892)

### **Non-synonymous tree**
((Human:0,Chimp:0.000957389843787852)Node1:0.008295037625177632,Baboon:0.01569437069670169)

Still good tree-wise, but I just noticed that the branch lengths for Baboon and the internal branch have been swapped relative to the above results. I guess this is consistent with what you said about the branches directly descending from the root being separately inestimable, just their sum (which is consistent here).

  1. The tree you sent without branch lengths and the outgroup on the right side of the tree string ((Human,Chimp),Baboon);:
### **Synonymous tree**
(Baboon:0.02123062128042735,Human:0,Chimp:0.0008466557760042213)

### **Non-synonymous tree**
(Baboon:0.02399164230926902,Human:0,Chimp:0.0009567625115001359)

Unrooted!

So my toy topology ((a,b,),c) fit the final case, which is why I saw an unrooted tree. This would be simple enough to avoid, but in my actual data I'm using a much larger tree (max 188 species) and my tests don't appear to follow case 4 above (I have branch lengths and the root clade is on the left of the tree string) and I'm still getting an unrooted tree in the results. But maybe looking into this simple case will address that as well.


SLAC, interestingly, appears to report a rooted tree in all 4 cases, but again in the final case (no branch lengths, outgroup on the right), the tree reported in the actual output file is unrooted. In all 4 cases the command I'm running is hyphy slac --alignment CD2-3.fa --tree CD2-3.tre --samples 0 ENV="ACCEPT_ROOTED_TREES=1;"

  1. The tree you sent exactly (Baboon:0.003108,(Human:0.004349,Chimp:0.000799):0.011873);. Both the output stream and the output file report a rooted tree:

Stream:

### Branches to include in the SLAC analysis
Selected 4 branches to include in SLAC calculations: `Baboon, Human, Chimp, Node2

File:

"input":{
   "file name":"/home/gt156213e/bin/core/hyphy-interface/test/CD2-3.fa",
   "number of sequences":3,
   "number of sites":187,
   "partition count":1,
   "trees":{
     "0":"(Baboon:0.003108,(Human:0.004349,Chimp:0.000799)Node2:0.011873)"
    }
  },
  1. The tree you sent without branch lengths `(Baboon,(Human,Chimp));. Both the output stream and the output file report a rooted tree:

Stream:

### Branches to include in the SLAC analysis
Selected 4 branches to include in SLAC calculations: `Baboon, Human, Chimp, Node2

File:

 "input":{
   "file name":"/home/gt156213e/bin/core/hyphy-interface/test/CD2-3.fa",
   "number of sequences":3,
   "number of sites":187,
   "partition count":1,
   "trees":{
     "0":"(Baboon,(Human,Chimp)Node2)"
    }
  },
  1. The tree you sent with branch lengths, but the outgroup on the right side of the tree string ((Human:0.004349,Chimp:0.000799):0.011873,Baboon:0.003108);. Both the output stream and the output file report a rooted tree:

Stream:

### Branches to include in the SLAC analysis
Selected 4 branches to include in SLAC calculations: `Human, Chimp, Node1, Baboon`

File:

"input":{
   "file name":"/home/gt156213e/bin/core/hyphy-interface/test/CD2-3.fa",
   "number of sequences":3,
   "number of sites":187,
   "partition count":1,
   "trees":{
     "0":"((Human:0.004349,Chimp:0.000799)Node1:0.011873,Baboon:0.003108)"
    }
  },
  1. The tree you sent without branch lengths and the outgroup on the right side of the tree string ((Human,Chimp),Baboon);. In this case, the output stream reports a rooted tree (with one internal node), but the output file still shows an unrooted tree!

Stream:

### Branches to include in the SLAC analysis
Selected 4 branches to include in SLAC calculations: `Human, Chimp, Node1, Baboon`

File:

"input":{
   "file name":"/home/gt156213e/bin/core/hyphy-interface/test/CD2-3.fa",
   "number of sequences":3,
   "number of sites":187,
   "partition count":1,
   "trees":{
     "0":"(Baboon,Human,Chimp)"
    }
  },

I'm not sure what to make of this. The "branch attributes" field in the output does have an internal node with results, but it is labeled as Node3 instead of Node1 as it is in the stream. So maybe it really is working on the rooted tree, but still reporting the unrooted one for some reason.

Sorry for the long post, and thanks again for all your help. -Gregg

spond commented 2 years ago

Dear @gwct,

Thanks for this detailed report! I can confirm the inconsistency and will track it down. I'll post a note here once a fix is available.

Best, Sergei

spond commented 2 years ago

Dear @gwct,

OK, so the issue is with the --kill-zero-lengths option (Yes by default). This option exists to accelerate fitting when there are zero internal branch lengths (collapsed internal branches, i.e. effective polytomies). If this occurs at the nucleotide fitting stage, HyPhy will simply delete those internal branches for codon model stages, creating hard polytomies. This speeds up model fitting, especially for large viral datasets (like SARS-CoV-2).

In the example I sent you, having the outgroup on the right creates this situation.

hyphy  FitMG94.bf --alignment CD2-3.txt --rooted Yes            
....
### **Synonymous tree** 
(BABOON:0.02119346424344714,HUMAN:0,CHIMP:0.0008457072089176565)

### **Non-synonymous tree** 
(BABOON:0.02400251851958738,HUMAN:0,CHIMP:0.0009578001364487133)
 hyphy FitMG94.bf --alignment CD2-3.txt --rooted Yes --kill-zero-lengths No
....
### **Synonymous tree** 
((HUMAN:0,CHIMP:0.0008460034661854671)Node1:0.00716115535599918,BABOON:0.0140471035947561)

### **Non-synonymous tree** 
((HUMAN:0,CHIMP:0.0009568843584813492)Node1:0.008099727510227859,BABOON:0.01588817806195902)

In the latter case the Node1 branch length is estimated >0, but remember that this is confounded with the BABOON branch length, so not stable.

Best, Sergei

gwct commented 2 years ago

Hi @spond, Great! Adding --kill-zero-lengths No seems to fix this issue, for both FitMG94.bf and SLAC. It works with the tree you sent as well as my larger tree. Thanks for your help again. -Gregg

spond commented 2 years ago

Dear @gwct,

Happy to help -- great questions.

Best, Sergei