jeetsukumaran / DendroPy

A Python library for phylogenetic scripting, simulation, data processing and manipulation.
https://pypi.org/project/DendroPy/.
BSD 3-Clause "New" or "Revised" License
210 stars 61 forks source link

reroot_at_edge() causes trifurcation and incorrect branch lengths #30

Closed donovan-h-parks closed 9 years ago

donovan-h-parks commented 9 years ago

The Tree.reroot_at_edge() method does not work as described. In version 4.0.2 and 4.0.3, it results in a trifurcation and does not respect the provided edge lengths:

newick_str = '(((A:0.5,B:0.5):0.5,C:0.5):0.5,(D:0.5,E:0.5):0.5);
tree = dendropy.Tree.get(data=newick_str, schema='newick')

mrca = tree.mrca(taxon_labels=('A', 'B'))
print 'mcra.edge_length', mrca.edge_length

tree.reroot_at_edge(mrca.edge, length1=0.5 * mrca.edge_length, length2=0.5 * mrca.edge_length, update_bipartitions=True)

for child in tree.seed_node.child_nodes():
    print 'child.edge_length', child.edge_length

tree.write_to_path('rooted.tree', schema='newick')

The output is:

mcra.edge_length 0.5
child.edge_length 0.5
child.edge_length 0.5
child.edge_length 1.0

The resulting tree is: [&R] ((A:0.5,B:0.5):0.5,C:0.5,(D:0.5,E:0.5):1.0);

Note that this tree incorrectly has a trifurcation and that the edge lengths to the root are not 0.25 as expected (i.e., half the edge length of the parent branch from the MCRA node).

jeetsukumaran commented 9 years ago

The problem here is that you are reading in an unrooted tree (there is no rooting token, and a rooting state is not specified when when reading the tree). Encoding the bipartitions on an unrooted tree creates a basal trifurcation. This is noted in the documentation in a number of places, including, for e.g., the primer sections on bipartitions and the primer section on tree manipulations. It has also been covered in depth in several discussions in this group. Please read the documentation closely, especially the primer. E.g.:

http://pythonhosted.org/DendroPy/primer/treemanips.html#setting-the-rooting-state

To get the behavior you expect, you need to read in the tree as rooted by at least one of the following means:

(1) Ensure that the tree string has the rooting token:

newick_str = '[&R] (((A:0.5 ...'

(2) Force a rooting state when reading the tree by passing in the rooting="force-rooted" keyword argument:

tree = dendropy.Tree.get(data=newick_str,
                              schema='newick',
                              rooting='force-rooted')

(3) Explicitly set the rooting state after the tree is read by before an operations or calculations on the tree:

     tree.is_rooted = True

Any one of these will do, but following code implements all three plus a sanity check:

     #! /usr/bin/env python
     import dendropy
     newick_str = '[&R](((A:0.5,B:0.5):0.5,C:0.5):0.5,(D:0.5,E:0.5):0.5);'
     tree = dendropy.Tree.get(
         data=newick_str,
         schema='newick',
         rooting='force-rooted',
         )
     tree.is_rooted = True
     assert tree.is_rooted
     mrca = tree.mrca(taxon_labels=('A', 'B'))
     print 'mcra.edge_length', mrca.edge_length
     tree.reroot_at_edge(mrca.edge, length1=0.5 * mrca.edge_length, 
length2=0.5 * mrca.edge_length, update_bipartitions=True)
     for child in tree.seed_node.child_nodes():
         print 'child.edge_length', child.edge_length
     tree.write_to_path('rooted.tree', schema='newick')

This results in:

     mcra.edge_length 0.5
     child.edge_length 0.25
     child.edge_length 0.25
     [&R] ((A:0.5,B:0.5):0.25,(C:0.5,(D:0.5,E:0.5):1.0):0.25);

On 9/4/15 2:59 AM, Donovan Parks wrote:

|newick_str = '(((A:0.5,B:0.5):0.5,C:0.5):0.5,(D:0.5,E:0.5):0.5); tree = dendropy.Tree.get(data=newick_str, schema='newick') mrca = tree.mrca(taxon_labels=('A', 'B')) print 'mcra.edge_length', mrca.edge_length tree.reroot_at_edge(mrca.edge, length1=0.5 mrca.edge_length, length2=0.5 \ mrca.edge_length, update_bipartitions=True) for child in tree.seed_node.child_nodes(): print 'child.edge_length', child.edge_length tree.write_to_path('rooted.tree', schema='newick')|


Jeet Sukumaran

jeetsukumaran@gmail.com

Blog/Personal Pages: http://jeetworks.org/ GitHub Repositories: http://github.com/jeetsukumaran Photographs (as stream): http://www.flickr.com/photos/jeetsukumaran/ Photographs (by galleries):

http://www.flickr.com/photos/jeetsukumaran/sets/

donovan-h-parks commented 9 years ago

These work, though it is not intuitive to me that a tree needs to be rooted before it can be rerooted with expected results. Rooting an unrooted tree is a well-defined operation.

jeetsukumaran commented 9 years ago

I. As this thread is a public document, in the interests of information accuracy ...

(1) It is not the act of rooting an unrooted tree that causes the collapse of the basal bifurcation. It is the fact that the bipartitions are calculated on an unrooted tree. This is documented.

(2) The reasons for collapsing the basal bifurcation when calculating the bipartitions on an unrooted tree are sound: if not, there would be two identical bipartitions on the same tree, making calculations and indeed any other operations on the tree undefined. This is also documented.

(3) If you want to avoid this collapsing of the basal bifurcation while calculating bipartitions on an unrooted tree, you may also do so. This option is also documented.

II. On a more personal note ...

One's intuition is conditioned on many factors, much of which is constrained by the personal history and immediate use-cases that one has personally encountered. As it happens, in this case, defaults to result in behavior conforming to your intuition would fail in some of the broader range of applications that the library functionality is put to use. In particular, maintaining the basal bifurcation when bipartitions are calculated on an unrooted tree would result in putting the machine in an unacceptable unpredictable state. I agree that the result of this is counter-intuitive. Or, at least, it was to me until I considered the implications of NOT collapsing that basal bifurcation; now doing it any other way makes no sense. In either case, counter-intuitive or not (or perhaps precisely because the only possible behavior that could be supported by default to ensure correct behavior was counter-intuitive), we took great pains to document this.

III. Related to this, on another personal note ...

The documentation is far from perfect, or even complete. We add to it and correct it all the time, in many cases based on feedback from users who found errors orgaps. If your expectations and intuitions have been misled by the documentation, please point it out, and we will be happy to fix it. If some issue or glitch or gotcha is missing from the documentation, we will be happy to fill in the gap. I am entirely open to it needing improvement and welcome suggestions (or better yet, pull requests) to improve it. However, I believe that, with regards to issues being discussed here, the documentation is at least adequate and clear and unambiguous, both from an operational perspective (what behavior to expect and what you need to do get the behavior you want) as well from a conceptual one (why the library behaves the way it does). If this is not the case, please let us know.

IV. On another personal note, but less directed toward yourself personally, as much as to other persons who have stumbled onto this page because they were trying to figure out why their tree suddenly has a basal trifurcation ...

The fact is that we have put an immense effort into documenting the library. Effort that not only equals the actual programming effort, but was, frankly, tedious and time-consuming and entirely unrewarding. We did not enjoy it. We did not go to bed each night thinking, "can't wait to get up tomorrow to start working on the 'bipartitions' section!" There were many other things we would have preferred to do, from actually coding to walking the dogs to even, after the third solid week of writing the documentation, doing the dishes or vacuuming under the couch. But we did it anyway. We did it because the documentation, while not necessarily a sacred obligation, is contract with the users that is as important part of the library as any core functionality. It provides a way to bridge across personal histories and idiosyncracies and expectations. It communicates how the library works to users. It is meant to highlight usage issues such as this, where some behavior might make sense from the big picture view of the design, but result in some glitches or gotchas in a narrow range os sub-application. In an ideal universe, neither the reading nor the writing of documentation would be necessary. Library API's would be structured purely on common intuition. This is impossible when the number of users exceeds 1. Or indeed, even if the number of users is equal to 1 but that user is not found in the set of developers. Hence, the need and importance of the documentation to both the developers and users. To communicate how the library works to the users.

So, please, please, please, please, please: read the documentation.

daveuu commented 6 years ago

I realise this is retreading old ground for the umpteenth time, but . . . .

Well, first off, the library is well documented especially around rooting state of trees (thanks!). This makes sense too:

It is not the act of rooting an unrooted tree that causes the collapse of the basal bifurcation. It is the fact that the bipartitions are calculated on an unrooted tree. This is documented.

But I do think this point has merit:

Rooting an unrooted tree is a well-defined operation

So I have two suggestions:

  1. if the Tree.reroot_at_edge() and similar methods really do return a rooted tree (as in a tree with a basal bifurcation), then would it not be intuitive for the method to also update the rooted state i.e., set Tree.is_rooted = True ? That way if/when Tree.update_bipartitions() is called it would only update bipartitions, which I think is more intuitive than implicitly changing the topology as well. I realise the documentation explains why the topology will change under the current behaviour, but IMHO it is not intuitive.

  2. if the rooting state of the tree should dictate how methods work e.g. the Tree.update_bipartitions() just mentioned, then Tree.reroot_at_edge() and similar methods should raise an exception if their Tree instance is not rooted to begin with. After all, rerooting implies the tree is already rooted and the root is being moved, which is a different operation to applying a root de novo to an unrooted tree.

I suspect either of the above would avoid confusion among users who haven't (recently) read the documentation.

Thanks for a great library.