jeetsukumaran / DendroPy

A Python library for phylogenetic scripting, simulation, data processing and manipulation.
https://pypi.org/project/DendroPy/.
BSD 3-Clause "New" or "Revised" License
210 stars 61 forks source link

mrca changes underlying tree? #51

Closed wwood closed 8 years ago

wwood commented 8 years ago
from dendropy import Tree

tr = Tree.get(schema='newick', data='((A,B):1,(C,(D,E):2):3);')
print str(tr)

lca = tr.mrca(taxon_labels=['A', 'B'])
print str(tr)

gives

((A,B):1.0,(C,(D,E):2.0):3.0)
((A,B):4.0,C,(D,E):2.0)

I've tried this using python2 & 3, as well as on 4.1.0 and development-master. Calling update_bipartitions() before mrca was no help either.

mrca isn't supposed to change the tree, right?

jeetsukumaran commented 8 years ago

Actually, both update_bipartitions() and mrca collapse the basal split in an unrooted tree. This is not just a common cause of confusion/irritation for many, but the non-intuitiveness of this tends to drive a fairly respectable portion of folks to very strong expressions of counter-opinion. It really is technically correct though, and the only way to handle bipartition hashes on unrooted trees to avoid the duplicate hash.

Immediate solution is to simply force the tree to be rooted, using [&R] or passing in the keyword rooting="force-rooted".

If you are dealing with trees that should be rooted, that's all there is to it.

If are dealing with unrooted trees, then it becomes a little trickier. IIRC, the PhylogeneticDistanceMatrix class (which has an mrca method) does not calculate bipartition hashes, so this avoids the basal split being collapsed on unrooted trees. (NOTE: depending on your application you may actually prefer to use the PhylogeneticDistanceMatrix class anyway, as this can be quite a bit more efficient than the naive mrca function if called multiple times).

Some day, I will write up a short piece with a few examples of why the basal split must be collapsed on an unrooted trees when calculating bipartition hashes. Just a few trivial examples will make it clear. In the mean time, some discussion of this can also be found here:

https://github.com/jeetsukumaran/DendroPy/issues/30#issuecomment-137763310

wwood commented 8 years ago

OK, thanks for the explanations. It does seem to be counter-intuitive that mrca would actually modify underlying tree - I expected it to be a read-only operation. Maybe a quick note in the mrca documentation is warranted?

I wish to treat the tree as rooted, so the fix is straight-forward.

Thanks for the quick reply.

jeetsukumaran commented 8 years ago

It does seem to be counter-intuitive that |mrca| would actually modify underlying tree - I expected it to be a read-only operation. Maybe a quick note in the |mrca| documentation is warranted?

Yep, should do so (i.e. update documentation warning about side-effects of mrca).

The reason it does is that it calculates the bipartition hashes to get to the MRCA's, which results in the collapsing. Note that, strictly-speaking, technically, if actually an unrooted tree, then tree is not really modified in terms of the relationships it is modeling. if the tree is unrooted, then the basal split that is collapsed does not actually exist. It is simply an artifact of the tree representation.

But, in your case, you are dealing with rooted trees (even though DendroPy thinks it is dealing with unrooted trees), so the modification done by the operation has actual and incorrect results.

wwood commented 8 years ago

ok. Is it possible force rerooting after parsing from the newick so that the tree is as if it was read in with rooting="force-rooted" ? I'm hoping to write a method that assumes a rooted tree has been given without actually requiring is_rooted=True.

jeetsukumaran commented 8 years ago

If you have not already tried it, when reading the tree just using the rooting-specification keyword argument:

trees = dendropy.get(
    path="foo.tre",
    schema="newick",
    rooting="force-rooted")

This would be the cleanest approach.

Other approaches, in increasing "dirtiness":

(1) Set the is_rooted property on each tree to True (2) insert the rooting statement token in front of the tree string, "[&R]" (3) hack the newick parser to assume all trees are rooted

wwood commented 8 years ago

alrighty, even (1) seems a little brittle, I'll try to stick with the parse-approach. Thanks for all the help.

jeetsukumaran commented 8 years ago

(1) is actually the "best-practice" approach. It is a formal part of the API and is wrapped up in tests.

jeetsukumaran commented 8 years ago

Sorry, not (1), but (0), i.e. the rooting="force-rooted" is the "best-practice"/recommend approach.

(1) is OK, too. Basically, (0) does (1) under the hood at parse time.