jeetsukumaran / DendroPy

A Python library for phylogenetic scripting, simulation, data processing and manipulation.
https://pypi.org/project/DendroPy/.
BSD 3-Clause "New" or "Revised" License
207 stars 62 forks source link

`Tree.mrca` has side effect of changing rooting state #152

Closed mmore500 closed 5 months ago

mmore500 commented 1 year ago

Here is a minimum working example:

>>> import dendropy as dp
>>> tree = dp.Tree.get(data="((10,(((5,(11)),7,9))),((((1)))));", schema="newick")
>>> leaf_taxa = [leaf.taxon for leaf in tree.leaf_node_iter()]
>>> print("before tree.is_rooted", tree.is_rooted)
before tree.is_rooted None
>>> tree.mrca(taxa=leaf_taxa[:2])
<Node object at 0x7fd2df9bfee0: 'None' (None)>
>>> print("after tree.is_rooted", tree.is_rooted)
after tree.is_rooted False

Works on multiple versions of Python, including Python 3.9.16 (main, Dec 7 2022, 01:12:08) [GCC 11.3.0] on linux with dendropy 4.5.2.

In a more copy/pasteable form:

import dendropy as dp

tree = dp.Tree.get(
    data="((10,(((5,(11)),7,9))),((((1)))));",
    schema="newick",
)
leaf_taxa = [
    leaf.taxon for leaf in tree.leaf_node_iter()
]

print("before tree.is_rooted", tree.is_rooted)
tree.mrca(taxa=leaf_taxa[:2])
print("after tree.is_rooted", tree.is_rooted)

This side effect is can have very unexpected and perplexing consequences. Consider, for example

>>> import dendropy as dp
>>> tree = dp.Tree.get(data="((10,(((5,(11)),7,9))),((((1)))));", schema="newick")
>>> leaf_taxa = [leaf.taxon for leaf in tree.leaf_node_iter()]
>>> print("before", tree.as_string(schema='newick'))
before ((10,(((5,(11)),7,9))),((((1)))));

>>> tree.mrca(taxa=leaf_taxa[:2])
<Node object at 0x7f70837fbee0: 'None' (None)>
>>> print("after", tree.as_string(schema='newick'))
after [&U] (10,(((5,(11)),7,9)),((((1)))));

(Note the addition of the [&U] prefix after calling mrca.)

as far as I can tell what is happening is Tree.mrca

jeetsukumaran commented 1 year ago

Thanks for bringing this to our attention and this fix!

I appreciate the issue and the fix makes sense. I'll look into this in a bit, and please don't hesitate to send me a reminder if I dawdle.

One question I have is whether or not the '[&U]' is cosmetic or is the tree.is_rooted property set. If the latter, is it only changed if it is None, or is it otherwise unconditionally set?

Maybe not related: does it make sense to ask for an MRCA of an unrooted tree? Should we throw an error?

mmore500 commented 1 year ago

One question I have is whether or not the '[&U]' is cosmetic or is the tree.is_rooted property set. If the latter, is it only changed if it is None, or is it otherwise unconditionally set?

I did an additional experiment, and it looks like the rootedness is only changed to True from the None state. Here, I set the rootedness explicitly to False and it remained False after the mrca operation.

>>> import dendropy as dp
>>> tree = dp.Tree.get(
...     data="((10,(((5,(11)),7,9))),((((1)))));",
...     schema="newick",
... )
>>> leaf_taxa = [
...     leaf.taxon for leaf in tree.leaf_node_iter()
... ]
>>> print("first tree.is_rooted", tree.is_rooted)
first tree.is_rooted None
>>> tree.is_rooted = False
>>> print("second tree.is_rooted", tree.is_rooted)
second tree.is_rooted False
>>> tree.mrca(taxa=leaf_taxa[:2])
<Node object at 0x7f7e81d0cd60: 'None' (None)>
>>> print("after tree.is_rooted", tree.is_rooted)
after tree.is_rooted False
mmore500 commented 1 year ago

Maybe not related: does it make sense to ask for an MRCA of an unrooted tree? Should we throw an error?

Not sure :sweat_smile: I was kind of wondering that too

It looks like the existing behavior is to use seed_node implicitly as the root, even when is_rooted is False.

Calculating the MRCA between leaves 10 and 11 on this tree gives the root node R as the result. image

>>> import dendropy as dp
>>> tree = dp.Tree.get(
...     data="(10,(((5,(11)),7,9)),((((1)))))R;",
...     schema="newick",
... )
>>> leaf_taxa = [
...     leaf.taxon for leaf in tree.leaf_node_iter()
...     if leaf.taxon.label in ("10", "11")
... ]
>>> print(leaf_taxa)
[<Taxon 0x7efda9708ca0 '10'>, <Taxon 0x7efda9708f10 '11'>]
>>> print("first tree.is_rooted", tree.is_rooted)
first tree.is_rooted None
>>> tree.is_rooted = False
>>> print("second tree.is_rooted", tree.is_rooted)
second tree.is_rooted False
>>> print(tree.mrca(taxa=leaf_taxa[:2]))
<Node object at 0x7efda9708d90: 'R' (None)>
mmore500 commented 1 year ago

There could plausibly be some kind of argument to calculate the MRCA between two nodes in an unrooted tree by taking the shortest path between them (would be well-defined/unique) and selecting the node closest to the middle of this path. How the MRCA between more than two nodes might be interpreted in an unrooted tree isn't immediately obvious to me Plus, for either of these cases, I doubt there's a substantial intentional use case for MRCA in an unrooted tree anyways.

In the scenario I ran into, I was basically relying on the ambiguity of the None rootedness state to prevent introduction of the [&R] or [&U] markers into newick string output. This can certainly be accomplished in other ways.

Warning or throwing an error for the False and None could be a reasonable choice, especially from a strict/literal interpretation of MRCA.

mmore500 commented 1 year ago

I did a quick sanity check to see if BioPython had made any kind of decision here, and it looks like they're also just using the implicit root of an unrooted tree without a warning or error.

>>> from io import StringIO
>>> from Bio import Phylo
>>> tree = Phylo.read(StringIO("(10,(((5,(11)),7,9)),((((1)))))R;"), "newick")
>>> tree.rooted
False
>>> leaf1 = tree.get_terminals()[0]
>>> leaf1
Clade(name='10')
>>> leaf2 = tree.get_terminals()[1]
>>> leaf2
Clade(name='5')
>>> mrca = tree.common_ancestor(leaf1, leaf2)
>>> mrca
Clade(name='R')
jeetsukumaran commented 1 year ago

Ok, let's go ahead and merge this. Maybe a warning would be good thing to add though