niemasd / TreeSwift

TreeSwift: Fast tree module for Python 3
https://niema.net/TreeSwift
GNU General Public License v3.0
75 stars 14 forks source link

Node params relevant to branch lost in subtree after pruning because internal node is deleted #32

Closed pekarj closed 7 months ago

pekarj commented 7 months ago

Hey Niema, thanks for the continued updating of this software.

If I have a tree with branch X with descendant leaves A, B, and C and several mutations on branch X, and then I extract a subtree with just one of those three leaves, I'll lose the mutations corresponding to that branch because they would have been part of the node_params of an internal node. This will also happen if I extract with two of the three leaves but don't include the leaf descending from that internal node (since the relevant internal node will be deleted and then other internal node would be selected as its replacement, I think). I'm unsure whether this would be an easy fix and it can be straightforward to spot this issue with a small tree, but the behavior might pose a bigger problem with larger trees.

In the example below, if I extract the tree without leaf L4, I lose internal node I6 and therefore the mutations M5, M6, and M7.

tree = treeswift.read_tree_nexus(tree_path)['tree1']
exclude = ['L4']
tree_pruned = tree.extract_tree_without(exclude)

for n in tree.traverse_internal():
    if 'node_params' in dir(n):
        print(n, n.node_params)

for n in tree_pruned.traverse_internal():
    if 'node_params' in dir(n):
        print(n, n.node_params)

Tree attached.

tree_simple.nexus.gz

niemasd commented 7 months ago

Can you try running extract_tree_without with suppress_unifurcations=False? I have suppress_unifurcations=True as the default behavior because that's what most folks expect when they extract the pruned tree, but in this specific case, I think setting it to False to avoid suppressing unifurcations will preserve the mutations you're interested in. Specifically, given the file you attached, I ran the following:

tree = treeswift.read_tree_nexus(tree_path)['tree1']
exclude = ['L4']
# note the `suppress_unifurcations=False` argument I added
tree_pruned = tree.extract_tree_without(exclude, suppress_unifurcations=False)

print("=== ORIG ===")
for n in tree.traverse_internal():
    if 'node_params' in dir(n):
        print(n, n.node_params)

print()
print("=== PRUNE ===")
for n in tree_pruned.traverse_internal():
    if 'node_params' in dir(n):
        print(n, n.node_params)

and it prints the following (which I think should be correct based on your description):

=== ORIG ===
I4 &mutations="M1"
I5 &mutations="M2,M3,M4"
I6 &mutations="M5,M6,M7"
I10 &mutations="M8"
I11 &mutations="M10"
I12 &mutations="M11"
I13 &mutations="M17,M18,M19"
I14 &mutations="M20"

=== PRUNE ===
I4 &mutations="M1"
I5 &mutations="M2,M3,M4"
I6 &mutations="M5,M6,M7"
I10 &mutations="M8"
I11 &mutations="M10"
I12 &mutations="M11"
I13 &mutations="M17,M18,M19"
I14 &mutations="M20"

If this is indeed the case (that setting suppress_unifurcations=False produces the desired results), I can try adding a warning if suppress_unifurcations would result in the deletion of an internal node that has non-empty attributes to help programmers catch when this is happening. Thoughts?

pekarj commented 7 months ago

Yes, that fixes the issue. Thanks! I think adding the warning is fine for now (and maybe pointing out which node(s) with attributes was deleted); I'm not sure if this will be a frequent issue for people. In my case, I wanted to prune from 20 leaves to 3 and realized I needed to pick the "correct" 3 leaves to keep the mutations.

The easiest fix might be just traversing up until you get to a node with an edge length > 0 and then assigning the attributes to that node, but that would probably have its own concerns.

niemasd commented 7 months ago

Glad to hear that fixed the immediate issue!

Unfortunately, while your described approach might work for some scenarios, I think it might cause problems in others. I personally think keeping the unification in the tree would be the most appropriate if there's information in the internal nodes that you want to preserve, as you're otherwise losing information about when the mutations occurred, but others might not, so leaving this as something folks can implement on their own however works best for their specific problem is probably the right approach for the library as a whole

I'll add a warning if nodes with non-trivial info would be deleted (and add an argument to suppress the warning if desired), so hopefully that'll help avoid silently deleting critical info!

pekarj commented 7 months ago

That sounds good. Thanks, Niema!

niemasd commented 7 months ago

Warning added in TreeSwift v1.1.40 (commit https://github.com/niemasd/TreeSwift/commit/94a986231bb6f39fafe4cab1c39df2c54f6b2ece). Thanks, Jonathan!