matsengrp / cft

Clonal family tree
5 stars 3 forks source link

Review/fix rerooting procedure #202

Closed metasoarous closed 6 years ago

metasoarous commented 6 years ago

So... I discovered something a little disconcerting this weekend. I... I think we're okay here, but was more than a little startled and would like others to review. Somehow I'd never actually looked at the reroot_tree function we use in bin/process_asr.py:

def reroot_tree(tree, pattern='.*naive.*'):
    # find all nodes matching pattern
    node = find_node(tree, pattern)
    if tree != node:
        tree.remove_child(node)
        node.add_child(tree)
        tree.dist = node.dist
        node.dist = 0
        tree = node
    return tree

As it turns out, on my latest run of the data, this particular function was failing on the call to remove_child, saying 'child not found'. Once I started actually looking at what this function is doing, I realized it's a bit batshit. It's finding the naive node, removing it from the tree, and then placing it at the root! Which is... batshit! Generally anyway. I immediately became paralyzed with fear that all our trees have in fact been generally batshit. However, as I've poked around I've realized that the remove_child function only works if the node being removed is in fact a direct child of the root node of the tree, and other wise raises the error which initially started me down this line of investigation. So, it would appear to be the case that the only way this function can work is if in fact the node being removed (the naive node) is already a direct child of the root node, and thus is already an outgroup. In that case, it replaces the root with the naive node. So maybe we're fine?

What bothers me here is that this is the first time I've encountered this error. I've built thousands of trees at this point and my analysis above would have me believe this is the first one where the naive isn't a direct descendant of the root. Maybe this is possible, but it has me a little worried I'm missing something.

My inclination is to fix this function up to actually reroot the tree when necessary, but I'd like to get another set of eyes on this before deciding there hasn't been any damage done here. @matsen ?

matsen commented 6 years ago

However, as I've poked around I've realized that the remove_child function only works if the node being removed is in fact a direct child of the root node of the tree, and other wise raises the error which initially started me down this line of investigation. So, it would appear to be the case that the only way this function can work is if in fact the node being removed (the naive node) is already a direct child of the root node, and thus is already an outgroup.

If this is indeed the case then we seem like we're in fine shape, no? I assume you've checked the various calls to this function?

Re when it's necessary to reroot the tree, there was some technical reason for it that I'm sure @WSDeWitt can help with. But this should definitely be better documented.

wsdewitt commented 6 years ago

@metasoarous Since phylip is given the partis naive sequence as if were a taxon, our approach to rooting the trees on the naive sequences (which they literally are rooted on, biologically) is conceptually the following:

  1. set the naive as an outgroup in the phylip config
  2. for ASR purposes, we now rotate, or lift, the naive outgroup up so it's in the root position. Just imagine picking up the tree from the naive taxon, and letting everything else hang below it.

There is no ETE function for picking up trees by a given node and letting everything else shake about (now that would be some impressive guano). Instead, we temporarily detach the naive taxon, then reattach the original root to it. Note that this does not change the tree topology at all.

metasoarous commented 6 years ago

Thank you both. Your point (1) @WSDeWitt sufficiently clarifies things for me. It particular, it explains why I didn't see this error till this run through. I've had to change the way we handle long sequence names to deal with the phylip sequence name truncation problem. We're now creating new names from scratch, together with a translation file that we use to get the right sequence names back. Because I was renaming the naive sequence together with all the rest, dnaml must have been silently (or perhaps with a warning I didn't notice) ignoring the lack of a naive0 sequence in the sequence set. I've fixed this now and am rerunning (see #201).

I'm a little torn now about whether to include a step in reroot_tree that would set_outgroup when needed. It seems like the only case where it would be needed is when there's been an error upstream. Maybe it's better to let that bubble up? Or specifically check that naive is an outgroup to make it clearer what the expectation is there?