matsengrp / historydag

https://matsengrp.github.io/historydag
GNU General Public License v3.0

Consider how to load in BEAST trees with compact descriptions of a history #42

Open matsen opened 1 year ago

matsen commented 1 year ago

Thanks to @jsigao we now understand that BEAST can output compact history descriptions that read like

[&rate=0.760250317223809,mutations=2.0,history_all={{13163,0.016625667248442385,A,G},{23756,0.020319042566174454,T,G}}]

We'd like to be able to load these and make a history DAG of them.
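For reference, each history_all entry above decomposes as {site, time, from_base, to_base}. A minimal sketch pulling those fields out of the raw annotation string (the regex and tuple layout are my assumptions from the example, not a documented BEAST spec):

```python
import re

annotation = (
    "[&rate=0.760250317223809,mutations=2.0,"
    "history_all={{13163,0.016625667248442385,A,G},"
    "{23756,0.020319042566174454,T,G}}]"
)

# Each mutation record is {site, time, from_base, to_base}.
mutation_re = re.compile(r"\{(\d+),([\d.]+),([ACGT]),([ACGT])\}")
mutations = [
    (int(site), float(time), frm, to)
    for site, time, frm, to in mutation_re.findall(annotation)
]
```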

Here is a zipped up file with the original .trees file from BEAST, and also a file that has those annotations converted to NHX like so:


import dendropy

dp_trees = dendropy.TreeList.get(path="clade_13.GTR.history.trees", schema="nexus")
nhx_string = dp_trees.as_string(schema="newick", suppress_annotations=False, annotations_as_nhx=True, suppress_rooting=True)

with open("clade_13.GTR.history.nhx", "w") as fp:
    fp.write(nhx_string)

I then went in by hand and deleted the first comment on each line, which ete3 didn't like.
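Instead of hand-editing, something like this could strip that first comment automatically (a sketch; I'm assuming the offending comment is a leading bracketed block like a [&R] rooting comment):

```python
import re

def strip_first_comment(line):
    # Drop only the first bracketed comment on each newick line;
    # later [&&NHX:...] annotations are left untouched.
    return re.sub(r"\[[^\]]*\]", "", line, count=1)

line = "[&R] ((A:1[&&NHX:rate=1.0],B:1):1);"
cleaned = strip_first_comment(line)
```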

However, I wasn't able to get these immediately loaded into ete3 using something like

import ete3

with open("clade_13.GTR.history.nhx", "r") as fp:
    ete_trees = [ete3.TreeNode(nhx_string, quoted_node_names=True) for nhx_string in fp]

So I'm afraid this isn't the tidiest issue, but hopefully it's a start.


Before diving into any particular approach, let's keep the big picture in mind. We want to be able to load these trees into the hDAG. We could

  1. have methods that parse a dendropy tree into a MAD (mutation-annotated DAG)
  2. have an external script that converts a dendropy tree into a MAD protobuf
  3. do what we're doing here: load an NHX parseable by ete3 and then turn those trees into a DAG, perhaps after writing to a protobuf?
willdumm commented 1 year ago

I just had a look, and noticed that the newicks exported by dendropy have commas in the extended newick fields (for example, in history_all):

[&&NHX:rate=1.0:mutations=7.0:history_all={{11029,0.03866966440113948,T,A}]

I've run into this before... Unfortunately, the ete newick parser treats all commas in a newick string as node separators, and has no concept of context for commas appearing in the extended newick field.

The workaround is to convert these fields to a string format that's compatible with ete before the dendropy export. For example, if the node attribute history_all started off as a set, ete would export this node annotation as:

[&&NHX:rate=1.0:mutations=7.0:history_all=11029|0.03866966440113948|T|A]

However, if you load the newick containing this annotation back into ete, the node attribute history_all would now contain the string "11029|0.03866966440113948|T|A".
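A sketch of that pre-export conversion as a plain string transform (the '|' within entries matches the ete output shown above; the '+' between entries is an arbitrary comma-free separator I chose for illustration):

```python
def make_ete_safe(history_all):
    """Rewrite '{{11029,0.038,T,A},{1,0.5,A,C}}' without commas or braces,
    so ete's newick parser won't split it at node boundaries."""
    entries = history_all.strip("{}").split("},{")
    # '|' joins fields within an entry, '+' joins entries.
    return "+".join(entry.replace(",", "|") for entry in entries)

safe = make_ete_safe("{{11029,0.03866966440113948,T,A}}")
```

Note the cost mentioned above still applies: after a round trip through ete, the attribute comes back as this flat string and must be re-parsed.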

I think the easiest thing would be to add dendropy tree loading to historydag.

matsen commented 1 year ago

K, thanks for looking into this. We can put it on ice for the time being.

willdumm commented 1 year ago

#47 provides a generalized dag.history_dag_from_trees function that can load dendropy trees from a list treelist, with something like:

dag = hdag.history_dag_from_trees(
    [tree.seed_node for tree in treelist],
    [],
    label_functions={'sequence': lambda node: node.annotations.get_value('history_all')},
    child_node_func=dendropy.Node.child_nodes,
    leaf_node_func=dendropy.Node.leaf_iter,
)

However, it looks to me like these are mutations relative to the parent node, not relative to a fixed reference? If so, we're going to need the reference sequence, and probably also the complete leaf sequences, in case they contain any ambiguities.

I'll look into this more next week.

matsen commented 1 year ago

GREAT!

These are indeed relative to parent.

willdumm commented 1 year ago

Here's a working script to load these trees:

import historydag as hdag
from functools import lru_cache
import dendropy

# dendropy doesn't parse nested lists correctly in metadata, so we load the
# trees with raw comment strings by setting `extract_comment_metadata=False`
dp_trees = dendropy.TreeList.get(
    path="clade_13.GTR.history.trees", schema="nexus", extract_comment_metadata=False
)

# This should be replaced with the true reference genome.
reference_genome = "A" * 30000
reference_cg = hdag.compact_genome.CompactGenome({}, reference_genome)

def comment_parser(node_comments):
    if len(node_comments) == 0:
        yield from ()
        return
    elif len(node_comments) == 1:
        comment_string = node_comments[0]
    else:
        raise ValueError("node_comments has more than one element: " + str(node_comments))
    if "history_all=" not in comment_string:
        yield from ()
        return
    else:
        mutations_string = comment_string.split("history_all=")[-1]
        stripped_mutations_list = mutations_string[2:-2]
        if stripped_mutations_list:
            mutations_list = stripped_mutations_list.split("},{")
            for mut in mutations_list:
                try:
                    idx_str, _, from_base, to_base = mut.split(",")
                except ValueError:
                    raise ValueError("comment_parser failed on: " + str(node_comments))
                assert to_base in 'AGCT'
                yield from_base + idx_str + to_base
        else:
            yield from ()
            return

_comment_parser_test_data = [
    (["&rate=1.0,mutations=0.0"], []),
    (["&rate=1.0,mutations=0.0,history_all={}"], []),
    (["&rate=1.0,mutations=0.0,history_all={{}}"], []),
    (["&rate=1.0,mutations=0.0,history_all={{12,1.1,A,G}}"], ["A12G"]),
    (
        ["&rate=1.0,mutations=0.0,history_all={{12,1.1,A,G},{1,0.99,A,C}}"],
        ["A12G", "A1C"],
    ),
]

for in_data, expected in _comment_parser_test_data:
    parsed_data = list(comment_parser(in_data))
    if parsed_data != expected:
        raise RuntimeError(
            f"comment_parser: unexpected output {parsed_data} from input: {in_data}. Expected {expected}"
        )

@lru_cache(maxsize=(2 * len(dp_trees[0].nodes())))
def compute_cg(node):
    if node.parent_node is None:
        # base case: node is a root node
        parent_cg_mut = reference_cg
    else:
        parent_cg_mut = compute_cg(node.parent_node)
    return parent_cg_mut.apply_muts(comment_parser(node.comments))

dag = hdag.history_dag_from_trees(
    [tree.seed_node for tree in dp_trees],
    [],
    label_functions={
        "compact_genome": compute_cg,
    },
    attr_func=lambda n: {
        "name": (n.taxon.label if n.is_leaf() else "internal"),
    },
    child_node_func=dendropy.Node.child_nodes,
    leaf_node_func=dendropy.Node.leaf_iter,
)

dag = hdag.mutation_annotated_dag.CGHistoryDag.from_history_dag(dag)
dag.summary()

The output from dag.summary:

<class 'historydag.mutation_annotated_dag.CGHistoryDag'>
Nodes:  315
Edges:  315
Histories:      3
Unique leaves in DAG:   159
Leaf node count range: 53 to 53
Smallest history has 105 nodes, largest history has 105 nodes
Average number of parents of internal nodes:    1.0
Parsimony score range 421 to 586

I'll open a dendropy issue to get correct parsing of these newicks figured out.

Note that just counting the mutations along all the branches of the beast trees gives wildly different (larger) parsimony scores, because there are lots of mutation reversions on single branches.
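A toy illustration of why those two counts diverge: a branch's recorded mutation history can revert itself, so the raw event count overstates that branch's parsimony cost (the helper below is mine, not part of historydag):

```python
def net_changes(start_seq, muts):
    """Apply ordered (site, from_base, to_base) events to start_seq and
    count sites that actually end up different."""
    seq = list(start_seq)
    for site, frm, to in muts:
        assert seq[site] == frm
        seq[site] = to
    return sum(a != b for a, b in zip(start_seq, seq))

# Site 0 mutates A->G and then reverts G->A within the same branch.
history = [(0, "A", "G"), (0, "G", "A"), (1, "C", "T")]
raw_count = len(history)                # 3 events recorded on the branch
parsimony = net_changes("AC", history)  # only 1 net difference
```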

willdumm commented 1 year ago

Issue opened https://github.com/jeetsukumaran/DendroPy/issues/145

willdumm commented 1 year ago

Here's a modified script that fixes the reference sequence to be the sequence associated with the leaf England/CAMC-F00E8F/2021|2021-01-07. However, right now this sequence is assumed to be all A's. Since this is certainly incorrect, reverted mutations on other branches are causing all of the other leaves to have different compact genomes on each of the trees. Having the true sequence for this reference leaf might fix that...

import historydag as hdag
from functools import lru_cache
import dendropy
import xml.etree.ElementTree as ET

# get alignment from xml:
_etree = ET.parse('clade_13.GTR.xml')
_alignment = _etree.getroot().find('alignment')
fasta = {a[0].attrib['idref'].strip(): a[0].tail.strip() for a in _alignment}

# dendropy doesn't parse nested lists correctly in metadata, so we load the
# trees with raw comment strings by setting `extract_comment_metadata=False`
dp_trees = dendropy.TreeList.get(
    path="clade_13.GTR.history.trees", schema="nexus", extract_comment_metadata=False
)

# This should be replaced with the true reference genome.
reference_leaf = 'England/CAMC-F00E8F/2021|2021-01-07'
reference_leaf_sequence = fasta[reference_leaf]
reference_leaf_cg = hdag.compact_genome.CompactGenome({}, reference_leaf_sequence)

# ### Comment parser and testing
def comment_parser(node_comments):
    if len(node_comments) == 0:
        yield from ()
        return
    elif len(node_comments) == 1:
        comment_string = node_comments[0]
    else:
        raise ValueError("node_comments has more than one element: " + str(node_comments))
    if "history_all=" not in comment_string:
        yield from ()
        return
    else:
        mutations_string = comment_string.split("history_all=")[-1]
        stripped_mutations_list = mutations_string[2:-2]
        if stripped_mutations_list:
            mutations_list = stripped_mutations_list.split("},{")
            for mut in mutations_list:
                try:
                    idx_str, _, from_base, to_base = mut.split(",")
                except ValueError:
                    raise ValueError("comment_parser failed on: " + str(node_comments))
                assert to_base in 'AGCT'
                yield from_base + idx_str + to_base
        else:
            yield from ()
            return

_comment_parser_test_data = [
    (["&rate=1.0,mutations=0.0"], []),
    (["&rate=1.0,mutations=0.0,history_all={}"], []),
    (["&rate=1.0,mutations=0.0,history_all={{}}"], []),
    (["&rate=1.0,mutations=0.0,history_all={{12,1.1,A,G}}"], ["A12G"]),
    (
        ["&rate=1.0,mutations=0.0,history_all={{12,1.1,A,G},{1,0.99,A,C}}"],
        ["A12G", "A1C"],
    ),
]

for in_data, expected in _comment_parser_test_data:
    parsed_data = list(comment_parser(in_data))
    if parsed_data != expected:
        raise RuntimeError(
            f"comment_parser: unexpected output {parsed_data} from input: {in_data}. Expected {expected}"
        )
# ### End comment parser testing

def compute_cgs(tree):
    ref_node = tree.find_node_with_taxon_label(reference_leaf)
    ref_cg = reference_leaf_cg
    while True:
        ref_cg = ref_cg.apply_muts(list(comment_parser(ref_node.comments)), reverse=True)
        ref_node = ref_node.parent_node
        if ref_node is None:
            break

    @lru_cache(maxsize=(2 * len(dp_trees[0].nodes())))
    def compute_cg(node):
        if node.parent_node is None:
            # base case: node is a root node
            parent_cg_mut = ref_cg
        else:
            parent_cg_mut = compute_cg(node.parent_node)
        return parent_cg_mut.apply_muts(comment_parser(node.comments))

    for node in tree.preorder_node_iter():
        node.cg = compute_cg(node)

for tree in dp_trees:
    compute_cgs(tree)

dag = hdag.history_dag_from_trees(
    [tree.seed_node for tree in dp_trees],
    [],
    label_functions={
        "compact_genome": lambda n: n.cg,
    },
    attr_func=lambda n: {
        "name": (n.taxon.label if n.is_leaf() else "internal"),
    },
    child_node_func=dendropy.Node.child_nodes,
    leaf_node_func=dendropy.Node.leaf_iter,
)

dag = hdag.mutation_annotated_dag.CGHistoryDag.from_history_dag(dag)
dag.summary()

willdumm commented 1 year ago

^ My last comment has now been edited to load the alignment from the BEAST XML file, and to use the appropriate sequence as the starting point for reconstructing the reference. More work needs to be done, though, since all of the leaves are ambiguous and are resolved differently in each tree. I haven't completely decided how to resolve this, but I'll come back to it tomorrow.

willdumm commented 1 year ago

I've included a generalized version of this, with fixes thanks to Jiansi, in the historydag.beast_loader submodule on the wd-load-beast-trees branch. This may still be a work in progress for a while, but it seems to work as expected at the moment, and correctly reconstructs trees' reference sequences.

In addition to using the most recent state of that branch, you must have dendropy installed to use the feature.

Example use, equivalent to the scripts above (with some additional fixes):

import historydag
dag = historydag.beast_loader.dag_from_beast_trees('clade_13.GTR.xml', 'clade_13.GTR.history.trees')
dag.summary()

Now the output looks like this:

<class 'historydag.mutation_annotated_dag.CGHistoryDag'>
Nodes:  281
Edges:  299
Histories:      3
Unique leaves in DAG:   133
Average number of parents of internal nodes:    1.027027027027027

In histories in the DAG:
Leaf node count range: 53 to 53
Total node count range: 105 to 105
Average pairwise RF distance:   177.33333333333334
Parsimony score range 328 to 474

So we still have some different realizations of the same leaf, but it's not as bad as before.

The key insight to correctly loading these trees was that BEAST ignores sites that are all N or ?, and the site indices of mutations are then relative to the sequence with those sites removed.

In addition, I remove sites that are all - (or some combination of only N, ?, and -), because strange mutations seem to happen at such sites, and we're treating - as N anyway.
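The coordinate mapping that insight implies can be sketched like this: drop the all-ambiguous columns, then translate a mutation's site index in the reduced alignment back to the original alignment (the toy columns and the 1-based convention for BEAST's reduced indices are my assumptions for illustration):

```python
def masked_to_original(alignment_columns):
    """Return the original index of each kept column, i.e. each column
    that is not made up entirely of N, ?, or - characters."""
    return [
        i for i, column in enumerate(alignment_columns)
        if not all(c in "N?-" for c in column)
    ]

# Columns 1 and 3 (0-based) are entirely ambiguous, so BEAST drops them.
columns = ["AA", "NN", "CT", "??", "GG"]
kept = masked_to_original(columns)
# A mutation reported at reduced site 2 (1-based) maps back via kept[2 - 1].
original_site = kept[2 - 1]
```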