The main purpose of this PR is to support protobufs that contain ambiguous sequences. It adapts the existing protobuf reading and writing code to expect unique leaf identifier strings in the condensed_leaves field on node records in the protobuf. The protobuf reading code was rewritten from scratch, which makes it significantly faster and much better organized. The relevant methods/functions are mutation_annotated_dag.load_MAD_protobuf and mutation_annotated_dag.CGHistoryDag.to_protobuf. To allow optional faster loading of protobufs without reconstructing compact genomes on nodes, there is a new HistoryDag subclass called NodeIDHistoryDag, which expects only a node_id field in node labels. This is useful for comparing trees/DAGs topologically, but because it lacks mutation information it does not support writing to protobuf or computing parsimony scores.
Larch MAD protobufs may represent DAGs with ambiguous compact genomes (CGs) on leaves, but unambiguous CGs on all other nodes. All edge mutations are unambiguous, so different pendant edges may imply different (unambiguous) CGs on the same leaf. Unless leaf CG information is provided explicitly to the protobuf loading method, each ambiguous leaf CG is inferred to be the least-ambiguous CG that does not conflict with any mutation on a pendant edge pointing to that leaf. As a result, different protobuf files produced from the same ambiguous alignment may not have matching leaves when loaded in Python. In this situation, unique leaf node IDs can be used to identify leaves, leaf CGs can be provided explicitly, or the protobufs can be merged with Larch before being loaded in Python.
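The inference rule above can be illustrated with a minimal sketch. This is not the code in this PR: it works on full sequences rather than compact genomes, and the names IUPAC_CODES and infer_leaf_sequence are hypothetical. It assumes that "least-ambiguous" means taking, at each site, the IUPAC code for exactly the set of bases implied by the pendant edges.

```python
# Standard IUPAC nucleotide codes, keyed by the set of bases each one stands for.
IUPAC_CODES = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def infer_leaf_sequence(implied_sequences):
    """Return the least-ambiguous sequence compatible with every input.

    `implied_sequences` (hypothetical input format) holds one unambiguous
    sequence per pendant edge pointing at the same leaf.
    """
    if not implied_sequences:
        raise ValueError("at least one implied sequence is required")
    length = len(implied_sequences[0])
    if any(len(seq) != length for seq in implied_sequences):
        raise ValueError("implied sequences must all have the same length")
    # At each site, collect the bases seen across pendant edges and look up
    # the single code that stands for exactly that set.
    return "".join(
        IUPAC_CODES[frozenset(column)] for column in zip(*implied_sequences)
    )


# Two files built from the same alignment can contain different pendant edges
# for the same leaf, so the inferred leaf sequences differ.
leaf_in_file_1 = infer_leaf_sequence(["ACGT", "ATGT"])
leaf_in_file_2 = infer_leaf_sequence(["ACGT"])
```

Here leaf_in_file_1 has an ambiguity code at the second site while leaf_in_file_2 does not, which is why the two leaves would not match unless node IDs or explicit leaf CGs are used.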
Many other improvements are included, motivated by all the testing that was needed for these additions:
The dag.HistoryDag.relabel method was rewritten from scratch. The new version is cleaner and is guaranteed to work whenever the relabeling function is injective on leaves, as the docstring already claimed.
The method dag.HistoryDag.to_ascii was added to make it easy to print histories as ASCII art.
The function dag.ascii_compare_histories was added to make it easy to compare two histories, aligned side by side as ASCII art. It also works on DAG nodes, as long as they are in a DAG containing a single history. This is useful for comparing subtrees in a big tree.
parsimony_utils.AmbiguityMap now allows the user to specify the preferred character in the reverse map when multiple characters map to the same set of bases.
A missing check that node clades are pairwise disjoint was added to dag.HistoryDag._check_valid.
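As an illustration of the AmbiguityMap item in the list above, here is a minimal sketch of a reverse map with preferred characters. It does not use the parsimony_utils.AmbiguityMap API; the function build_reverse_map and its parameters are hypothetical and only show the idea of breaking ties when several characters stand for the same set of bases.

```python
def build_reverse_map(ambiguity_codes, preferred=()):
    """Map each set of bases back to a single character.

    `ambiguity_codes` maps a character to the string of bases it represents.
    When several characters represent the same set, the first one seen is
    kept unless a later one is listed in `preferred`. If more than one
    preferred character maps to the same set, the last one seen wins.
    """
    reverse = {}
    for char, bases in ambiguity_codes.items():
        key = frozenset(bases)
        if key not in reverse or char in preferred:
            reverse[key] = char
    return reverse


# "N", "-" and "?" all stand for "any base" in this example map.
codes = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "ACGT", "-": "ACGT", "?": "ACGT",
}
default_map = build_reverse_map(codes)
gap_map = build_reverse_map(codes, preferred=("-",))
```

Without a preference, the lookup for the full set of bases returns whichever character came first in the input; with preferred=("-",) it returns the gap character instead.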
Remaining Tasks:
Loading VCF files: I'll leave this until we actually need it. I suspect it will be easy but will add a new dependency. In the meantime, compact genomes can be loaded from a FASTA file using compact_genome.py:read_alignment. If no reference sequence is specified, the first record in the FASTA file is used as the reference.
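The reference fallback described above can be sketched as follows. This is not the signature or return type of read_alignment: the function compact_genomes_from_alignment, its (name, sequence) input format, and the layout of the result (1-based site mapped to a (reference base, new base) pair) are assumptions made for illustration only.

```python
def compact_genomes_from_alignment(records, reference=None):
    """Build a mutation dictionary for each aligned sequence.

    `records` is a list of (name, sequence) pairs. If `reference` is None,
    the first record's sequence is used as the reference.
    """
    records = list(records)
    if not records:
        raise ValueError("alignment contains no records")
    if reference is None:
        reference = records[0][1]
    result = {}
    for name, sequence in records:
        if len(sequence) != len(reference):
            raise ValueError(f"{name} is not aligned to the reference")
        # Keep only the sites that differ from the reference.
        result[name] = {
            site: (ref_base, base)
            for site, (ref_base, base) in enumerate(
                zip(reference, sequence), start=1
            )
            if base != ref_base
        }
    return result


alignment = [("ref", "ACGT"), ("s1", "ACGA"), ("s2", "TCGT")]
genomes = compact_genomes_from_alignment(alignment)
```

With the default reference, the first record has no mutations relative to itself, and each other record stores only its differing sites.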