matsengrp / historydag

https://matsengrp.github.io/historydag
GNU General Public License v3.0

Ambiguous Protobuf Support #82

Closed willdumm closed 10 months ago

willdumm commented 11 months ago

The main purpose of this PR is ambiguous protobuf support. It adapts the existing protobuf reading and writing code to expect unique leaf identifier strings in the condensed_leaves field of node records in the protobuf. The protobuf reading code was rewritten from scratch, giving a significant speedup and much better organization. The relevant functions and methods are mutation_annotated_dag.load_MAD_protobuf and mutation_annotated_dag.CGHistoryDag.to_protobuf.

To allow optional faster loading of protobufs without reconstructing compact genomes on nodes, there is a new HistoryDag subclass, NodeIDHistoryDag, whose node labels are expected to contain only a node_id field. This is useful for comparing trees/DAGs topologically, but because it lacks mutation information it does not support writing to protobuf or computing parsimony scores.

Larch MAD protobufs may represent DAGs with ambiguous compact genomes (CGs) on leaves but unambiguous CGs on all other nodes. All edge mutations are unambiguous, so different pendant edges may imply different (unambiguous) CGs on the same leaf. Unless leaf CG information is provided explicitly to the protobuf loading method, each ambiguous leaf CG is inferred to be the least-ambiguous CG that does not conflict with any mutation on a pendant edge pointing to that leaf. As a result, different protobuf files produced from the same ambiguous alignment may not have matching leaves when loaded in Python. In this situation, unique leaf node IDs can be used to identify leaves, leaf CGs can be provided explicitly, or the protobufs can be merged with Larch before being loaded in Python.
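
To illustrate the inference rule described above, here is a minimal sketch in plain Python. It is not the historydag implementation: it works on full nucleotide strings rather than compact genomes, and the names IUPAC_CODES and infer_leaf_sequence are hypothetical. It assumes that "least ambiguous" means, at each site, the IUPAC code covering exactly the set of bases implied by the pendant edges of that leaf.

```python
# Hypothetical sketch; not part of historydag.
# Map each set of unambiguous bases to its IUPAC ambiguity code.
IUPAC_CODES = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def infer_leaf_sequence(implied_sequences):
    """Return the least-ambiguous sequence compatible with every
    unambiguous sequence implied by a pendant edge of one leaf."""
    if not implied_sequences:
        raise ValueError("at least one pendant edge is required")
    if len({len(seq) for seq in implied_sequences}) != 1:
        raise ValueError("implied sequences must have equal length")
    # At each site, collect the bases implied across all pendant edges
    # and replace them with the single code covering exactly that set.
    return "".join(
        IUPAC_CODES[frozenset(column)] for column in zip(*implied_sequences)
    )


# The same leaf as seen in two different protobuf files (toy data):
# the first file has two pendant edges into the leaf, the second has one.
leaf_in_file_1 = infer_leaf_sequence(["ACGT", "ACGA"])
leaf_in_file_2 = infer_leaf_sequence(["ACGT"])
print(leaf_in_file_1, leaf_in_file_2, leaf_in_file_1 == leaf_in_file_2)
```

In this toy example the inferred leaf differs between the two files because they contain different pendant edges for the same leaf, which is why identifying leaves by unique node ID, or supplying leaf CGs explicitly, may be needed.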

Many other improvements are included, motivated by all the testing that was needed for these additions:

Remaining Tasks: