matsengrp / larch

Inference and manipulation of history DAGs

ambiguous DAG IO #64

Closed marybarker closed 9 months ago

marybarker commented 1 year ago

Overview

Currently, we can load a DAG from a MAT protobuf and equip it with ambiguous compact genome data from a VCF file, as long as that VCF uses the same keys for leaf node names as the MAT protobuf does.

We would like to extend this ambiguous functionality to DAG protobufs (i.e. to read a DAG from a DAG protobuf and add VCF data to it), and also to store our ambiguous DAGs once they have been used in the larch-usher pipeline.

Change to DAG bookkeeping and RecomputeCompactGenomes

While not strictly necessary, it would help greatly if we could read and write ambiguous DAGs directly in the DAG protobuf format. We can avoid using VCF data in general if the DAG is equipped with fully disambiguated edge sequences. This means dropping the assumption that computing the compact genome (CG) for a specific leaf gives the same result regardless of which of its parents we choose.

Instead, we will visit each parent edge of the leaf node and, for each variant site, choose the least ambiguous base that matches all of the parents at that site. If a leaf has n parent edges, each edge i with mutation set m_i, then the compact genome will have a variant site for every mutation in any of the mutation sets. For a specific site j, the compact genome at that site will be j:X, where X is the ambiguity code that matches all of the parents' CGs at site j.

So we will change RecomputeCompactGenomes so that, when the bool recompute_leaves argument is set to true, each leaf CG is computed as the most specific ambiguity code compatible with all of its parents.
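To make the proposed leaf rule concrete, here is a minimal Python sketch (not larch code). It assumes that "compatible with all of its parents" means the IUPAC code whose base set is the union of the bases implied by each parent edge, that sites are 1-based, and that each parent edge has already been resolved into a `{site: code}` dictionary. The names `merge_codes` and `leaf_compact_genome` are hypothetical.

```python
# IUPAC nucleotide codes mapped to the concrete bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
CODE_TO_BASES = {code: frozenset(bases) for code, bases in IUPAC.items()}
BASES_TO_CODE = {bases: code for code, bases in CODE_TO_BASES.items()}


def merge_codes(codes):
    """Return the most specific IUPAC code that covers every code in `codes`.

    Assumption: "covers" means the union of the allowed base sets.
    """
    union = frozenset().union(*(CODE_TO_BASES[c] for c in codes))
    return BASES_TO_CODE[union]


def leaf_compact_genome(reference, parent_edge_genomes):
    """Combine the compact genomes implied by each parent edge of a leaf.

    reference: the reference sequence as a string.
    parent_edge_genomes: one {site: code} dict per parent edge (1-based
        sites); a site missing from a dict carries the reference base.
    Returns a {site: code} dict holding only sites that differ from the
    reference.
    """
    # Every site mutated along at least one parent edge is a candidate.
    sites = sorted(set().union(*parent_edge_genomes))
    result = {}
    for site in sites:
        ref_base = reference[site - 1]
        code = merge_codes(g.get(site, ref_base) for g in parent_edge_genomes)
        if code != ref_base:
            result[site] = code
    return result
```

For example, with reference `ACGT` and two parent edges implying `{2: "T"}` and `{2: "G", 4: "C"}`, site 2 becomes `K` (G or T) and site 4 becomes `Y` (C or T, since the first edge keeps the reference `T`).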

Changes to DAG loading

The DAG protobuf format has a field called "condensed leaves" that is currently unused, but which we could use to attach a SampleId to each leaf node. That way, we can add VCF data and match it to the SampleIds. As mentioned in the previous section, we will also drop the assumption that leaf sequences are well defined regardless of which parent edge is chosen to calculate the compact genome.
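The matching step could look roughly like the following Python sketch. It is illustrative only: the dictionaries stand in for whatever the protobuf and VCF readers actually produce, and `attach_vcf_genomes` is a hypothetical helper, not a larch function.

```python
def attach_vcf_genomes(leaf_sample_ids, vcf_genomes):
    """Match VCF-derived compact genomes to leaves by SampleId.

    leaf_sample_ids: {node_id: sample_id}, as recovered from the DAG
        (assumed here to come from the "condensed leaves" field).
    vcf_genomes: {sample_id: {site: code}}, one entry per VCF sample.
    Returns (matched, unmatched): matched maps node_id to its compact
    genome, unmatched is the sorted list of SampleIds absent from the VCF.
    """
    matched = {}
    unmatched = []
    for node_id, sample_id in leaf_sample_ids.items():
        if sample_id in vcf_genomes:
            matched[node_id] = vcf_genomes[sample_id]
        else:
            unmatched.append(sample_id)
    return matched, sorted(unmatched)
```

Reporting the unmatched SampleIds rather than failing silently makes it easier to spot a VCF whose keys do not line up with the DAG's leaves.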

Change to DAG storing

We will now add the leaf SampleIds to the protobuf, using the condensed leaves field in the protobuf.

Change to DAGToFasta

We will now allow ambiguous compact genomes on leaf sequences, so the DAGToFasta routine will output ambiguous sequences, calculated using the RecomputeCompactGenomes(recompute_leaves=true) method.
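As an illustration of the intended output, the following Python sketch expands ambiguous compact genomes against a reference and emits FASTA records. It assumes 1-based sites and a `{sample_id: {site: code}}` mapping; `compact_genome_to_sequence` and `to_fasta` are hypothetical names, and this is not the DAGToFasta implementation.

```python
def compact_genome_to_sequence(reference, compact_genome):
    """Apply a {site: code} compact genome (1-based sites) to the reference."""
    seq = list(reference)
    for site, code in compact_genome.items():
        seq[site - 1] = code
    return "".join(seq)


def to_fasta(reference, leaf_genomes, width=60):
    """Render {sample_id: compact_genome} as FASTA text.

    Ambiguity codes are written as-is, so a leaf with an ambiguous site
    yields a sequence containing IUPAC codes such as K or N.
    """
    lines = []
    for sample_id in sorted(leaf_genomes):
        seq = compact_genome_to_sequence(reference, leaf_genomes[sample_id])
        lines.append(f">{sample_id}")
        # Wrap the sequence at `width` characters per line.
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"
```

With reference `ACGT` and leaves `{"s1": {2: "K"}, "s2": {}}`, the records are `>s1` / `AKGT` and `>s2` / `ACGT`.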