matsengrp / gctree

GCtree: phylogenetic inference of genotype-collapsed trees
https://matsengrp.github.io/gctree
GNU General Public License v3.0

71 ambiguous input data #91

Closed willdumm closed 2 years ago

willdumm commented 2 years ago

Addresses #71, allowing ambiguous observed sequences to be used as gctree input.

First, while earlier versions of gctree may have accepted ambiguous sequences as input without crashing, I can now be quite sure that the inference result was, in general, not meaningful.

This is a challenge, since when given a set of ambiguous sequences, it's not even clear which correspond to the same "true" sequence.
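To make the difficulty concrete, here is a minimal sketch (not gctree code) showing that "could be the same true sequence" is not a transitive relation. It assumes standard IUPAC nucleotide ambiguity codes; the `compatible` helper is hypothetical and exists only for illustration.

```python
# IUPAC nucleotide ambiguity codes mapped to the bases each can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def compatible(seq1, seq2):
    """Return True if some unambiguous sequence is consistent with both inputs.

    Hypothetical helper for illustration only; not part of gctree.
    """
    if len(seq1) != len(seq2):
        return False
    # Every site must share at least one possible base.
    return all(set(IUPAC[a]) & set(IUPAC[b]) for a, b in zip(seq1, seq2))


# "AN" could be the same true sequence as "AC", and also as "AG" ...
print(compatible("AN", "AC"))  # True
print(compatible("AN", "AG"))  # True
# ... but "AC" and "AG" cannot be the same sequence as each other.
print(compatible("AC", "AG"))  # False
```

Because compatibility is not transitive, ambiguous sequences cannot simply be grouped into equivalence classes before tree inference.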

Here's the approach I've taken:

  1. Observed sequences are deduplicated as before. That is, two sequences are considered the same iff they are identical as strings.
  2. Deduplicated sequences are fed as-is (containing ambiguities) to dnapars.
  3. For each ambiguous tree output by dnapars, a single MP disambiguation is chosen arbitrarily. If multiple leaves end up with the same disambiguation, they are merged by collapse so that only one representative is left as a leaf. All leaves are given an original_ids attribute containing the (deduplicated) sequence ids merged into that leaf, and abundances of merged leaves are summed. All tree modifications are mirrored on the original (ambiguous) tree, and finally the disambiguated leaf sequences are transplanted back into the ambiguous tree. We now have ambiguous dnapars trees whose leaves are unambiguous, and each of which accepts at least one MP disambiguation.
  4. As before, we use the history DAG to disambiguate internal nodes in all possible ways. However, we now include node abundances in DAG leaf node labels, to differentiate between leaves on different trees which may have the same sequence but different merged abundances (as a result of step 3). Internal node labels all have abundance 0 so that collapsing works correctly.
  5. Once the history DAG is collapsed, all internal edges are between nodes with different sequences. Now we put correct abundances on leaf-adjacent nodes which have the same sequence as one of their leaf children. This allows dynamic programming methods that calculate likelihood, etc. to correctly predict which leaf-adjacent edges will be collapsed, and makes abundance visible to edges above observed internal nodes.
  6. Ranking occurs as before in the history DAG, and CollapsedTrees are extracted. CollapsedTree nodes retain original_ids attributes.
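As a rough illustration of steps 1, 3 and 4, here is a minimal sketch on plain dictionaries rather than trees. It is not the code in this PR: `deduplicate`, `merge_leaves` and `leaf_labels` are hypothetical names, and the per-leaf disambiguation is passed in by hand instead of being chosen from a dnapars tree.

```python
from collections import Counter, defaultdict


def deduplicate(observed):
    """Step 1 (sketch): exact string deduplication.

    observed: list of (seq_id, sequence) pairs.
    Returns {representative_id: (sequence, abundance)}.
    """
    abundance = Counter()
    representative = {}
    for seq_id, seq in observed:
        abundance[seq] += 1
        representative.setdefault(seq, seq_id)  # first id seen represents the group
    return {representative[s]: (s, n) for s, n in abundance.items()}


def merge_leaves(deduplicated, disambiguation):
    """Step 3 (sketch): merge leaves that share a disambiguated sequence.

    disambiguation: {seq_id: chosen unambiguous sequence} (supplied by hand here).
    Abundances are summed and the merged ids are recorded in original_ids.
    """
    merged = defaultdict(lambda: {"abundance": 0, "original_ids": []})
    for seq_id, (_, n) in deduplicated.items():
        leaf = merged[disambiguation[seq_id]]
        leaf["abundance"] += n
        leaf["original_ids"].append(seq_id)
    return dict(merged)


def leaf_labels(merged):
    """Step 4 (sketch): label leaves by (sequence, abundance), not sequence alone."""
    return {(seq, leaf["abundance"]) for seq, leaf in merged.items()}


observed = [("s1", "AN"), ("s2", "AN"), ("s3", "AC"), ("s4", "AG")]
dedup = deduplicate(observed)

# Two trees may disambiguate the ambiguous leaf s1 differently.
tree1 = merge_leaves(dedup, {"s1": "AC", "s3": "AC", "s4": "AG"})
tree2 = merge_leaves(dedup, {"s1": "AG", "s3": "AC", "s4": "AG"})

print(tree1["AC"])         # {'abundance': 3, 'original_ids': ['s1', 's3']}
print(leaf_labels(tree1))  # contains ('AC', 3) and ('AG', 1)
print(leaf_labels(tree2))  # contains ('AC', 1) and ('AG', 3)
```

In this toy example the leaf with sequence AC has abundance 3 in one tree and 1 in the other, which is why the sequence alone is not enough to identify a leaf across trees.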

Other changes:

What's left to do:

Why not to trust leaf disambiguation:

Why to trust leaf disambiguation:

willdumm commented 2 years ago

@psathyrella This seems to be working well for me now. If you'd like, you could send me some of your ambiguous data to test with. Or, if you'd like to try it yourself, here's a Dockerfile which installs the proper versions of everything.

I had to add a .txt extension so GitHub would let me upload it. Dockerfile.txt

psathyrella commented 2 years ago

Awesome! Thanks. I'll have a go but may not get to it til Monday.