Closed willdumm closed 2 years ago
@psathyrella This seems to be working well for me now. If you'd like, you could send me some of your ambiguous data to test with, or, if you'd like to try it yourself, here's a Dockerfile which installs the proper versions of everything.
I had to add a .txt extension so GitHub would let me upload it. Dockerfile.txt
Awesome! Thanks. I'll have a go but may not get to it til Monday.
Addresses #71, allowing ambiguous observed sequences to be used as gctree input.
First, while earlier versions of gctree may have accepted ambiguous sequences as input without crashing, I can now be quite sure that the inference result was, in general, not meaningful.
This is a challenge, since when given a set of ambiguous sequences, it's not even clear which correspond to the same "true" sequence.
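To make the difficulty concrete, here is a minimal sketch (not gctree code; `IUPAC` and `compatible` are hypothetical names used only for illustration) showing that compatibility between ambiguous sequences is not transitive. A sequence containing `N` can match two observed sequences that cannot match each other, so there is no unique way to group observations by "true" sequence.

```python
# Illustration only: hypothetical helpers, not part of gctree.
# Standard IUPAC nucleotide ambiguity codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def compatible(seq1: str, seq2: str) -> bool:
    """Return True if some unambiguous sequence is consistent with both inputs."""
    if len(seq1) != len(seq2):
        return False
    # Every site must share at least one allowed base.
    return all(set(IUPAC[a]) & set(IUPAC[b]) for a, b in zip(seq1, seq2))


observed = {"s1": "ACGT", "s2": "ACGA", "s3": "ACGN"}

print(compatible(observed["s3"], observed["s1"]))  # True
print(compatible(observed["s3"], observed["s2"]))  # True
print(compatible(observed["s1"], observed["s2"]))  # False
```

Here `s3` could be a second observation of either `s1` or `s2`, and nothing in the sequences alone says which.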
Here's the approach I've taken:
- `dnapars`
- `.original_ids` attribute containing the (deduplicated) sequence ids merged into that leaf. Abundances of merged leaves are summed. All tree modifications are mirrored on the original (ambiguous) tree, and finally the disambiguated leaf sequences are transplanted back into the ambiguous tree. We now have ambiguous dnapars trees whose leaves are unambiguous, and which each accept at least one MP disambiguation.
- `original_ids` attributes.

Other changes:
- `original_ids` instead of `name` in order to correctly accumulate all the observed isotypes corresponding to a node.

What's left to do:
- `phylip_pars.disambiguate` to accept a supplemental weight (e.g. mutability parsimony)
- `historydag` release which accommodates trees with different sets of leaf labels. This is done, but needs to be released.

Why not to trust leaf disambiguation:
- `dnapars` topology, one will be chosen arbitrarily, even though another may be more plausible IRL, or with respect to the ranking criteria that gctree uses.

Why to trust leaf disambiguation:
- `dnapars`), which takes into account all that is known about the sequences.
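For reviewers, the leaf-merging step from the approach above (leaves whose disambiguated sequences are identical are merged, abundances are summed, and the deduplicated sequence ids are kept in `original_ids`) can be sketched roughly as follows. This is a simplified illustration on a flat list of leaves, not the code in this PR: the `Leaf` class and `merge_identical_leaves` function are hypothetical, and the real change operates on tree nodes.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Leaf:
    """Hypothetical stand-in for a tree leaf; not a gctree class."""
    name: str
    sequence: str  # disambiguated sequence
    abundance: int
    original_ids: set = field(default_factory=set)


def merge_identical_leaves(leaves):
    """Merge leaves sharing a sequence, summing abundances and pooling ids."""
    groups = defaultdict(list)
    for leaf in leaves:
        groups[leaf.sequence].append(leaf)

    merged = []
    for sequence, group in groups.items():
        ids = set()
        for leaf in group:
            # A leaf that has never been merged contributes its own name.
            ids |= leaf.original_ids or {leaf.name}
        merged.append(
            Leaf(
                name=group[0].name,  # keep the first name as the label
                sequence=sequence,
                abundance=sum(leaf.abundance for leaf in group),
                original_ids=ids,
            )
        )
    return merged


leaves = [
    Leaf("seq1", "ACGT", abundance=2),
    Leaf("seq2", "ACGT", abundance=1),
    Leaf("seq3", "ACGA", abundance=5),
]
merged = merge_identical_leaves(leaves)
for leaf in merged:
    print(leaf.name, leaf.sequence, leaf.abundance, sorted(leaf.original_ids))
```

Keeping the ids in a set is what makes them deduplicated, and it is also why anything accumulated per observed sequence (such as isotypes) has to be looked up through `original_ids` rather than through the single surviving `name`.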