The following exploration uses the optimized DAG /fh/fast/matsen_e/mbarker/larch/luka_optimized_10_iterations.pb. That DAG was produced by optimizing /fh/fast/matsen_e/wdumm/luka_larch/Lukas_dag.pb, which was prepared in Python from a gctree parsimony forest on a gcreplay alignment.
@marybarker should confirm, but I suspect the larch command was something like
There's an issue with DAGs produced by Larch where nodes with the same child
clades and compact genome are given different node IDs, and it seems these
node IDs are being used to distinguish nodes (which they should not be).
Here's an exploration that illustrates the problem with the optimized hDAG referenced above:
>>> import historydag as hdag
>>> dag = hdag.mutation_annotated_dag.load_MAD_protobuf_file(
... "luka_optimized_10_iterations.pb",
... compact_genomes=True
... )
>>> cdag = dag.copy()
We'll just trim to MP trees so it's quicker to work with:
>>> cdag.trim_optimal_weight()
229
>>> cdag.summary()
<class 'historydag.mutation_annotated_dag.CGHistoryDag'>
Label fields: ('compact_genome', 'node_id')
Nodes: 2922
Edges: 31619
Histories: 1095840
Unique leaves in DAG: 120
Average number of parents of internal nodes: 4.898643825838687
In histories in the DAG:
Leaf node count range: 120 to 120
Total node count range: 162 to 180
Average pairwise RF distance: 37.334056849448594
Parsimony score range 229 to 229
This matches the information about the (trimmed) dag that Larch reported.
Now, to verify my claim that there are duplicate nodes assigned different
node_ids:
>>> from collections import defaultdict
>>> d = defaultdict(list)
>>> for n in cdag.preorder(skip_ua_node=True):
... d[(n.label.compact_genome, n.child_clades())].append(n)
...
>>> from collections import Counter
>>> Counter(len(ls) for _, ls in d.items())
Counter({1: 2907, 2: 6, 3: 1})
>>> false_duplicates = [ls for _, ls in d.items() if len(ls) > 1]
>>> len(false_duplicates)
7
>>> [[len(n.clade_union()) for n in nl] for nl in false_duplicates]
[[6, 6], [5, 5], [2, 2], [2, 2, 2], [2, 2], [2, 2], [2, 2]]
>>> [[n.label.node_id for n in nl] for nl in false_duplicates]
[['113', '42071'], ['129', '128526'], ['3852', '14186'],
['5051', '14460', '826'], ['999', '3771'], ['1019', '114159'],
['1109', '43398']]
So, the duplicate nodes in this case are relatively close to leaves, and there
aren't many of them.
Here's what happens when we get rid of node IDs. We can do this here because
there are no ambiguities in leaf sequences, and leaf sequences are unique.
If node IDs were being assigned correctly, we should end up with a new hDAG
with the same number of nodes, edges, and histories, but here we expect to end
up with 8 fewer nodes, since that's the number of duplicates we saw above:
>>> ccdag = cdag.remove_label_fields(["node_id"])
>>> ccdag.summary()
<class 'historydag.mutation_annotated_dag.CGHistoryDag'>
Label fields: ('compact_genome',)
Nodes: 2914
Edges: 31595
Histories: 967020
Unique leaves in DAG: 120
Average number of parents of internal nodes: 4.909090909090909
In histories in the DAG:
Leaf node count range: 120 to 120
Total node count range: 162 to 180
Average pairwise RF distance: 37.82642904635102
Parsimony score range 229 to 229
unique node IDs didn't allow collapsing before (I verified this, but left it
out here for clarity), but in their absence over 200
edges can be eliminated by collapsing, reducing the number of unique MP trees here to
12k from over 1M reported by Larch.
>>> ccdag.convert_to_collapsed()
>>> ccdag.summary()
<class 'historydag.mutation_annotated_dag.CGHistoryDag'>
Label fields: ('compact_genome',)
Nodes: 2857
Edges: 31353
Histories: 12385
Unique leaves in DAG: 120
Average number of parents of internal nodes: 4.96200219218122
In histories in the DAG:
Leaf node count range: 120 to 120
Total node count range: 162 to 174
Average pairwise RF distance: 39.23777424136683
Parsimony score range 229 to 229
In summary, I think there are two issues, the first quite a bit more serious
than the second:
First, when optimized trees, or perhaps moves, are being merged into the
hDAG by Larch, either node comparisons aren't working correctly (using
only LeafSets and compact genomes), so nodes are comparing unequal when in
fact they are equal, or node ids are being assigned before merging, and
they're erroneously being used in the node comparison.
Second, I don't remember if Larch attempts to do some kind of collapsing of
edges without mutations before or during merge, but if so that doesn't
seem to be working, since collapsing is possible after removing node IDs.
The following exploration uses the optimized DAG
/fh/fast/matsen_e/mbarker/larch/luka_optimized_10_iterations.pb
. That DAG was produced by optimizing/fh/fast/matsen_e/wdumm/luka_larch/Lukas_dag.pb
, which was prepared in Python from a gctree parsimony forest on a gcreplay alignment.@marybarker should confirm, but I suspect the larch command was something like
There's an issue with DAGs produced by Larch where nodes with the same child clades and compact genome are given different node IDs, and it seems these node IDs are being used to distinguish nodes (which they should not be).
Here's an exploration that illustrates the problem with the optimized hDAG referenced above:
We'll just trim to MP trees so it's quicker to work with:
This matches the information about the (trimmed) dag that Larch reported.
Now, to verify my claim that there are duplicate nodes assigned different node_ids:
So, the duplicate nodes in this case are relatively close to leaves, and there aren't many of them.
Here's what happens when we get rid of node IDs. We can do this here because there are no ambiguities in leaf sequences, and leaf sequences are unique.
If node IDs were being assigned correctly, we should end up with a new hDAG with the same number of nodes, edges, and histories, but here we expect to end up with 8 fewer nodes, since that's the number of duplicates we saw above:
unique node IDs didn't allow collapsing before (I verified this, but left it out here for clarity), but in their absence over 200 edges can be eliminated by collapsing, reducing the number of unique MP trees here to 12k from over 1M reported by Larch.
In summary, I think there are two issues, the first quite a bit more serious than the second: