matsengrp / larch

Inference and manipulation of history DAGs
2 stars 2 forks source link

Optimized DAG contains nodes with identical LeafSets and CGs but different node IDs. #86

Open willdumm opened 1 month ago

willdumm commented 1 month ago

The following exploration uses the optimized DAG /fh/fast/matsen_e/mbarker/larch/luka_optimized_10_iterations.pb. That DAG was produced by optimizing /fh/fast/matsen_e/wdumm/luka_larch/Lukas_dag.pb, which was prepared in Python from a gctree parsimony forest on a gcreplay alignment.

@marybarker should confirm, but I suspect the larch command was something like

larch-usher -i Lukas_dag.pb \
    -o luka_optimized_10_iterations.pb \
    -c 10 \
    --sample-method rf-maxsum

There's an issue with DAGs produced by Larch where nodes with the same child clades and compact genome are given different node IDs, and it seems these node IDs are being used to distinguish nodes (which they should not be).

Here's an exploration that illustrates the problem with the optimized hDAG referenced above:

>>> import historydag as hdag
>>> dag = hdag.mutation_annotated_dag.load_MAD_protobuf_file(
...     "luka_optimized_10_iterations.pb",
...     compact_genomes=True
... )
>>> cdag = dag.copy()

We'll just trim to MP trees so it's quicker to work with:

>>> cdag.trim_optimal_weight()
229
>>> cdag.summary()
<class 'historydag.mutation_annotated_dag.CGHistoryDag'>
Label fields:   ('compact_genome', 'node_id')
Nodes:  2922
Edges:  31619
Histories:  1095840
Unique leaves in DAG:   120
Average number of parents of internal nodes:    4.898643825838687

In histories in the DAG:
Leaf node count range: 120 to 120
Total node count range: 162 to 180
Average pairwise RF distance:   37.334056849448594
Parsimony score range 229 to 229

This matches the information about the (trimmed) dag that Larch reported.

Now, to verify my claim that there are duplicate nodes assigned different node_ids:

>>> from collections import defaultdict
>>> d = defaultdict(list)
>>> for n in cdag.preorder(skip_ua_node=True):
...     d[(n.label.compact_genome, n.child_clades())].append(n)
... 
>>> from collections import Counter
>>> Counter(len(ls) for _, ls in d.items())
Counter({1: 2907, 2: 6, 3: 1})
>>> false_duplicates = [ls for _, ls in d.items() if len(ls) > 1]
>>> len(false_duplicates)
7
>>> [[len(n.clade_union()) for n in nl] for nl in false_duplicates]
[[6, 6], [5, 5], [2, 2], [2, 2, 2], [2, 2], [2, 2], [2, 2]]
>>> [[n.label.node_id for n in nl] for nl in false_duplicates]
[['113', '42071'], ['129', '128526'], ['3852', '14186'],
['5051', '14460', '826'], ['999', '3771'], ['1019', '114159'],
['1109', '43398']]

So, the duplicate nodes in this case are relatively close to leaves, and there aren't many of them.

Here's what happens when we get rid of node IDs. We can do this here because there are no ambiguities in leaf sequences, and leaf sequences are unique.

If node IDs were being assigned correctly, we should end up with a new hDAG with the same number of nodes, edges, and histories, but here we expect to end up with 8 fewer nodes, since that's the number of duplicates we saw above:

>>> ccdag = cdag.remove_label_fields(["node_id"])
>>> ccdag.summary()
<class 'historydag.mutation_annotated_dag.CGHistoryDag'>
Label fields:   ('compact_genome',)
Nodes:  2914
Edges:  31595
Histories:  967020
Unique leaves in DAG:   120
Average number of parents of internal nodes:    4.909090909090909

In histories in the DAG:
Leaf node count range: 120 to 120
Total node count range: 162 to 180
Average pairwise RF distance:   37.82642904635102
Parsimony score range 229 to 229

unique node IDs didn't allow collapsing before (I verified this, but left it out here for clarity), but in their absence over 200 edges can be eliminated by collapsing, reducing the number of unique MP trees here to 12k from over 1M reported by Larch.

>>> ccdag.convert_to_collapsed()
>>> ccdag.summary()
<class 'historydag.mutation_annotated_dag.CGHistoryDag'>
Label fields:   ('compact_genome',)
Nodes:  2857
Edges:  31353
Histories:  12385
Unique leaves in DAG:   120
Average number of parents of internal nodes:    4.96200219218122

In histories in the DAG:
Leaf node count range: 120 to 120
Total node count range: 162 to 174
Average pairwise RF distance:   39.23777424136683
Parsimony score range 229 to 229

In summary, I think there are two issues, the first quite a bit more serious than the second:

  1. First, when optimized trees, or perhaps moves, are being merged into the hDAG by Larch, either node comparisons aren't working correctly (using only LeafSets and compact genomes), so nodes are comparing unequal when in fact they are equal, or node ids are being assigned before merging, and they're erroneously being used in the node comparison.
  2. Second, I don't remember if Larch attempts to do some kind of collapsing of edges without mutations before or during merge, but if so that doesn't seem to be working, since collapsing is possible after removing node IDs.