Tree sequence metadata: should equivalence be apply to nodes rather than edges?

hyanwong commented 2 years ago

I have a Relate .anc file obtained by running Relate on the Kreitman dataset (see e.g. here and here for the input files). The result looks like this:

NUM_HAPLOTYPES 11
NUM_TREES 3
0: 14:(29696.15474 1.000 0 42) 14:(29696.15474 0.000 0 42) 11:(132099.38841 2.000 0 42) 11:(132099.38841 2.000 0 42) 19:(1097319.06730 0.000 0 2) 17:(337897.97775 0.000 0 2) 18:(749824.72825 0.000 0 8) 12:(15276.27278 0.000 0 42) 13:(22892.23989 0.000 0 42) 12:(15276.27278 0.000 0 42) 15:(312743.61877 6.000 0 42) 16:(21049.13806 0.000 0 2) 13:(7615.96710 0.000 0 42) 15:(289851.37888 4.000 0 42) 16:(123452.37172 0.000 0 2) 18:(437081.10949 4.000 0 8) 17:(184749.45129 0.000 0 2) 20:(1148478.52987 0.000 0 2) 19:(347494.33904 0.000 0 2) 20:(389057.44032 0.000 0 2) -1:(0.00000 0.000 0 2) 
2: 14:(68828.05415 1.000 0 42) 14:(68828.05415 0.000 0 42) 11:(67112.96300 2.000 0 42) 11:(67112.96300 2.000 0 42) 16:(132314.52530 0.000 2 8) 18:(715670.73924 0.000 2 8) 19:(707965.35335 0.000 0 8) 12:(7390.25936 0.000 0 42) 13:(35257.33297 0.000 0 42) 12:(7390.25936 0.000 0 42) 15:(339432.34186 6.000 0 42) 16:(65201.56230 0.000 2 8) 13:(27867.07362 0.000 0 42) 15:(304175.00888 4.000 0 42) 17:(614679.72859 0.000 2 8) 19:(368533.01149 4.000 0 8) 17:(551193.25744 1.000 2 8) 18:(32162.95650 0.000 2 8) 20:(179288.62387 0.000 2 8) 20:(186994.00976 0.000 2 8) -1:(0.00000 0.000 2 8) 
8: 14:(31386.04149 1.000 0 42) 14:(31386.04149 0.000 0 42) 11:(158435.40061 2.000 0 42) 11:(158435.40061 2.000 0 42) 19:(547044.95307 1.000 8 42) 17:(60103.18646 1.000 8 42) 17:(60103.18646 0.000 8 42) 12:(16681.07449 0.000 0 42) 13:(49260.60316 0.000 0 42) 12:(16681.07449 0.000 0 42) 15:(320265.19935 6.000 0 42) 16:(113529.62320 0.000 8 42) 13:(32579.52867 0.000 0 42) 15:(271004.59619 4.000 0 42) 16:(240578.98232 9.000 8 42) 19:(226779.75372 3.000 8 42) 18:(173167.40982 5.000 8 42) 18:(385029.24716 0.000 8 42) 20:(153338.87597 1.000 8 42) 20:(51426.35653 0.000 8 42) -1:(0.00000 0.000 8 42)

From the first entry in each line, I deduce that e.g. that in all trees (positions 0..42) sample 0 and sample 1 have the parent node 14, and samples 2 and 3 have the parent node 11. Thus we can conclude that nodes 11 and 14 are shared between all trees. The .anc and .mut files are attached to this issue.

Then I can run the following command:

relate_lib/bin/Convert --mode ConvertToTreeSequence \
    --anc Kreitman_SNP.anc.txt \
    --mut Kreitman_SNP.mut.txt \
    -o Kreitman_SNP

I was told by @a-ignatieva that the metadata in this tree sequence pointed out which nodes were equivalent in each tree, ~so I assume that the parent of node 0 in tree 0 should have metadata pointing to the parent of node 0 in tree 1, etc.~ - (edit: these show the edges above a node which are equivalent) The times are correct in each case, but the nodes are not marked as shared:

import tskit
ts = tskit.load("Kreitman_SNP.trees")

for tree in ts.trees():
    parent_of_0 = ts.node(tree.parent(0))
    assert ts.node(tree.parent(1)) == parent_of_0
    parent_of_2 = ts.node(tree.parent(2))
    assert ts.node(tree.parent(3)) == parent_of_2
    print("Tree", tree.index, "parent of 0", parent_of_0.id, "at time", parent_of_0.time, parent_of_0.metadata)
    print("Tree", tree.index, "parent of 2", parent_of_2.id, "at time", parent_of_2.time, parent_of_2.metadata)

gives

Tree 0 parent of 0 14 at time 29696.154296875 b'-1 -1'
Tree 0 parent of 2 11 at time 132099.390625 b'-1 -1'
Tree 1 parent of 0 24 at time 68828.0546875 b'-1 -1'
Tree 1 parent of 2 21 at time 67112.9609375 b'-1 -1'
Tree 2 parent of 0 34 at time 31386.041015625 b'-1 -1'
Tree 2 parent of 2 31 at time 158435.40625 b'-1 -1'

Instead, nodes 12 and 13 seem to be shared:

>>> print(ts.node(12))
Node(id=12, flags=0, time=15276.2724609375, population=-1, individual=-1, metadata=b'-1 22')
>>> print(ts.node(13))
Node(id=13, flags=0, time=22892.240234375, population=-1, individual=-1, metadata=b'-1 23')

Am I misinterpreting the metadata, or is there a bug in the output?

Kreitman_SNP.anc.txt Kreitman_SNP.mut.txt

hyanwong commented 2 years ago

@a-ignatieva tells me that "edges above sample nodes don't get labelled in the metadata". Is this right @leospeidel ?

hyanwong commented 2 years ago

I chatted to Ana - it seems like the equivalence is between the edges, so that if a node has metadata "-1 23", that means the edge above the node in that tree is equivalent, and both node 23 and the parent of 23 can be deemed equivalent nodes.

The problem with this notation is that we have no way of marking whether the edges above samples are equivalent or not.

I suggest that you might want to change the notation to node equivalence, so that the metadata denotes which nodes (rather than edges) are equivalent across local trees. That would allow nodes like 14 in this example to be marked as equivalent.

nspope commented 1 year ago

I think this should be addressed by #4, but I'm curious what you think of the strategy there @hyanwong. Nodes are deemed equivalent if they are in adjacent trees and have the same set of descendant samples. This definitely helps compress the ARG, but it's on the order of 1/5 the original size (so: still huge).

But, I guess there's a tradeoff between compression and capturing uncertainty in node ages. In that, although there's a lot of redundant information wrt to genetic polymorphisms with this compression strategy, the TMRCAs between samples will be "smoother" than they would be with more aggressive compression.

hyanwong commented 1 year ago

Oh, that's neat. I didn't know that you were working on this @nspope. I think this "shared samples" strategy is what Relate uses internally to match edges, so it seems right to me (although Leo will know better). I seem to remember that if you match nodes on shared samples, you can create cyclical ARG topologies sometimes, though? I can't seem to find an example, however. I would have thought you might hit it quite quickly on large inferred tree sequences.

nspope commented 1 year ago

That's an interesting point @hyanwong -- by cyclical topologies, do you mean that there's not always a path between all samples and a given root? I assume that'd be detected by tskit table sorting?

hyanwong commented 1 year ago

I mean that you can find a node that is a descendant of another node in one tree, but an ancestor of the same node in another tree. There's no way to rationalise such a topology, and it will always cause sorting to fail.

nspope commented 1 year ago

I see -- haven't run into that so far. I wonder if the "adjacent trees" requirement avoids that issue (as opposed to deeming nodes equivalent if they have the same descendant sample set, regardless of where they are in the tree sequence).

hyanwong commented 1 year ago

It may be that I was wrong, and that if you unify two nodes on the basis of their shared samples, you never get cycles.

hyanwong commented 1 year ago

I see -- haven't run into that so far. I wonder if the "adjacent trees" requirement avoids that issue (as opposed to deeming nodes equivalent if they have the same descendant sample set, regardless of where they are in the tree sequence).

Yes, I think there might be something in that. I guess one way would be to try to come up with simple example. You would have thought you would hit the problem soon enough if you run the convert routine on a large set of random inferences.

a-ignatieva commented 1 year ago

It was the anc/mut files in the relate_lib example folder that wouldn't work with your code I think @hyanwong (here)

nspope commented 1 year ago

Thanks for pointing that issue out @a-ignatieva -- that example works fine with what's in #4, so I don't think I'm hitting the cyclic issue there. But, I'll see if I can trigger via simulation.

hyanwong commented 1 year ago

It was the anc/mut files in the relate_lib example folder that wouldn't work with your code I think @hyanwong (here)

Ah, thanks @a-ignatieva . I must say, on thinking over the logic, I can't see how it's possible to get a cyclical ARG, even with non-adjacent tree node matching but often I find my intuition is not correct with this damn mathematical graph logic 😄 .

hyanwong commented 1 year ago

This definitely helps compress the ARG

Interesting, by the way, that you think of the relate tree sequence as an ARG... (not that's I'm disagreeing)

a-ignatieva commented 1 year ago

Maybe this is an example of how the cycles can arise when edges aren't matched exactly, but approximately? If clades 1, 2 and 5 are the right size (e.g. they each contain 10 samples), I think based on the correlation cut-off of 0.9 the approximate matching of Relate would assign edges B-C and A-D as equivalent. But I think this won't happen if you require that nodes have exactly the same descendants.

hyanwong commented 1 year ago

Thanks @a-ignatieva - really helpful. I agree that if there is approximate matching then you are very likely to get cycles. Does Relate do approximate matching by default when unifying edges?

a-ignatieva commented 1 year ago

I believe so!

hyanwong commented 1 year ago

That would explain why I hit the cycle problem, but @nspope isn't hitting it. My approach used the edge-matching information directly from Relate to construct the graph, whereas I suspect that @nspope is rolling his own, and requiring exactly the same set of descendants.

nspope commented 1 year ago

Yeah, this compression method is using exact matching of sets of descendant leaves to define equivalence. But, it's the same matching algorithm as what is being currently used to add "equivalence information" to edge metadata in the Convert utility, with a couple small modifications. I guess that this differs from what is done in Relate proper.

a-ignatieva commented 1 year ago

Ah yes, looks like DumpAsTreeSequenceTopoOnly uses exact matching, while DumpAsTreeSequence has a threshold of 0.95. I think that explains it though. In fact when I set this threshold to 1 and thus get exact matching, your code works without throwing the cycles error @hyanwong

Wonder if there's a way of doing this with approximate matching but without creating the cycles problem? I dunno how much more compression you can get that way.

nspope commented 1 year ago

Makes sense @a-ignatieva, if cycles could be detected on the fly then it'd be easy to use this as a criterion for inserting a new node id when building tables. I do wonder how much noise fuzzy matching would introduce into the node age constraint used in #4, though.

nspope commented 1 year ago

Interesting, by the way, that you think of the relate tree sequence as an ARG... (not that's I'm disagreeing)

Ah, that's me abusing terminology more than anything else @hyanwong. Though I do tend to think of tree sequences as ARGs even when they don't contain vertices for recombination events.

hyanwong commented 1 year ago

if cycles could be detected on the fly then it'd be easy to use this as a criterion for inserting a new node id when building tables.

I worry that we might hit weird order effects here. E.g. you could get a different graph when constructing left-to-right rather than right-to-left along the genome. Maybe that's OK, but it seems icky.

a-ignatieva commented 1 year ago

I'm struggling to picture an example where you get order issues, because the notion of equivalence is "symmetric" and so is the notion of "cycles" as you move from tree to tree - but I could be wrong!

hyanwong commented 1 year ago

I think the cycles could be revealed in a different order when going L->R than R->L, so new nodes would be created in a different order, which could break cycles in different ways. But as you say, we could test it.

In python I would do this by saving as a networkx graph as the tree sequence is built. In C, I guess you could roll your own "detect a cycle in a DAG" code (various examples online).

nspope commented 1 year ago

Would it suffice to ensure that there's monotonicity in cardinality of leaf sets for nodes shared across trees, I wonder? That is, if $\mathcal{D}^{(t)}_i$ is the set of descendant leaves for node $i$ in tree $t$, and we want to ensure that adding an edge $(i, j)$ in tree $t$ won't create a cycle, check that $|\mathcal{D}^{(k)}_i| \geq |\mathcal{D}^{(k)}_j|$ for $k = 1 \dots t-1$ (skipping trees where one or both nodes are absent). This would be quite easy to track on the fly, and (I think) would result in the same topology regardless of the direction in which trees are processed.

leospeidel / relate_lib

Tree sequence metadata: should equivalence be apply to nodes rather than edges? #3