Open hyanwong opened 2 years ago
@a-ignatieva tells me that "edges above sample nodes don't get labelled in the metadata". Is this right @leospeidel ?
I chatted to Ana - it seems like the equivalence is between the edges, so that if a node has metadata "-1 23", that means the edge above the node in that tree is equivalent, and both node 23 and the parent of 23 can be deemed equivalent nodes.
The problem with this notation is that we have no way of marking whether the edges above samples are equivalent or not.
I suggest that you might want to change the notation to node equivalence, so that the metadata denotes which nodes (rather than edges) are equivalent across local trees. That would allow nodes like 14 in this example to be marked as equivalent.
I think this should be addressed by #4, but I'm curious what you think of the strategy there @hyanwong. Nodes are deemed equivalent if they are in adjacent trees and have the same set of descendant samples. This definitely helps compress the ARG, but it's on the order of 1/5 the original size (so: still huge).
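For concreteness, the "same set of descendant samples in adjacent trees" rule could be sketched as below. This is only an illustration, not the code in #4: the child-to-parent dict representation and the helper names `descendant_samples` and `match_adjacent` are assumptions made for the example, and the trees are assumed to have no unary nodes.

```python
def descendant_samples(parent, samples):
    """Map each internal node to the frozenset of samples below it.

    `parent` is a dict child -> parent for one local tree (hypothetical
    representation; the root does not appear as a key).
    """
    below = {}
    for s in samples:
        node = s
        # Walk from the sample up to the root, recording s under each ancestor.
        while node in parent:
            node = parent[node]
            below.setdefault(node, set()).add(s)
    return {node: frozenset(ss) for node, ss in below.items()}


def match_adjacent(parent_left, parent_right, samples):
    """Return {node in right tree: node in left tree} for internal nodes
    whose descendant sample sets are exactly identical."""
    left = {ss: node for node, ss in descendant_samples(parent_left, samples).items()}
    right = descendant_samples(parent_right, samples)
    return {node: left[ss] for node, ss in right.items() if ss in left}


samples = [0, 1, 2, 3]
# Left tree:  ((0,1)4,(2,3)5)6
tree_a = {0: 4, 1: 4, 2: 5, 3: 5, 4: 6, 5: 6}
# Right tree: (((0,1)7,2)8,3)9
tree_b = {0: 7, 1: 7, 7: 8, 2: 8, 8: 9, 3: 9}

matches = match_adjacent(tree_a, tree_b, samples)
print(matches)  # nodes 7 and 9 match nodes 4 and 6; node 8 has no match
```

Here node 8 (samples 0, 1, 2) has no counterpart in the left tree, so it would stay a distinct node after compression.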
But I guess there's a tradeoff between compression and capturing uncertainty in node ages: although this compression strategy leaves a lot of redundant information with respect to genetic polymorphisms, the TMRCAs between samples will be "smoother" than they would be with more aggressive compression.
Oh, that's neat. I didn't know that you were working on this @nspope. I think this "shared samples" strategy is what Relate uses internally to match edges, so it seems right to me (although Leo will know better). I seem to remember that if you match nodes on shared samples, you can create cyclical ARG topologies sometimes, though? I can't seem to find an example, however. I would have thought you might hit it quite quickly on large inferred tree sequences.
That's an interesting point @hyanwong -- by cyclical topologies, do you mean that there's not always a path between all samples and a given root? I assume that'd be detected by tskit table sorting?
I mean that you can find a node that is a descendant of another node in one tree, but an ancestor of the same node in another tree. There's no way to rationalise such a topology, and it will always cause sorting to fail.
I see -- haven't run into that so far. I wonder if the "adjacent trees" requirement avoids that issue (as opposed to deeming nodes equivalent if they have the same descendant sample set, regardless of where they are in the tree sequence).
It may be that I was wrong, and that if you unify two nodes on the basis of their shared samples, you never get cycles.
> I see -- haven't run into that so far. I wonder if the "adjacent trees" requirement avoids that issue (as opposed to deeming nodes equivalent if they have the same descendant sample set, regardless of where they are in the tree sequence).
Yes, I think there might be something in that. I guess one way would be to try to come up with a simple example. You would have thought you would hit the problem soon enough if you ran the convert routine on a large set of random inferences.
It was the anc/mut files in the relate_lib example folder that wouldn't work with your code I think @hyanwong (here)
Thanks for pointing that issue out @a-ignatieva -- that example works fine with what's in #4, so I don't think I'm hitting the cyclic issue there. But, I'll see if I can trigger via simulation.
> It was the anc/mut files in the relate_lib example folder that wouldn't work with your code I think @hyanwong (here)
Ah, thanks @a-ignatieva. I must say, on thinking over the logic, I can't see how it's possible to get a cyclical ARG, even with non-adjacent tree node matching, but I often find my intuition is not correct with this damn mathematical graph logic 😄.
> This definitely helps compress the ARG

Interesting, by the way, that you think of the relate tree sequence as an ARG... (not that I'm disagreeing)
Maybe this is an example of how the cycles can arise when edges aren't matched exactly, but approximately? If clades 1, 2 and 5 are the right size (e.g. they each contain 10 samples), I think based on the correlation cut-off of 0.9 the approximate matching of Relate would assign edges B-C and A-D as equivalent. But I think this won't happen if you require that nodes have exactly the same descendants.
Thanks @a-ignatieva - really helpful. I agree that if there is approximate matching then you are very likely to get cycles. Does Relate do approximate matching by default when unifying edges?
I believe so!
That would explain why I hit the cycle problem, but @nspope isn't hitting it. My approach used the edge-matching information directly from Relate to construct the graph, whereas I suspect that @nspope is rolling his own, and requiring exactly the same set of descendants.
Yeah, this compression method is using exact matching of sets of descendant leaves to define equivalence. But it's the same matching algorithm as what is currently being used to add "equivalence information" to edge metadata in the Convert utility, with a couple of small modifications. I guess that this differs from what is done in Relate proper.
Ah yes, looks like DumpAsTreeSequenceTopoOnly uses exact matching, while DumpAsTreeSequence has a threshold of 0.95. I think that explains it though. In fact when I set this threshold to 1 and thus get exact matching, your code works without throwing the cycles error @hyanwong
Wonder if there's a way of doing this with approximate matching but without creating the cycles problem? I dunno how much more compression you can get that way.
Makes sense @a-ignatieva, if cycles could be detected on the fly then it'd be easy to use this as a criterion for inserting a new node id when building tables. I do wonder how much noise fuzzy matching would introduce into the node age constraint used in #4, though.
> Interesting, by the way, that you think of the relate tree sequence as an ARG... (not that I'm disagreeing)
Ah, that's me abusing terminology more than anything else @hyanwong. Though I do tend to think of tree sequences as ARGs even when they don't contain vertices for recombination events.
> if cycles could be detected on the fly then it'd be easy to use this as a criterion for inserting a new node id when building tables.
I worry that we might hit weird order effects here. E.g. you could get a different graph when constructing left-to-right rather than right-to-left along the genome. Maybe that's OK, but it seems icky.
I'm struggling to picture an example where you get order issues, because the notion of equivalence is "symmetric" and so is the notion of "cycles" as you move from tree to tree - but I could be wrong!
I think the cycles could be revealed in a different order when going L->R than R->L, so new nodes would be created in a different order, which could break cycles in different ways. But as you say, we could test it.
In python I would do this by saving as a networkx graph as the tree sequence is built. In C, I guess you could roll your own "detect a cycle in a DAG" code (various examples online).
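A minimal sketch of that on-the-fly check, using only the standard library rather than networkx so that it is self-contained. The `AncestryGraph` class and its method names are hypothetical, made up for this illustration. It keeps the union of child-to-parent edges over all trees seen so far and refuses any new edge that would make a node its own ancestor.

```python
class AncestryGraph:
    """Union of child -> parent edges across local trees (illustrative only)."""

    def __init__(self):
        self.parents = {}  # child -> set of parents seen in any tree so far

    def would_create_cycle(self, child, parent):
        # Adding child -> parent closes a cycle exactly when `child` is
        # already reachable by walking upward from `parent`.
        stack, seen = [parent], set()
        while stack:
            node = stack.pop()
            if node == child:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.parents.get(node, ()))
        return False

    def add_edge(self, child, parent):
        """Add the edge and return True, or return False if it would form a cycle."""
        if self.would_create_cycle(child, parent):
            return False
        self.parents.setdefault(child, set()).add(parent)
        return True


g = AncestryGraph()
ok_first = g.add_edge("B", "A")   # tree 1: B is below A
ok_second = g.add_edge("A", "C")  # tree 2: A is below C
ok_third = g.add_edge("C", "B")   # tree 3: C below B would close a cycle
print(ok_first, ok_second, ok_third)
```

When `add_edge` returns `False`, the caller could allocate a fresh node id instead of reusing the matched one, which is the criterion suggested earlier in the thread. Each check is a graph search, so this is linear in the number of stored edges in the worst case.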
Would it suffice to ensure that there's monotonicity in cardinality of leaf sets for nodes shared across trees, I wonder? That is, if $\mathcal{D}^{(t)}_i$ is the set of descendant leaves for node $i$ in tree $t$, and we want to ensure that adding an edge $(i, j)$ in tree $t$ won't create a cycle, check that $|\mathcal{D}^{(k)}_i| \geq |\mathcal{D}^{(k)}_j|$ for $k = 1 \dots t-1$ (skipping trees where one or both nodes are absent). This would be quite easy to track on the fly, and (I think) would result in the same topology regardless of the direction in which trees are processed.
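The cardinality check proposed above could be tracked along these lines. This is a sketch under stated assumptions, not a claim that the check is sufficient to rule out cycles: the edge $(i, j)$ is read as (parent, child), `leaf_counts_by_tree` is a hypothetical list holding one node-to-leaf-count dict per earlier tree, and `edge_is_monotone` is a name invented for the example.

```python
def edge_is_monotone(parent, child, leaf_counts_by_tree):
    """Check |D_parent| >= |D_child| in every earlier tree containing both nodes.

    `leaf_counts_by_tree` is a list of dicts (one per earlier tree) mapping
    node id -> number of descendant leaves; nodes absent from a tree are
    simply missing from that tree's dict, and such trees are skipped.
    """
    for counts in leaf_counts_by_tree:
        if parent in counts and child in counts:
            if counts[parent] < counts[child]:
                return False
    return True


# Two earlier trees: node 6 always subtends at least as many leaves as node 4.
earlier = [{4: 2, 6: 4}, {4: 3, 6: 4}]

forward = edge_is_monotone(6, 4, earlier)   # 6 as parent of 4
reverse = edge_is_monotone(4, 6, earlier)   # 4 as parent of 6
absent = edge_is_monotone(7, 4, earlier)    # node 7 not seen before
print(forward, reverse, absent)
```

Only the leaf counts need to be stored per node and tree, so the bookkeeping is cheap compared with a full reachability search.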
I have a Relate `.anc` file obtained by running Relate on the Kreitman dataset (see e.g. here and here for the input files). The result looks like this:

From the first entry in each line, I deduce that in all trees (positions 0..42) sample 0 and sample 1 have the parent node 14, and samples 2 and 3 have the parent node 11. Thus we can conclude that nodes 11 and 14 are shared between all trees. The `.anc` and `.mut` files are attached to this issue.

Then I can run the following command:
I was told by @a-ignatieva that the metadata in this tree sequence pointed out which nodes were equivalent in each tree, ~so I assume that the parent of node 0 in tree 0 should have metadata pointing to the parent of node 0 in tree 1, etc.~ - (edit: these show the edges above a node which are equivalent) The times are correct in each case, but the nodes are not marked as shared:
gives
Instead, nodes 12 and 13 seem to be shared:
Am I misinterpreting the metadata, or is there a bug in the output?
Kreitman_SNP.anc.txt Kreitman_SNP.mut.txt