Closed: gtsambos closed this issue 1 week ago
Hmm, this could be tricky, although it depends on how we order things. Ideally we'd squash "in place" as the segments are added to the structure.
Do you know if this is possible?
> Ideally we'd squash "in place" as the segments are added to the structure. Do you know if this is possible?
I was thinking about this too. The problem is that the segments aren't discovered in order, so we would have to be constantly re-sorting the segments (as this is required to squash them).
(To be more specific: the segments should be discovered in a sequence that is ordered by node ID, but not by left/right coordinates necessarily, and not by the sample pair they belong to)
If there's some kind of temporary list-like structure that we could use to hold the segments while we process all the segments with a given ancestral node, we could squash them in small batches before adding them to the final output? I don't know whether that's an even more complicated option, though.
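To make the idea concrete, here is a rough Python sketch of that batching scheme (the function names and the bare `(left, right)` tuples are hypothetical, not tskit's internals):

```python
# Sketch of squashing segments in per-ancestor batches before they reach
# the final output. All names here are hypothetical, not tskit's API.

def squash(batch):
    """Sort a batch of (left, right) segments and merge abutting ones."""
    merged = []
    for left, right in sorted(batch):
        if merged and merged[-1][1] == left:
            merged[-1] = (merged[-1][0], right)  # extend the previous segment
        else:
            merged.append((left, right))
    return merged

def flush_batch(output, pair, node, batch):
    """Squash one ancestor's batch, then append it to the final output."""
    for left, right in squash(batch):
        output.setdefault(pair, []).append((left, right, node))

output = {}
# Segments for ancestral node 4 arrive out of coordinate order:
flush_batch(output, (1, 2), 4, [(0.2, 1.0), (0.0, 0.2)])
print(output[(1, 2)])  # [(0.0, 1.0, 4)]
```

Each batch is small (only the segments under one ancestral node), so the repeated sorting stays cheap even though the full segment stream is not coordinate-ordered.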
Hmm. Would you mind putting in a concrete example here please (as in, the output of tskit's code)? I'm having trouble seeing exactly what needs to be done.
So, in this example from the docs,
(0, 1) [IdentitySegment(left=2.0, right=10.0, node=4), IdentitySegment(left=0.0, right=2.0, node=5)]
(0, 2) [IdentitySegment(left=2.0, right=10.0, node=3), IdentitySegment(left=0.0, right=2.0, node=5)]
(1, 2) [IdentitySegment(left=0.0, right=2.0, node=4), IdentitySegment(left=2.0, right=10.0, node=4)]
we'd squash the segments for (1, 2) into one?
> So, in this example from the docs, we'd squash the segment for (1,2) into one?
yup, that's right. Here's another example:
#
#        5           |
#       / \          |
#      /   4         |        4
#     /   / \        |       / \
#    /   /   \       |      /   \
#   /   /     \      |     3     \
#  /   /       \     |    / \     \
# 0   1         2    |   0   2     1
#                    |
#                   0.2
As currently coded, `ibd_segments` returns two segments for the pair (1, 2):
[IdentitySegment(left=0.0, right=0.2, node=4), IdentitySegment(left=0.2, right=1.0, node=4)]
because in the first tree there's an edge directly joining 2 to 4, but in the second there's a sequence of edges joining 2 -> 3 -> 4. Under this proposed change there'd be just one segment:
[IdentitySegment(left=0.0, right=1.0, node=4)]
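The squash operation itself is simple; here is a minimal sketch in plain Python, using `(left, right, node)` tuples in place of tskit's `IdentitySegment` objects (a sketch only, not tskit's implementation):

```python
def squash(segments):
    """Merge runs of abutting segments that share the same ancestral node.

    Assumes `segments` is already sorted by (node, left).
    """
    out = []
    for left, right, node in segments:
        if out and out[-1][2] == node and out[-1][1] == left:
            out[-1] = (out[-1][0], right, node)  # extend the previous run
        else:
            out.append((left, right, node))
    return out

# The two path-distinct segments for the pair (1, 2) above:
print(squash([(0.0, 0.2, 4), (0.2, 1.0, 4)]))  # [(0.0, 1.0, 4)]
```

Segments are only merged when both the node matches and the coordinates abut, so genuinely distinct segments are left untouched.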
That's helpful, thanks. Would you mind pasting in a larger example (no need for a drawing) that shows what the current sorting of segments is, and what the required output would be?
Sure, here's an example of some output I pulled out of a simulated tree sequence:
(4, 9): [IdentitySegment(left=6.0, right=10.0, node=37),
         IdentitySegment(left=0.0, right=2.0, node=42),
         IdentitySegment(left=5.0, right=6.0, node=42),
         IdentitySegment(left=2.0, right=5.0, node=42)]
See how it is sorted by node, but the segments with common ancestor 42 are not sorted by coordinates. The correctly sorted output would look like this:
(4, 9): [IdentitySegment(left=6.0, right=10.0, node=37),
         IdentitySegment(left=0.0, right=2.0, node=42),
         IdentitySegment(left=2.0, right=5.0, node=42),
         IdentitySegment(left=5.0, right=6.0, node=42)]
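In Python terms, the required ordering is just a sort keyed on `(node, left)` (tuples stand in for `IdentitySegment` here; this is a sketch, not tskit's code):

```python
# The (4, 9) segments as discovered: node-sorted, but not coordinate-sorted.
segments = [
    (6.0, 10.0, 37),
    (0.0, 2.0, 42),
    (5.0, 6.0, 42),
    (2.0, 5.0, 42),
]

# Re-sort by ancestral node, then by left coordinate within each node.
segments.sort(key=lambda seg: (seg[2], seg[0]))
print(segments)
# [(6.0, 10.0, 37), (0.0, 2.0, 42), (2.0, 5.0, 42), (5.0, 6.0, 42)]
```

Once sorted like this, the three node-42 segments abut and could be squashed into a single segment spanning (0.0, 6.0).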
Right. So, I guess to do this we'd need to update the in-memory data structure to use an AVL tree sorted by left coordinate, and then squash as we go.
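As a rough stand-in for an AVL tree, Python's `bisect` module can illustrate the squash-as-you-go idea with one sorted list of `(left, right)` pairs per ancestral node (a sketch only, not the C implementation):

```python
import bisect
from collections import defaultdict

# One list of (left, right) segments per ancestral node, kept sorted by
# left coordinate -- a plain-list stand-in for a sorted AVL tree.
segments_by_node = defaultdict(list)

def add_segment(node, left, right):
    """Insert a segment in sorted position, squashing with its neighbours."""
    segs = segments_by_node[node]
    i = bisect.bisect_left(segs, (left, right))
    # Squash with the predecessor if it abuts on the left...
    if i > 0 and segs[i - 1][1] == left:
        left = segs[i - 1][0]
        i -= 1
        segs.pop(i)
    # ...and with the successor if it abuts on the right.
    if i < len(segs) and segs[i][0] == right:
        right = segs[i][1]
        segs.pop(i)
    segs.insert(i, (left, right))

# The node-42 segments from the example above, in discovery order:
for l, r in [(0.0, 2.0), (5.0, 6.0), (2.0, 5.0)]:
    add_segment(42, l, r)
print(segments_by_node[42])  # [(0.0, 6.0)]
```

Note that the final `(2.0, 5.0)` insertion bridges its two neighbours, so all three segments collapse into one in a single step.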
This is a bit problematic because the numbers that we are currently reporting when we don't store segments won't be accurate. It'll be a pretty disruptive change I think, and will need careful thought on how to manage it.
I guess the simple approach as you say is to sort and squash afterwards, but then that leaves significant room for people to do the wrong thing. We'd prefer if the number of IBD segments was the "correct" thing by default.
> This is a bit problematic because the numbers that we are currently reporting when we don't store segments won't be accurate. It'll be a pretty disruptive change I think, and will need careful thought on how to manage it.
> I guess the simple approach as you say is to sort and squash afterwards, but then that leaves significant room for people to do the wrong thing. We'd prefer if the number of IBD segments was the "correct" thing by default.
Yeah, I can see it could be tricky to implement; it would change all of the benchmarking results I have now as well. It might also interact with the way we use the `length` argument. (edit: actually, I think we'd have to ask users not to use the `length` argument)
I'm not following the algorithmic details, but FWIW I totally agree with the original point.
In my latest changes to #2460 (pushed last week, but I'm only getting round to talking about it now -- sorry!!), I've added some Python code that sorts and squashes the IBD segments as they are added to the output object. The sorting procedure isn't (and can't be) done exactly as it would be in the C code, since there are no AVL trees in the Python code, but hopefully the squashing part looks relevant.
Some thoughts about consequences of this change:
- `min_span`: annoyingly, we can no longer filter the segments by their length until after all IBD segments from a given ancestral node have been calculated. (In the PR, I actually do it all at once, post hoc.) This is because the squashed segments may exceed `min_span` even if the raw unsquashed segments do not.
- Methods for computing the number of segments and other summary statistics: these may need to be changed to account for the fact that existing IBD segments may be squashed with segments that are discovered later.
- Computational efficiency: obviously, this change adds a bunch of new steps to the procedure. In the Python PR, these changes have a noticeable impact on the time needed to run some of the unit tests :( I comment on it here
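A toy illustration of the `min_span` point, using bare `(left, right)` tuples and hypothetical helper names (not tskit's API):

```python
def squash(segments):
    """Merge abutting (left, right) segments; input sorted by left."""
    out = []
    for left, right in segments:
        if out and out[-1][1] == left:
            out[-1] = (out[-1][0], right)
        else:
            out.append((left, right))
    return out

def filter_min_span(segments, min_span):
    """Keep only segments at least min_span long."""
    return [s for s in segments if s[1] - s[0] >= min_span]

# Two short per-path segments under the same ancestral node:
raw = [(0.0, 0.4), (0.4, 0.8)]

# Filtering before squashing discards everything:
print(squash(filter_min_span(raw, min_span=0.5)))  # []

# Filtering after squashing keeps the merged 0.8-long segment:
print(filter_min_span(squash(raw), min_span=0.5))  # [(0.0, 0.8)]
```

This is why, in the sketch above, the length filter can only be applied once all segments for a given ancestral node have been squashed.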
This point about length filtering seems like a big problem - doesn't that mean that we have to keep track of all IBD segments, basically forever, even with `min_length`? And, that's bad, since there are too many of them? Perhaps we could do something hack-ey like filtering segments below `min_length / 5`. Or, just apply `min_length` to the per-path segments, and still return them squashed?
> doesn't that mean that we have to keep track of all IBD segments, basically forever, even with `min_length`?
It's definitely a pain. What we could do is
This is a more involved change than what I've currently put in the PR, but I'm happy to implement it if it seems better than the alternative. Again, I imagine this would be more of a problem in the C code, because the temporary result object would have to involve another sorted AVL tree, which is 'merged' back into the main output once all edges with a given parent node have been processed.
> Perhaps we could do something hack-ey like filtering segments below `min_length / 5`. Or, just apply `min_length` to the per-path segments, and still return them squashed?
If we are to do one of these two, personally I'd prefer the first. In general I think it would be super confusing if the algorithm used different definitions of IBD for different input arguments.
I'm not sure if that does it, though? Suppose that our threshold is 10kb and there are two 6kb blocks that pass through various other ancestors before being merged in an ancestor; wouldn't we filter these out at the earlier ancestors even though we want them, eventually?
Sorry Peter, I missed this! So what you're referring to is the list of ancestral segments (an internal object) whereas I'm referring to the outputted list of IBD segments -- but -- you're totally right, there are memory implications of this too.
Hey all!
I discovered this issue after some head scratching while I was trying to understand what's going on with IBD segments extracted from simplified vs non-simplified tree sequences simulated with SLiM (I've been working on a rudimentary interface for working with IBD data for slendr). Entirely my fault: the tutorial clearly explains what is going on with those "broken up" IBD segments, I just hadn't connected the dots when reading it.
I'm now interested in finding the best approach to filter for the length of the "complete" IBD segments (i.e. "squashed" adjacent IBDs between pairs of nodes with the same MRCA node). I read through the discussion in this issue and looked through some related PRs while hunting for some clever solution to this.
Given that this issue is still open, I suspect that nothing of this sort is yet implemented in a general way at the tskit level. Am I correct?
Just to be clear, I totally appreciate how tricky this problem is. I only wanted to make sure I haven't missed something -- either something in tskit itself, or a clever solution not yet merged in!
Once again, thank you for all the amazing work you've put into tskit! It's astonishing how easy it makes it to just look at the IBDs along the tree sequence, plot the trees, look at summary tables... incredible stuff.
Hey @bodkan, great to hear this is useful to you! I think the short answer is "no", we've not done anything to follow up on this, and I'm not sure when we'll get around to it. @gtsambos might have more to say?
Hi @bodkan, sorry I missed this yesterday -- I meant to respond in the evening and then got pulled away by something else.
This caveat is turning out to be a bit of a pain for a few different users 😅 I don't currently have anything in the works here -- there are some people in the Browning lab here at UW who have expressed some interest in a fix too, and who I'll be meeting with in the next few weeks (mostly to explain why it is such a difficult problem). As you've noticed, it's a tricky technical issue.
At the moment, my main recommendation is on the simulation side: if it is at all tractable for you, I'd suggest using this method only on a tree sequence generated with a DTWF simulation, and with the `keep_unary=True` option if you then use `simplify()`. It is somewhat counterintuitive, but having these extra edges in your tree sequence should force `ibd_segments` to produce squashed output by default.
Btw, this is because `ibd_segments` uses common sequences of edges to determine where IBD segments start and end. In regular coalescent trees (like those returned by the default `simplify()`), the same genealogical path may be represented by different sequences of edges in different parts of the genome. This happens because some nodes on that path may be coalescent in one genomic region and not in another. However, forcing the simulation to keep a record of these nodes everywhere should fix that problem.
Ah, that's interesting - some changes in the pipeline in msprime from @GertjanBisschop to allow storing unary nodes in a granular way should help?
Ah, I wasn't aware of those changes 👀 sounds like it would be very useful indeed! This is also related to the work that @petrelharp is talking about in an upcoming tskit meeting (though I believe he is thinking about inferring these extra nodes back onto trees that don't have them).
Yes, it's all closely related. Unary nodes FTW!
One of the most important comments I got in my thesis examination was regarding our definition of IBD. Our current implementation of `ibd_segments` treats segments as distinct if they have been inherited along distinct 'genealogical paths', which are essentially sequences of edges. So in this instance, there will be 3 separate IBD segments for the sample pair (0, 1), because the paths in the middle tree explicitly contain some extra intermediary nodes. (This has caused confusion for other users before: see #1479.)
I was pretty sure this behaviour was desirable until I read this comment of Graham's:
This is a super valid point, and kind of major imo. In many of the tree sequences we come across, nodes like (2) and (3) will be coalescence nodes that relate our samples to some other sample that has recombined at the breakpoints (who we are perhaps uninterested in). It's bad if our definition of IBD between the pair (0, 1) depends on whether those other samples are included in the trees or not, and currently it does.
We've certainly talked about doing this before, but I'm unsure of how much effort it will take. I'm about to push up a draft PR that implements these changes in our Python `IbdFinder`, but I'm not sure this would translate well to the C code, given the fancy struct that Jerome made for the output. The necessary algorithms are themselves fairly simple -- a sort method and a squash method, both very similar to those used on edge tables. (pinging @jeromekelleher, @petrelharp)