Closed: gtsambos closed this issue 1 week ago
Hmm, this could be tricky, although it depends on how we order things. Ideally we'd squash "in place" as the segments are added to the structure.
Do you know if this is possible?
> Ideally we'd squash "in place" as the segments are added to the structure. Do you know if this is possible?
I was thinking about this too. The problem is that the segments aren't discovered in order, so we would have to be constantly re-sorting the segments (as this is required to squash them).
(To be more specific: the segments should be discovered in a sequence that is ordered by node ID, but not by left/right coordinates necessarily, and not by the sample pair they belong to)
If there's some kind of temporary list-like structure that we could use to hold the segments while we process all the segments with a given ancestral node, we could squash them in small batches before adding them to the final output? I don't know whether that's an even more complicated option, though.
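To make the idea concrete, here is a rough Python sketch of that batching scheme (the function names and the bare `(left, right)` tuples are hypothetical, not tskit's internals):

```python
# Sketch of squashing segments in per-ancestor batches before they reach
# the final output. All names here are hypothetical, not tskit's API.

def squash(batch):
    """Sort a batch of (left, right) segments and merge abutting ones."""
    merged = []
    for left, right in sorted(batch):
        if merged and merged[-1][1] == left:
            merged[-1] = (merged[-1][0], right)  # extend the previous segment
        else:
            merged.append((left, right))
    return merged

def flush_batch(output, pair, node, batch):
    """Squash one ancestor's batch, then append it to the final output."""
    for left, right in squash(batch):
        output.setdefault(pair, []).append((left, right, node))

output = {}
# Segments for ancestral node 4 arrive out of coordinate order:
flush_batch(output, (1, 2), 4, [(0.2, 1.0), (0.0, 0.2)])
print(output[(1, 2)])  # [(0.0, 1.0, 4)]
```

Each batch is small (only the segments under one ancestral node), so the repeated sorting stays cheap even though the full segment stream is not coordinate-ordered.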
Hmm. Would you mind putting in a concrete example here please (as in, the output of tskit's code)? I'm having trouble seeing exactly what needs to be done.
So, in this example from the docs,
(0, 1) [IdentitySegment(left=2.0, right=10.0, node=4), IdentitySegment(left=0.0, right=2.0, node=5)]
(0, 2) [IdentitySegment(left=2.0, right=10.0, node=3), IdentitySegment(left=0.0, right=2.0, node=5)]
(1, 2) [IdentitySegment(left=0.0, right=2.0, node=4), IdentitySegment(left=2.0, right=10.0, node=4)]
we'd squash the segments for (1, 2) into one?
> So, in this example from the docs, we'd squash the segment for (1,2) into one?
yup, that's right. Here's another example:
#
#        5           |
#       / \          |
#      /   4         |        4
#     /   / \        |       / \
#    /   /   \       |      /   \
#   /   /     \      |     3     \
#  /   /       \     |    / \     \
# 0   1         2    |   0   2     1
#                    |
#                   0.2
As currently coded, `ibd_segments` returns two segments for the pair (1, 2):
[IdentitySegment(left=0.0, right=0.2, node=4), IdentitySegment(left=0.2, right=1.0, node=4)]
because in the first tree there's an edge directly joining 2 to 4, but in the second there's a sequence of edges joining 2 -> 3 -> 4. Under this proposed change there'd be just one segment:
[IdentitySegment(left=0.0, right=1.0, node=4)]
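The squash operation itself is simple; here is a minimal sketch in plain Python, using `(left, right, node)` tuples in place of tskit's `IdentitySegment` objects (a sketch only, not tskit's implementation):

```python
def squash(segments):
    """Merge runs of abutting segments that share the same ancestral node.

    Assumes `segments` is already sorted by (node, left).
    """
    out = []
    for left, right, node in segments:
        if out and out[-1][2] == node and out[-1][1] == left:
            out[-1] = (out[-1][0], right, node)  # extend the previous run
        else:
            out.append((left, right, node))
    return out

# The two path-distinct segments for the pair (1, 2) above:
print(squash([(0.0, 0.2, 4), (0.2, 1.0, 4)]))  # [(0.0, 1.0, 4)]
```

Segments are only merged when both the node matches and the coordinates abut, so genuinely distinct segments are left untouched.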
That's helpful, thanks. Would you mind pasting in a larger example (no need for a drawing) that shows what the current sorting of segments is, and what the required output would be?
Sure, here's an example of some output I pulled out of a simulated tree sequence:
(4, 9): [IdentitySegment(left=6.0, right=10.0, node=37),
         IdentitySegment(left=0.0, right=2.0, node=42),
         IdentitySegment(left=5.0, right=6.0, node=42),
         IdentitySegment(left=2.0, right=5.0, node=42)]
See how it is sorted by node, but the segments with common ancestor 42 are not sorted by coordinates. The correctly sorted output would look like this:
(4, 9): [IdentitySegment(left=6.0, right=10.0, node=37),
         IdentitySegment(left=0.0, right=2.0, node=42),
         IdentitySegment(left=2.0, right=5.0, node=42),
         IdentitySegment(left=5.0, right=6.0, node=42)]
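In Python terms, the required ordering is just a sort keyed on `(node, left)` (tuples stand in for `IdentitySegment` here; this is a sketch, not tskit's code):

```python
# The (4, 9) segments as discovered: node-sorted, but not coordinate-sorted.
segments = [
    (6.0, 10.0, 37),
    (0.0, 2.0, 42),
    (5.0, 6.0, 42),
    (2.0, 5.0, 42),
]

# Re-sort by ancestral node, then by left coordinate within each node.
segments.sort(key=lambda seg: (seg[2], seg[0]))
print(segments)
# [(6.0, 10.0, 37), (0.0, 2.0, 42), (2.0, 5.0, 42), (5.0, 6.0, 42)]
```

Once sorted like this, the three node-42 segments abut and could be squashed into a single segment spanning (0.0, 6.0).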
Right. So, I guess to do this we'd need to update the in-memory data structure to use an AVL tree sorted by left coordinate, and then squash as we go.
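As a rough stand-in for an AVL tree, Python's `bisect` module can illustrate the squash-as-you-go idea with one sorted list of `(left, right)` pairs per ancestral node (a sketch only, not the C implementation):

```python
import bisect
from collections import defaultdict

# One list of (left, right) segments per ancestral node, kept sorted by
# left coordinate -- a plain-list stand-in for a sorted AVL tree.
segments_by_node = defaultdict(list)

def add_segment(node, left, right):
    """Insert a segment in sorted position, squashing with its neighbours."""
    segs = segments_by_node[node]
    i = bisect.bisect_left(segs, (left, right))
    # Squash with the predecessor if it abuts on the left...
    if i > 0 and segs[i - 1][1] == left:
        left = segs[i - 1][0]
        i -= 1
        segs.pop(i)
    # ...and with the successor if it abuts on the right.
    if i < len(segs) and segs[i][0] == right:
        right = segs[i][1]
        segs.pop(i)
    segs.insert(i, (left, right))

# The node-42 segments from the example above, in discovery order:
for l, r in [(0.0, 2.0), (5.0, 6.0), (2.0, 5.0)]:
    add_segment(42, l, r)
print(segments_by_node[42])  # [(0.0, 6.0)]
```

Note that the final `(2.0, 5.0)` insertion bridges its two neighbours, so all three segments collapse into one in a single step.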
This is a bit problematic because the numbers that we are currently reporting when we don't store segments won't be accurate. It'll be a pretty disruptive change I think, and will need careful thought on how to manage it.
I guess the simple approach as you say is to sort and squash afterwards, but then that leaves significant room for people to do the wrong thing. We'd prefer if the number of IBD segments was the "correct" thing by default.
> This is a bit problematic because the numbers that we are currently reporting when we don't store segments won't be accurate. It'll be a pretty disruptive change I think, and will need careful thought on how to manage it.
> I guess the simple approach as you say is to sort and squash afterwards, but then that leaves significant room for people to do the wrong thing. We'd prefer if the number of IBD segments was the "correct" thing by default.
Yeah, I can see it could be tricky to implement; it would change all of the benchmarking results I have now as well. It might also interact with the way we use the `length` argument. (edit: actually, I think we'd have to ask users not to use the `length` argument)
I'm not following the algorithmic details, but FWIW I totally agree with the original point.
In my latest changes to #2460 (pushed last week, but I'm only getting round to talking about it now -- sorry!!), I've added some Python code that sorts and squashes the IBD segments as they are added to the output object. The sorting procedure isn't (and can't be) done exactly as it would be in the C code, since there are no AVL trees in the Python code, but hopefully the squashing part looks relevant.
Some thoughts about consequences of this change:
- `min_span`: annoyingly, we can no longer filter the segments by their length until after all IBD segments from a given ancestral node have been calculated. (In the PR, I actually do it all at once, post hoc.) This is because the squashed segments may exceed `min_span` even if the raw unsquashed segments do not.
- Methods for computing the number of segments and other summary statistics: these may need to be changed to account for the fact that existing IBD segments may be squashed with segments that are discovered later.
- Computational efficiency: obviously, this change adds a bunch of new steps to the procedure. In the Python PR, these changes have a noticeable impact on the time needed to run some of the unit tests :( I comment on it here
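A toy illustration of the `min_span` point, using bare `(left, right)` tuples and hypothetical helper names (not tskit's API):

```python
def squash(segments):
    """Merge abutting (left, right) segments; input sorted by left."""
    out = []
    for left, right in segments:
        if out and out[-1][1] == left:
            out[-1] = (out[-1][0], right)
        else:
            out.append((left, right))
    return out

def filter_min_span(segments, min_span):
    """Keep only segments at least min_span long."""
    return [s for s in segments if s[1] - s[0] >= min_span]

# Two short per-path segments under the same ancestral node:
raw = [(0.0, 0.4), (0.4, 0.8)]

# Filtering before squashing discards everything:
print(squash(filter_min_span(raw, min_span=0.5)))  # []

# Filtering after squashing keeps the merged 0.8-long segment:
print(filter_min_span(squash(raw), min_span=0.5))  # [(0.0, 0.8)]
```

This is why, in the sketch above, the length filter can only be applied once all segments for a given ancestral node have been squashed.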
This point about length filtering seems like a big problem - doesn't that mean that we have to keep track of all IBD segments, basically forever, even with `min_length`? And, that's bad, since there are too many of them? Perhaps we could do something hack-ey like filtering segments below `min_length / 5`. Or, just apply `min_length` to the per-path segments, and still return them squashed?
> doesn't that mean that we have to keep track of all IBD segments, basically forever, even with `min_length`?
It's definitely a pain. What we could do is
This is a more involved change than what I've currently put in the PR, but I'm happy to implement it if it seems better than the alternative. Again, I imagine this would be more of a problem in the C code, because the temporary result object would have to involve another sorted AVL tree, which is 'merged' back into the main output once all edges with a given parent node have been processed.
> Perhaps we could do something hack-ey like filtering segments below `min_length / 5`. Or, just apply `min_length` to the per-path segments, and still return them squashed?
If we are to do one of these two, personally I'd prefer the first. In general I think it would be super confusing if the algorithm used different definitions of IBD for different input arguments.
I'm not sure if that does it, though? Suppose that our threshold is 10kb and there are two 6kb blocks that pass through various other ancestors before being merged in an ancestor; wouldn't we filter these out at the earlier ancestors even though we want them, eventually?
Sorry Peter, I missed this! So what you're referring to is the list of ancestral segments (an internal object) whereas I'm referring to the outputted list of IBD segments -- but -- you're totally right, there are memory implications of this too.
Hey all!
I discovered this issue after some head scratching while I was trying to understand what's going on with IBD segments extracted from simplified vs non-simplified tree sequences simulated with SLiM (I've been working on a rudimentary interface for working with IBD data for slendr). Entirely my fault: the tutorial clearly explains what is going on with those "broken up" IBD segments, I just hadn't connected the dots when reading it.
I'm now interested in finding the best approach to filter for the length of the "complete" IBD segments (i.e. "squashed" adjacent IBDs between pairs of nodes with the same MRCA node). I read through the discussion in this issue and looked through some related PRs while hunting for some clever solution to this.
Given that this issue is still open, I suspect that nothing of this sort is yet implemented in a general way at the tskit level. Am I correct?
Just to be clear, I totally appreciate how tricky this problem is. I only wanted to make sure I haven't missed something -- either something in tskit itself, or a clever solution not yet merged in!
Once again, thank you for all the amazing work you've put into tskit! It's astonishing how easy it makes it to just look at the IBDs along the tree sequence, plot the trees, look at summary tables... incredible stuff.
Hey @bodkan, great to hear this is useful to you! I think the short answer is "no", we've not done anything to follow up on this, and I'm not sure when we'll get around to it. @gtsambos might have more to say?
Hi @bodkan, sorry I missed this yesterday -- I meant to respond in the evening and then got pulled away by something else.
This caveat is turning out to be a bit of a pain for a few different users 😅 I don't currently have anything in the works here -- there are some people in the Browning lab here at UW who have expressed some interest in a fix too, and who I'll be meeting with in the next few weeks (mostly to explain why it is such a difficult problem). As you've noticed, it's a tricky technical issue.
At the moment, my main recommendation is on the simulation side: if it is at all tractable for you, I'd suggest using this method only on a tree sequence generated with a DTWF simulation, and with the `keep_unary=True` option if you then use `simplify()`. It is somewhat counterintuitive, but having these extra edges in your tree sequence should force `ibd_segments` to produce squashed output by default.
Btw, this is because `ibd_segments` uses common sequences of edges to determine where IBD segments start and end. In regular coalescent trees (like those returned by the default `simplify()`), the same genealogical path may be represented by different sequences of edges in different parts of the genome. This happens because some nodes on that path may be coalescent in one genomic region and not in another. However, forcing the simulation to keep a record of these nodes everywhere should fix that problem.
Ah, that's interesting - some changes in the pipeline in msprime from @GertjanBisschop to allow storing unary nodes in a granular way should help?
Ah, I wasn't aware of those changes 👀 sounds like it would be very useful indeed! This is also related to the work that @petrelharp is talking about in an upcoming tskit meeting (though I believe he is thinking about inferring these extra nodes back onto trees that don't have them).
Yes, it's all closely related. Unary nodes FTW!
One of the most important comments I got in my thesis examination was regarding our definition of IBD. Our current implementation of `ibd_segments` treats segments as distinct if they have been inherited along distinct 'genealogical paths', which are essentially sequences of edges. So in this instance, there will be 3 separate IBD segments for the sample pair (0, 1), because the paths in the middle tree explicitly contain some extra intermediary nodes. (This has caused confusion for other users before: see #1479.)
I was pretty sure this behaviour was desirable until I read this comment of Graham's:
This is a super valid point, and kind of major imo. In many of the tree sequences we come across, nodes like (2) and (3) will be coalescence nodes that relate our samples to some other sample that has recombined at the breakpoints (who we are perhaps uninterested in). It's bad if our definition of IBD between the pair (0, 1) depends on whether those other samples are included in the trees or not, and currently it does.
We've certainly talked about doing this before, but I'm unsure of how much effort it will take. I'm about to push up a draft PR that implements these changes in our Python `IbdFinder`, but I'm not sure this would translate well to the C code, given the fancy struct that Jerome made for the output. The necessary algorithms are themselves fairly simple -- a sort method and a squash method, both very similar to those used on edge tables. (pinging @jeromekelleher, @petrelharp)