Efficient interval intersections

hyanwong commented 9 months ago

In the sample_resolve algorithm, there is a comment taken from an implementation of ARG traversals by Jerome:

# JEROME'S NOTE (from previous algorithm): if we had an index here
# sorted by left coord we could binary search to first match, and
# could break once c_right > left (I think?), rather than having
# to intersect all the intervals with the current c_rgt/c_lft

I'm wondering what the most efficient thing to do is here. In particular, we have a set of intervals associated with a node u, and want to intersect them with another set of intervals in the edges above that node. If we implement #70 we can be assured that the set of intervals above u is ordered by left coordinate.

In the current implementation, when we create the set of intervals initially associated with u, we add them to a Portion object, which I imagine keeps them in order. In this case the intersection routine can be quite efficient, I imagine, and only requires a single pass for all the intervals (advancing through the 2 sorted lists). It is unclear to me if the Portion library provides an API to an efficient intersection routine like this.

In the case of the find_mrca_regions algorithm, it is more complex, because we the associated intervals may have different offsets / deltas and orientations, and the information for those needs to be kept, and passed up. Some extra thought might be required here, especially as these intervals could overlap (if there are e.g. duplications). It could be that this is another role for an IntervalDict.

duncanMR commented 9 months ago

I'm not sure it's worth optimising sample_resolve too much, since it's not something we have to run repeatedly like find_mrca_regions. In the latter case, wouldn't your suggestion in #69 mean that the edges won't necessarily be sorted by left coordinate, since we would be continually adding edges without sorting and freezing the tables?

hyanwong commented 9 months ago

I agree that sample_resolve (or a more complex tskit-like simplify() routine) will only be run every few generations, like SLiM does, whereas find_mrca_regions will be run repeatedly within each generation, so it's worth optimising for that.

You are also right that #69 might cause problems. But I think that when we are simulating in forwards time, we will add edges for each child as we create the child, so the edge table should remain ordered by child node. And presumably in forward-time we can assure that we generate children in time order, so that we retain edges sorted by time of child node (oldest first).

Even when we simplify to remove intermediate nodes (not yet coded), I think the natural ordering will be by child.

hyanwong commented 8 months ago

Note that it is definitely now the interval arithmetic that is the slowest part of find_mrca_regions, and hence the slowest part of the forward simulation framework. See #86

hyanwong / giglib

Efficient interval intersections #71