Open hyanwong opened 9 months ago
I'm not sure it's worth optimising sample_resolve
too much, since it's not something we have to run repeatedly like find_mrca_regions
. In the latter case, wouldn't your suggestion in #69 mean that the edges won't necessarily be sorted by left coordinate, since we would be continually adding edges without sorting and freezing the tables?
I agree that sample_resolve
(or a more complex tskit-like simplify()
routine) will only be run every few generations, like SLiM does, whereas find_mrca_regions
will be run repeatedly within each generation, so it's worth optimising for that.
You are also right that #69 might cause problems. But I think that when we are simulating in forwards time, we will add edges for each child as we create the child, so the edge table should remain ordered by child node. And presumably in forward-time we can assure that we generate children in time order, so that we retain edges sorted by time of child node (oldest first).
Even when we simplify to remove intermediate nodes (not yet coded), I think the natural ordering will be by child.
Note that it is definitely now the interval arithmetic that is the slowest part of find_mrca_regions, and hence the slowest part of the forward simulation framework. See #86
In the
sample_resolve
algorithm, there is a comment taken from an implementation of ARG traversals by Jerome:I'm wondering what the most efficient thing to do is here. In particular, we have a set of intervals associated with a node u, and want to intersect them with another set of intervals in the edges above that node. If we implement #70 we can be assured that the set of intervals above u is ordered by left coordinate.
In the current implementation, when we create the set of intervals initially associated with u, we add them to a Portion object, which I imagine keeps them in order. In this case the intersection routine can be quite efficient, I imagine, and only requires a single pass for all the intervals (advancing through the 2 sorted lists). It is unclear to me if the Portion library provides an API to an efficient intersection routine like this.
In the case of the
find_mrca_regions
algorithm, it is more complex, because we the associated intervals may have different offsets / deltas and orientations, and the information for those needs to be kept, and passed up. Some extra thought might be required here, especially as these intervals could overlap (if there are e.g. duplications). It could be that this is another role for an IntervalDict.