Closed jeromekelleher closed 10 months ago
I think you demonstrated that this is actually detrimental to performance @GertjanBisschop - can you comment please?
Yep, that is true. The main reason was that a lot of segments cannot be squashed when keeping track of a single ancestral_to
value per segment, as squashing can only happen for segments with identical ancestral_to
values.
Closing this as "this was a bad idea" then!
A fundamental part of the work in coalescent simulation is to keep track of the amount of ancestral material remaining for each interval along the genome. In msprime we currently do this using an AVL tree (S) (see here), and we update this structure during common ancestor events (e.g., here).
It turns out this is unnecessary and we can achieve the same thing by adding an
ancestral_to
field to each segment, which counts the number samples it is ancestral to. Then, when we merge segments at coalescence, we simply add theancestral_to
values.See this code for how the simulation works when we track common ancestry by segment.
This should simplify the main interval overlap routines in msprime, and also result in a significant performance boost. I think a substantial fraction of the current simulation time (say, 20% for a long genome and large sample size) is taken up with AVL tree operations to maintain S.