tskit-dev / msprime

Simulate genealogical trees and genomic sequence data using population genetic models
GNU General Public License v3.0
175 stars 87 forks source link

Use ``ancestral_to`` field on segments instead of global lookup table. #1993

Closed jeromekelleher closed 10 months ago

jeromekelleher commented 2 years ago

A fundamental part of the work in coalescent simulation is to keep track of the amount of ancestral material remaining for each interval along the genome. In msprime we currently do this using an AVL tree (S) (see here), and we update this structure during common ancestor events (e.g., here).

It turns out this is unnecessary and we can achieve the same thing by adding an ancestral_to field to each segment, which counts the number samples it is ancestral to. Then, when we merge segments at coalescence, we simply add the ancestral_to values.

See this code for how the simulation works when we track common ancestry by segment.

This should simplify the main interval overlap routines in msprime, and also result in a significant performance boost. I think a substantial fraction of the current simulation time (say, 20% for a long genome and large sample size) is taken up with AVL tree operations to maintain S.

jeromekelleher commented 10 months ago

I think you demonstrated that this is actually detrimental to performance @GertjanBisschop - can you comment please?

GertjanBisschop commented 10 months ago

Yep, that is true. The main reason was that a lot of segments cannot be squashed when keeping track of a single ancestral_to value per segment, as squashing can only happen for segments with identical ancestral_to values.

jeromekelleher commented 10 months ago

Closing this as "this was a bad idea" then!