tskit-dev / msprime

Simulate genealogical trees and genomic sequence data using population genetic models
GNU General Public License v3.0
176 stars 88 forks source link

Add option to simulate "above" local roots #2157

Open jeromekelleher opened 1 year ago

jeromekelleher commented 1 year ago

A natural addition to being able to keep a record of all genomes that we simulate through (i.e. keeping unary #2128 ) is to keep a record of the nodes that we go through after a local root has been reached.

This is simple enough to do, we just stop snipping out sections of the ancestry that have fully coalesced and change the stopping condition of the simulation.

cc @sgravel

jeromekelleher commented 1 year ago

What should we call it? stop_on_coalescence or something?

sgravel commented 1 year ago

stop_at_MRCA ?

sgravel commented 1 year ago

Perhaps simulate_above_MRCA would be less ambiguous.

Would need to make sure to catch people giving no termination criterion, though.

jeromekelleher commented 1 year ago

How about stop_at_local_mrca=True as the default? Keeping the overall stopping condition as all intervals have reached a local MRCA seems fine. The only case when we wouldn't want to do this is when we have a fixed pedigree, and I think it would be unusual for us to have full coalescence before leaving the pedigree in most situations?

But, perhaps we need to think about local and global stopping conditions explicitly?

sgravel commented 1 year ago

stop_at_local_mrca is quite clear. I'd rather be explicit about global stopping conditions. Say if someone runs a test simulation on a short span of the genome, the global stopping may be unexpected. We've had cases where we want to simulate ancestry to a specific time in the past in a DTWF model (for example to look at contributions from a given ancestor).

jeromekelleher commented 1 year ago

OK, sounds good to me. Let's call it stop_at_local_mrca so, and think about the global stopping condition separately (I'll open an issue)

jeromekelleher commented 10 months ago
                    min_overlap = 2
                    if not self.stop_at_local_mrca:
                        min_overlap = 0

                    # Update the number of extant segments.
                    if self.S[left] == min_overlap:
                        self.S[left] = 0
                        right = self.S.succ_key(left)
                    else:
                        right = left
                        while right < r_max and self.S[right] != min_overlap:
                            self.S[right] -= 1
                            right = self.S.succ_key(right)
                        alpha = self.alloc_segment(
                            left=left,
                            right=right,
                            node=u,
                            population=population_index,
                            label=label,
                        )

@GertjanBisschop and I had a look at this, and we think something like this snippet in the core merge_x_ancestors functions will do the trick.

Actually ,maybe this is more generally useful. What if we define a parameter min_local_lineages which tells us when to stop tracking the ancestry of segments along the genome? By default this would be 1 (stop when the region has coalesced). We can set it to 0 to say "never stop simulating local regions" (you can't hit zero), or if we set it to 2 (say) this would stop simulating each region when there are only two lineages left.

I'm not sure how useful the last idea is, but worth thinking about?

jeromekelleher commented 10 months ago

For reference, the current code looks like this:

                    if self.S[left] == 2:
                        self.S[left] = 0 
                        right = self.S.succ_key(left)
                    else:
                        right = left
                        while right < r_max and self.S[right] != 2:
                            self.S[right] -= 1
                            right = self.S.succ_key(right)
                        alpha = self.alloc_segment(
                            left=left,
                            right=right,
                            node=u,
                            population=population_index,
                            label=label,
                        )