tskit-dev / msprime-1.0-paper

Publication describing msprime 1.0

Figure: impact of including gene conversion #44

Closed hyanwong closed 3 years ago

hyanwong commented 3 years ago

E.g. for human/mammalian levels of GC, which are usually about 10x the recombination rate.

Should show the effect on runtime (e.g. of a stdpopsim model) and on the size of the resulting tree sequence.

We might want to also show how much of the genome is actually affected by this (i.e. not much => you can often get away with not doing it), but that's getting into the weeds of analysing GC itself, rather than the implementation in msprime, so possibly not worth it.

fbaumdicker commented 3 years ago

Yes, that would be interesting to consider.

I am not sure when you can confidently not include GC. Certainly not when you are interested in short-range LD. In Williams et al. (2015) https://doi.org/10.7554/eLife.04637 it is estimated that

roughly 17 (95% CI 13–21) variable sites are expected to experience NCO in each meiosis. This estimate is on the same order as the number of sites affected by de novo mutation in each generation

hyanwong commented 3 years ago

Useful reference, thanks.

I guess most people simply consider GC events as a bit like an extra mutational process (rather than a genealogical process).

The total amount of the genome / ancestry affected by gene conversion is, however, relatively small, I seem to recall. I was going to simulate using a rate of 10 times the recombination rate with an exponentially distributed tract length of mean 300bp, citing Jeffreys 2004, but maybe there are better params and more recent references to use.
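For a sense of scale, here is a back-of-envelope sketch using those proposed numbers; the 1.25e-7 rate and 300bp tract mean are just the thread's working figures, not settled estimates:

```python
# Expected sequence newly covered by gene-conversion tracts per meiosis
# on a 10 Mb region, with a GC initiation rate 10x a 1.25e-8 crossover rate.
gc_rate = 1.25e-7    # GC initiations per bp per generation (assumed)
tract_mean = 300     # mean tract length in bp (assumed, exponential)
L = 1e7              # sequence length in bp

expected_tracts = gc_rate * L               # ~1.25 tracts per meiosis
expected_bp = expected_tracts * tract_mean  # ~375 bp, a tiny fraction of L
```

So per meiosis only a few hundred bp of a 10 Mb region sit in new tracts, consistent with "not much of the genome is affected" above.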

hyanwong commented 3 years ago

Some rough figures for you @jeromekelleher (updated with 20 replicates). Seems like setting the recombination rate in one simulation to the combined (GC + recombination) rate of another is not really equivalent: the recombination-only simulation takes roughly twice as long and produces about half the number of trees:

msprime.sim_ancestry(
    1000, sequence_length=1e7, recombination_rate=1.25e-8, population_size=20000,
)
  1. Basic sim av: 3.14 secs, 74819.8 trees

msprime.sim_ancestry(
    1000, sequence_length=1e7, recombination_rate=1.25e-8, population_size=20000,
    gene_conversion_rate=1.25e-7, gene_conversion_tract_length=300,
)
  2. Sim with 10x GC added av: 143.90 secs (46x longer), 1456293.35 trees (~20 times larger)

msprime.sim_ancestry(
    1000, sequence_length=1e7, recombination_rate=1.25e-8 + 1.25e-7, population_size=20000,
)
  3. Sim with recombination = value of rho+GC from sim 2: av 240 secs (1.67x longer than sim 2), 792691.0 trees (slightly over half as many as sim 2)

hyanwong commented 3 years ago

Here's an example of keeping the total recombination rate (crossover + GC) the same at 1e-7 but varying the proportion of each. The human rate corresponds roughly to the 0.9 point (recombination_rate=1e-8, gene_conversion_rate=9e-8). Not sure if this is the right set of params, and we probably shouldn't use a 2-axis scale in the final plot. But it's not quite the pattern I think we were expecting.

[Screenshot 2021-07-17: runtime and number of trees as the crossover/GC proportion varies]

import msprime
import numpy as np
import time

for param in range(11):
    times = []
    num_trees = []
    rho = 1e-7 * param/10
    gc = 1e-7 - rho
    start = time.time()
    for ts in msprime.sim_ancestry(
        1000, sequence_length=1e7, population_size=20_000,
        num_replicates=10,
        recombination_rate=rho,
        gene_conversion_rate=gc,
        gene_conversion_tract_length=300,
    ):
        times.append(time.time()-start)
        num_trees.append(ts.num_trees)
        start = time.time()
    print(f"rho={rho:g}, gc={gc:g}: av took {np.mean(times)} secs for {np.mean(num_trees)} trees")
jeromekelleher commented 3 years ago

This is roughly making sense to me. I guess GCs with such a short tract length are (very) roughly equivalent to two recombination events, in terms of their effect on the ancestral material and how it's distributed, so maybe we should take that into account when computing the rates? So, r_equiv = r + gc_rate / 2 or something?

That would be a very rough approximation, because it's quite complicated how the ancestral material where recombinations/gc events can occur increases as the result of one of these events.

We probably don't need to get too deep into it, but the basic take home that adding in GC will slow things down and create more trees in the basically same way as adding more recombination, is what we want to get across.

hyanwong commented 3 years ago

This is roughly making sense to me. I guess GCs with such a short tract length is (very) roughly equivalent to two recombination events

Oh yes, of course. I'll divide by 2 as you say.

We probably don't need to get too deep into it, but the basic take home that adding in GC will slow things down and create more trees in the basically same way as adding more recombination, is what we want to get across.

Yes, but it's interesting that the time actually drops, right? So in terms of CPU, it's not equivalent to simply adding more recombinations (I think?)
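A minimal sketch of the "divide by 2" bookkeeping being agreed here, assuming each GC tract counts as two crossover breakpoints, so that rho + 2*gc is held fixed across a sweep (`split_rates` is a hypothetical helper, not an msprime function):

```python
def split_rates(total, crossover_fraction):
    """Split a fixed effective breakpoint rate between crossover (rho) and
    gene conversion (gc), assuming each GC event ~ two crossovers."""
    rho = total * crossover_fraction
    gc = (total - rho) / 2  # divide the GC share by 2, as discussed
    return rho, gc

# e.g. a human-like crossover rate of 1e-8 out of an effective total of 2e-7
rho, gc = split_rates(2e-7, 0.05)  # gc comes out at 9.5e-8
```

Every (rho, gc) pair from this helper has the same effective breakpoint rate, which is what makes the sweep comparable.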

jeromekelleher commented 3 years ago

Yes, but it's interesting that the time actually drops, right? So in terms of CPU, it's not equivalent to simply adding more recombinations (I think?)

I think this is because a recombination is more likely to result in trapped genetic material. I guess we could quantify this by doing a full arg simulation, and counting the number of RE nodes? (Although, I guess we should have GC nodes in there too, to be consistent)

edit I think we error out on full arg sims if gc != 0, so we can't do it that way.

hyanwong commented 3 years ago

Here's the updated data with 1Mb not 10Mb - you're spot on with each GC counting as 2x recombinations:

[Screenshot 2021-07-18: runtime and number of trees with the GC rate halved to keep the effective rate constant]

Roughly realistic human-like figures are represented by the last plot point but one.

import msprime
import numpy as np
import time

for param in range(21):
    times = []
    num_trees = []
    rho = 2e-7 * param/20
    gc = (2e-7 - rho) / 2
    start = time.time()
    for ts in msprime.sim_ancestry(
        1000, sequence_length=1e6, population_size=20_000,
        num_replicates=10,
        recombination_rate=rho,
        gene_conversion_rate=gc,
        gene_conversion_tract_length=300,
    ):
        times.append(time.time()-start)
        num_trees.append(ts.num_trees)
        start = time.time()
    print(f"rho={rho:g}, gc={gc:g}: av took {np.mean(times)} secs for {np.mean(num_trees)} trees")
fbaumdicker commented 3 years ago

Yes, but it's interesting that the time actually drops, right? So in terms of CPU, it's not equivalent to simply adding more recombinations (I think?)

I think this is because a recombination is more likely to result in trapped genetic material. I guess we could quantify this by doing a full arg simulation, and counting the number of RE nodes? (Although, I guess we should have GC nodes in there too, to be consistent)

edit I think we error out on full arg sims if gc != 0, so we can't do it that way.

I don't get why recombinations are more likely to produce trapped genetic material? Two gc events that are subsequently coalesced would also produce trapped material, wouldn't they? But the gc events themselves are, most of the time, without any effect on the ancestral graph if they fall within such (long enough) trapped regions. In these cases, the event is resolved much faster, as no rates have to be updated.

Coincidentally, I have already added a short sentence in the gene conversion section that one gc event corresponds to two recombination events and a coalescence event. So if you plan to add something in that direction you could build upon that part.

hyanwong commented 3 years ago

As an overview, as a user adding GC I think I would want to know that it increases the size of the tree sequence by the equivalent of adding twice the GC rate to the recombination rate, but that it only increases the run time by XXX, where I have some handle on how large XXX is likely to be (possibly also expressed in terms of the extra equivalent recombination rate).

hyanwong commented 3 years ago

edit I think we error out on full arg sims if gc != 0, so we can't do it that way.

If so, I think we should note this in the record_full_arg docstring or the [Ancestral Recombination Graph](https://tskit.dev/msprime/docs/stable/ancestry.html#ancestral-recombination-graph) doc section, shouldn't we?

jeromekelleher commented 3 years ago

I don't get why recombinations are more likely to produce trapped genetic material?

So I guess the bit that we snip out is likely to contribute trapped material in much the same way as a recombination, but the "donor" segment won't? So, if a recombination event happened, we'll have generated two ancestral segments that are sparse for ancestral material, but with GC we only generate one. Does that make sense to you @fbaumdicker?

We could do some simulations on this by using the simulator object to count various things (sidestep the ARG limitation), if we're interested.

hyanwong commented 3 years ago

I just tried @jeromekelleher and I don't get an error using GC with record_full_arg?

fbaumdicker commented 3 years ago

So I guess the bit that we snip out is likely to contribute trapped material in much the same way as a recombination, but the "donor" segment won't? So, if a recombination event happened, we'll have generated two ancestral segments that are sparse for ancestral material, but with GC we only generate one. Does that make sense to you @fbaumdicker?

Ah, yes that makes sense @jeromekelleher. Although the lineage with the bit we snip out will produce trapped material to the left and the right of the snip. I think another aspect of this is that gene conversion without any recombination could in principle result in more trapped ancestral material than a simulation with recombination. If no recombination is included the merger of two previously created gc tracts will always create a segment with a large gap in between the snips. Such segments will be frequently split apart again by recombination, removing the trapped material in between until the next merger occurs. Without recombination, while the total number of lineages will be smaller, the number of segments with large gaps will be higher.

Admittedly, as my use case of bacterial evolution does not involve recombination, I am only interested as long as I do not need to spend much time on the analysis.

However, I already added a counter for so-called noneffective gc events within the trapped material. So if you are interested you can have a look at num_noneffective_gc_events, which is initialized here. https://github.com/tskit-dev/msprime/blob/46ad46fefef56d00d549502ac0e9679b02aa6005/lib/msprime.c#L3722

fbaumdicker commented 3 years ago

I just tried @jeromekelleher and I don't get an error using GC with record_full_arg?

Hm, but maybe we should? The gene conversion part is not affected by the record_full_arg flag. I did not think about whether this is necessary and what I would have to add in this case, yet.

hyanwong commented 3 years ago

I just tried @jeromekelleher and I don't get an error using GC with record_full_arg?

Hm, but maybe we should? The gene conversion part is not affected by the record_full_arg flag. I did not think about whether this is necessary and what I would have to add in this case, yet.

Yes, I think we probably should error out if we don't account for it.

jeromekelleher commented 3 years ago

Yes, we missed this one, I've created an issue: https://github.com/tskit-dev/msprime/issues/1774

hyanwong commented 3 years ago

So we can reasonably claim (both from experiment and theory) that the effect of including GC at rate g on the number of trees is simply equivalent to increasing the recombination rate r to r + 2g. I think we can just say that. However, the runtime costs are not so obvious, I think. Here's another way of looking at it. The red line is where all the rec rate is crossover recombination; the black line is where the crossover rec rate is set to 1e-8 and the change is all in the GC rate.

[Screenshot 2021-07-20: runtime, all-crossover (red) vs fixed crossover of 1e-8 with varying GC (black)]

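The r + 2g equivalence for breakpoint numbers can be sanity-checked without msprime with a toy Poisson model: each crossover contributes one breakpoint and each GC tract two (its start and end). This is only breakpoint bookkeeping, not the coalescent, so it says nothing about runtime:

```python
import random

def mean_breakpoints(r, gc, L=1e7, reps=4000, seed=1):
    """Average breakpoints per meiosis under a toy Poisson model:
    Poisson(r*L) crossovers (one breakpoint each) plus Poisson(gc*L)
    gene-conversion tracts (two breakpoints each)."""
    rng = random.Random(seed)

    def poisson(rate):
        # count exponential(rate) gaps falling in the unit interval
        if rate == 0:
            return 0
        n, x = 0, rng.expovariate(rate)
        while x < 1.0:
            n += 1
            x += rng.expovariate(rate)
        return n

    return sum(poisson(r * L) + 2 * poisson(gc * L) for _ in range(reps)) / reps

# All-crossover at r + 2g vs crossover r plus GC at g: both should give
# similar breakpoint counts (human-ish rates from this thread).
pure = mean_breakpoints(1e-8 + 2 * 9e-8, 0.0)
mixed = mean_breakpoints(1e-8, 9e-8, seed=2)
```

Both configurations average about 1.9 breakpoints per meiosis here, matching the claim that only the combined rate matters for the number of trees.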
jeromekelleher commented 3 years ago

What do you think @fbaumdicker, would a version of this go well as a two-panel figure along with the SimBac comparison?

fbaumdicker commented 3 years ago

Yes, why not? This would show the performance for the two most common use cases: human-like data and bacterial recombination. We just have to make clear that one of the panels includes recombination and the other does not.

hyanwong commented 3 years ago

How do we want the scales on this plot? Had we decided for or against log time? (I know Jerome isn't keen.)

fbaumdicker commented 3 years ago

It would be an easy change, so feel free to create a log version for discussion. But in the other thread @jeromekelleher and I sort of agreed that using the normal scaling gives a better impression of the massive speed improvement of msprime.

jeromekelleher commented 3 years ago

Yes, let's stick with the natural scale for the bacterial comparison.

Why do you want a log scale here, Yan?

hyanwong commented 3 years ago

Only because we might want to show the effect of "human-like" GC values of 1e-7 up to 5e-6 combined with a crossover rate of 1e-8. I chose the GC values to be roughly equally spaced on a log scale (e.g. 0, 1e-8, 2e-8, 5e-8, 1e-7, 2e-7, 5e-7, 1e-6, 2e-6, 5e-6), which seemed sensible. Also, when we get to GC of 5e-6, the timings for that datapoint dwarf everything else. But I'm happy with whatever other people prefer.

fbaumdicker commented 3 years ago

We could also use a log scale in the human gc panel and a natural scale in the bacterial simulator panel?

jeromekelleher commented 3 years ago

I don't think this figure is worth the effort, to be honest. We've not really studied it properly (i.e., outside human parameter regimes), and we stand a good chance of misleading people with overly broad statements. We have no real idea how this is going to behave for varying population sizes, for example. The point is to help people predict how long a given simulation will take, and we can give some good guidance on that.

What we can say is that (a) we expect gene conversions to generate breakpoints at roughly twice the rate of crossovers, and output tree sequences will have correspondingly more edges and trees; (b) since gene conversions don't generate as much trapped material as crossovers, simulations probably won't take as long as the equivalent amount of crossover; and (c) the general advice of predicting your run-time by running simulations for smaller genome sizes and fitting a quadratic is probably good, since this should at worst give you an over-estimate.
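Point (c) can be made concrete with a quick sketch. The three runtimes below are made-up placeholder numbers, not measurements; the pattern is to fit an exact quadratic through timings at smaller lengths and extrapolate:

```python
def quadratic_predictor(points):
    """Exact quadratic through three (sequence_length, runtime) points
    via Lagrange interpolation; returns a function to extrapolate."""
    (x0, y0), (x1, y1), (x2, y2) = points

    def predict(L):
        return (y0 * (L - x1) * (L - x2) / ((x0 - x1) * (x0 - x2))
                + y1 * (L - x0) * (L - x2) / ((x1 - x0) * (x1 - x2))
                + y2 * (L - x0) * (L - x1) / ((x2 - x0) * (x2 - x1)))

    return predict

# hypothetical timings at 1, 2 and 4 Mb; extrapolate to 10 Mb
predict = quadratic_predictor([(1e6, 0.5), (2e6, 1.4), (4e6, 4.4)])
estimate = predict(1e7)  # 23.0 seconds under this toy fit
```

Since runtime grows at most quadratically in sequence length here, this extrapolation should err on the high side, as (c) suggests.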

hyanwong commented 3 years ago

Fair point. If we can simply say (b) then we probably don't need a figure. But I wonder if we should indicate that this is borne out by (albeit limited) testing?

jeromekelleher commented 3 years ago

Closed in #118.