Support for gene conversion in SLiM engine

nspope commented 2 years ago

With #1106 the msprime engine supports gene conversion, and the rate/mean tract length parameters are stored as attributes of stdpopsim.Contig. It'd be nice to also allow gene conversion with the SLiM engine. This would require:

Having slim_makescript add a line with initializeGeneConversion(...); to the initialize() block of the SLiM script.
Reconciling differences in parameterization between the SLiM and msprime GC models. If I'm understanding the documentation correctly, msprime independently samples the locations of simple crossovers and GC events, using separately provided rates (a constant rate in the case of GC). Whereas SLiM draws breakpoints according to rates in a supplied recombination map, then randomly assigns each event to be a simple crossover or GC according to a supplied probability. So rates of GC events can vary across the sequence in SLiM, but not in msprime (currently).
Support for heteroduplex mismatch repair -- there are two additional parameters in SLiM that control the proportion of GC tracts that are complex and the nucleotide bias resulting from mismatch repair within these tracts. These don't have an influence on ancestry but do influence stored mutations (?)

Relevant docs for msprime and SLiM.

petrelharp commented 2 years ago

Do you have input on how to set this up, @fbaumdicker?

petrelharp commented 2 years ago

Also TODO would be to add a bit to slim_makescript to ask SLiM what the GC rate is and record this in ts metadata when debugging is on, so that the tests can verify that GC is being done as requested; we can copy the way that we check recomb maps and DFEs.

nspope commented 2 years ago

With regards to SLiM's ability to model bias in nucleotide composition due to heteroduplex mismatch repair: currently, neutral mutations are added to SLiM-produced tree sequences in stdpopsim._SLiMEngine._recap_and_rescale using msprime. So, I think it'd be challenging to support mismatch repair bias because we'd have to do something complicated to keep track of mutations in complex gene conversion tracts?

For example, if there are nucleotide differences between the copy/non-copy strands within a complex gene conversion tract, then mismatches that are "repaired" to match the non-copy strand will be recorded as new mutations by SLiM. This requires knowing what mutations are present on the copy/non-copy strands during a particular complex gene conversion event. I don't think it's possible to know this, if neutral mutations are added post-SLiM-simulation on a possibly recapitated tree sequence.

So maybe we shouldn't worry about supporting nucleotide bias, and should instead focus on getting something as close as possible to the gene conversion model in msprime?

petrelharp commented 2 years ago

I agree - we can think about GC bias later.

petrelharp commented 2 years ago

OK, notes here:

msprime's gc set-up is described here
SLiM's gc set-up is described in section 1.5.6 of the manual

We have two parameters added in #1106, which are passed straight to msprime.sim_ancestry:

gene_conversion_rate
gene_conversion_length

Ok, so in msprime:

lengths are Geometric with mean gene_conversion_length
at a (Poisson) rate gene_conversion_rate (that must be constant)

Now, in SLiM, we specify:

a recombation RateMap; let's say the rate at x is r(x);
lengths are Geometric in each direction (hence negative binomial) with a total mean of both directions of meanLength
a proportion nonCrossoverFraction of DSBs are resolved as non-crossovers (so crossover rate is r * (1 - nonCrossoverFraction))
a proportion simpleConversionFraction of non-crossovers that resolve as "simple" (ie not involving heteroduplex repair)

So, we want gene_conversion_length = meanLength, and simpleConversionFraction=1.0. And, with msprime's (simple) recombination rate at x equal to recombination_rate(x),

  recombination_rate(x) = r(x) ( 1- noncrossoverFraction )
  gene_conversion_rate = r(x) * noncrossoverFraction

Then we'd like to solve for r(x) and noncrossoverFraction.

However, these two models are fundamentally incompatible with non-constant rate maps, since:

msprime only simulates a constant rate of GCs
SLiM only simulates a rate of GCs that is a constant proportion of recombinations

Well, I'm glad we're noticing this before the release. The most obvious thing to do would be to set noncrossoverFraction in a given simulation so that the expected number of GCs is equal to gene_conversion_rate * sequence_length. But, maybe there's a better thing to do? Or, a better way to parameterize this?

petrelharp commented 2 years ago

The most obvious thing to do would be to set noncrossoverFraction in a given simulation so that the expected number of GCs is equal to gene_conversion_rate * sequence_length. But, maybe there's a better thing to do? Or, a better way to parameterize this?

Another option I see is to change the way we store GC rates, effectively to match SLiM. This would make sense, as SLiM's set-up is closer to the biology (as I understand it, in diploid, sexual organisms at least). So, let's say that the crossover rate is 1e-8 and the crossover fraction is 0.1; then we'd need to simulate in SLiM a recombination rate of 1e-7 and a noncrossover fraction of 0.9; in msprime we'd simulate with a recombination rate of 1e-8 and a gc rate of 0.9e-7. This almost works, except in the case (eg of EscCol) where we don't want any crossovers; since the formula is total_rate = crossover_rate / crossover_fraction, if crossover_rate = 0 (ie we set recombination rate to 0) then this doesn't work.

Note also that we already don't match SLiM because with GC turned on, "recombination rate" in SLiM means "rate of crossovers + noncrossovers", while in stdpopsim "recombination rate" means "rate of crossovers", and I don't think we want to change that, since if we did then just turning on GC would dramatically change what "recombination rate" means, counter-intuitively. (The situation is less counter-intuitive in SLiM because of the API, but in stdpopsim that one would affect the other is not so obvious.)

So, in summary, I see three ways forward:

match msprime's parameterization (what we currently do)
match SLiM's parameterization (I think we don't want to, because of the last paragraph above)
do a mix-and-match (doesn't work, as above)

... and, I've talked myself back to where we started out at. Thinking this through: what if someone wants to say that DSBs happen at rate proportional to a given recomb map, but then 90% of them resolve as GC events? If we require separate RateMaps for recomb and GC, then they'd be passing in two identical-up-to-scaling RateMaps. And, come to think of it, it would be more realistic for e.g. humans to specify just a noncrossover_fraction and pass that straight into SLiM instead of doing through the contortions to get SLiM to simulate from msprime's flat GC rate model (which we can't do anyhow).

Okay, so now here's

allow species to have either gene_conversion_rate or gene_conversion_fraction, since the first makes sense for EscCol and the second for HomSap. We'd have to figure out what to do in the cross-cases, but we can do that.

This seems right to me, but it makes things more complicated, which I don't like.

fbaumdicker commented 2 years ago

Unfortunately, I don't currently plan to include non-constant gene conversion rates in msprime, and I believe Ben recently mentioned that he also doesn't plan to allow the use of two independent recombination maps in slim. Both would resolve the incompatibility here.

It might be worth looking at the solution @jeanrjc and @bhaller came up with when simulating bacterial recombination (= gene conversion) with slim in this paper: https://doi.org/10.24072/pcjournal.72 If we could use something similar, we would even get the same tract length distribution in msprime and slim. At least for bacteria, i.e. if the recombination rate is 0 but the gc rate > 0, this might be the best solution. However, it might also be possible to similarly include constant rate gene conversions in addition to a variable recombination map in eukaryotes.

If that doesn't solve the issue: I like the simplicity of simply defining a gene_conversion_fraction, since this is how many people think about gc in eukaryotes. This would work well in the case of constant rate maps for slim and msprime. But for non-constant recombination maps, it would not be possible to simulate them in this way in msprime. The only thing we could do is to set the constant gene conversion rate in msprime to a value such that the mean number of gene conversions is the same, with all the difficulties you describe.

The more I think about it, the more I conclude that the gene_conversion_rate is a species property, while the gene_conversion_fraction is more a recombination map property, since the gene conversion fraction estimates depend on the corresponding recombination estimates.

So maybe we should distinguish between constant and non-constant recombination maps only when gc is enabled?

Constant rates --> use msprime as currently implemented, and do the same in slim with your formula above. non-constant map --> use the constant gc rates in msprime and the fractions in slim, users need to specify how to (re)calculate the rates. We should avoid the scenario where turning on gc changes the crossover recombination rates by default.

bhaller commented 2 years ago

It might be worth looking at the solution @jeanrjc and @bhaller came up with when simulating bacterial recombination (= gene conversion) with slim in this paper: https://doi.org/10.24072/pcjournal.72 If we could use something similar, we would even get the same tract length distribution in msprime and slim. At least for bacteria, i.e. if the recombination rate is 0 but the gc rate > 0, this might be the best solution. However, it might also be possible to similarly include constant rate gene conversions in addition to a variable recombination map in eukaryotes.

Possible, yes. Quite unfortunate to not be able to use the built-in gene conversion facility in SLiM, though, since it's really pretty sophisticated – even allowing for complex gene conversion tracts and heteroduplex mismatch repairs. Abandoning that in favor of a scripted approach... well, it doesn't seem optimal. But I get that you're in a bind here.

I'm not sure I understand all the issues at play enough to comment more than that.

petrelharp commented 2 years ago

I'm not worried about getting exactly the same model in msprime and SLiM - mostly, we'd like to get the same model in both because we have to make fewer decisions. However, we shouldn't fall back to a less realistic model just to get consistency across engines. (For instance, we're not restricting ourselves to neutral sims just because msprime won't do much selection.) And, it seems to me that SLiM is the more realistic one (?), so I'm not enthusiastic about doing something more complicated with a less realistic outcome.

The more I think about it, the more I conclude that the gene_conversion_rate is a species property, while the gene_conversion_fraction is more a recombination map property, since the gene conversion fraction estimates depend on the corresponding recombination estimates.

I'm not sure what you mean? Are there recombination maps that have estimated non-constant gene conversion fractions? I thought we had at best a genome wide guess at gene conversion fraction per species?

Ah, I've thought of another alternative: remove gene_conversion_rate, and special-case bacterial: so, have a flag like bacterial_recombination that would be False for things that do crossing over, and these would have gene_conversion_fraction, while it would be True for things doing bacterial recombination, and for these "recombination rate" would be implemented as gene conversion rate (with no crossing over). So, EscCol would have recombination_rate equal to what's now in gene_conversion_rate, and bacterial_recombination=True.

bhaller commented 2 years ago

Ah, I've thought of another alternative: remove gene_conversion_rate, and special-case bacterial: so, have a flag like bacterial_recombination that would be False for things that do crossing over, and these would have gene_conversion_fraction, while it would be True for things doing bacterial recombination, and for these "recombination rate" would be implemented as gene conversion rate (with no crossing over). So, EscCol would have recombination_rate equal to what's now in gene_conversion_rate, and bacterial_recombination=True.

This sounds like an excellent proposal to me. Biologically, horizontal gene transfer in bacteria is a totally different process than crossing over in diploids, and it makes sense to treat them separately. If the gene conversion support in stdpopsim eventually goes further (with, for example, heteroduplex mismatch repair and GC-biased gene conversion), treating these two cases separately will make it much easier to expand the design.

fbaumdicker commented 2 years ago

I completely agree. Using the gene conversion parameters to mimic bacterial recombination should not limit the gene conversion features in diploids. So using a bacterial_recombination flag is the most natural thing to do and will at the same time avoid confusion. We would still need the currently used gene_conversion_tract_length parameter in bacteria as well as diploids, which we might want to rename and split into two parameters? E.g. bacterial_recombination_tract_length and gene_conversion_tract_length.

I'm not sure what you mean? Are there recombination maps that have estimated non-constant gene conversion fractions? I thought we had at best a genome wide guess at gene conversion fraction per species?

While there are some attempts to estimate the local gene conversion fractions along chromosomes, e.g. here: https://doi.org/10.1038/ejhg.2013.30 , that was not what I meant. I am concerned that if we have different recombination maps in the catalog for the same species, we might need to also use different values for the gene_conversion_fraction, depending on the recombination map and the estimation method for the gene conversion parameters. The same applies to the "gene_conversion_tract_length". This might also differ between different recombination maps. Perhaps it would be best to have both, a default value for "gene_conversion_fraction", and a "gene_conversion_fraction" for each recombination map. This is basically what we currently do with the recombination rate itself anyway.

petrelharp commented 2 years ago

We would still need the currently used gene_conversion_tract_length parameter in bacteria as well as diploids, which we might want to rename and split into two parameters? E.g. bacterial_recombination_tract_length and gene_conversion_tract_length.

I'd vote for leaving them named the same thing, since no species can have both properties, and they're implemented just the same. It'll be less confusing then adding an additional parameter.

Perhaps it would be best to have both, a default value for "gene_conversion_fraction", and a "gene_conversion_fraction" for each recombination map. This is basically what we currently do with the recombination rate itself anyway.

Is your idea to make things ready for the future in which recombination maps also estimate GC parameters/maps? Currently, are there any recombination maps that also have estimates of GC things (like, described in the same publication or something?)? We can always add the property in the future if it turns out to be important - but, is there any situation now where it'd be important to have?

petrelharp commented 2 years ago

Synthesizing the above discussion, to make things concrete (but with @fbaumdicker's suggestions still pending), here's a proposal:

add a bacterial_recombination flag and a gene_conversion_fraction parameter (must be >= 0 and strictly < 1)
if the flag is True then "recombination rate" means effectively GC rate, and the gene_conversion_tract_length must be defined; and gene_conversion_fraction must be not defined (ie None)
furthermore if the flag is True then simulations will automatically turn on "gene conversion" (ie, bacterial recombination)
if the flag is False then things work as currently implemented: by default there is no GC, and only if gene conversion is requested in the call to simulate( ) is it turned on
if gene conversion is turned on, then, in SLiM, the gene_conversion_fraction is passed straight to SLiM and the recombination rate map is multiplied by 1/(1-gene_conversion_fraction) (so that the rate of crossing over is unchanged)
if gene conversion is turned on in msprime then we work out what would be the average rate across the genome of gene conversion under SLiM's scheme (which would be (total_map_length / sequence_length) * (gene_conversion_fraction / (1 - gene_conversion_fraction)), I think), and add gene conversion at this rate

bhaller commented 2 years ago

We would still need the currently used gene_conversion_tract_length parameter in bacteria as well as diploids, which we might want to rename and split into two parameters? E.g. bacterial_recombination_tract_length and gene_conversion_tract_length.

I'd vote for leaving them named the same thing, since no species can have both properties, and they're implemented just the same. It'll be less confusing then adding an additional parameter.

I thought they were going to be implemented quite differently? Bacterial recombination would be implemented using addRecombinant() as in section 16.13 in the SLiM manual, gene conversion would be implemented using SLiM's built-in gene conversion facilities. I think of them as being completely different processes; gene conversion is gene conversion, bacterial recombination is horizontal gene transfer. I personally find it a bit confusing for the two to be conflated; they are different biological processes and should be treated as such. My two cents. :->

fbaumdicker commented 2 years ago

I thought they were going to be implemented quite differently?

That is true for forward-in-time simulations, but bacterial recombination has often been mimicked by gene conversion in coalescent-based backward simulations. So it depends a little bit on the user's background whether using the gene_conversion_rate parameter for bacterial recombination will be confusing or not.

fbaumdicker commented 2 years ago

Is your idea to make things ready for the future in which recombination maps also estimate GC parameters/maps? Currently, are there any recombination maps that also have estimates of GC things (like, described in the same publication or something?)? We can always add the property in the future if it turns out to be important - but, is there any situation now where it'd be important to have?

I am not really an expert regarding recombination maps but thought that the current estimates of gene conversion fractions would depend on the assumption of which recombination map is assumed or co-estimated. There are three related publications I know of. One very recent paper by Derek Setter and Konrad Lohse, where they estimate chromosome-specific recombination and gene conversion rates. https://doi.org/10.1093/genetics/iyac100 And one publication from 2007 by S Myers and Gill McVean on a method to jointly estimate crossover and non-crossover rates. https://doi.org/10.1534/genetics.107.078907 There is also a publication estimating two separate maps for crossing over and gene conversion in Drosophila: https://doi.org/10.1371/journal.pgen.1002905

However, I agree that this would be easy to add later, so maybe it is best to focus on the species parameters for now. Users would also always be able to use custom values for the gene conversion fractions.

fbaumdicker commented 2 years ago

Synthesizing the above discussion, to make things concrete (but with @fbaumdicker's suggestions still pending), here's a proposal: ...

I think the proposal is fine but we should probably carefully decide which mechanisms to entangle. Personally, I like the bacterial recombination flag idea and using the recombination rate for the bacterial recombination initiation rate. I am also slightly in favor of using as few parameters as possible which would mean using the gene_conversion_tract_length also for the length of bacterial homologous recombination tracts.

petrelharp commented 2 years ago

I thought they were going to be implemented quite differently? Bacterial recombination would be implemented using addRecombinant() as in section 16.13 in the SLiM manual, gene conversion would be implemented using SLiM's built-in gene conversion facilities. I

Hm: the main difference I see between 16.13 and naively just doing crossoverFraction=1.0 is handling ploidy correctly, which would be real nice but a whole different thing. There's also a different distribution of horizontally exchanged tract lengths (uniform instead of negative binomial).

So - I think that something like 16.13 probably is the right thing to do; however, so as to not snowball changes here, I'd like to just keep it simple, as I describe above - it's not perfect, but SLiM will not be less realistic than msprime, anyhow.

bhaller commented 2 years ago

What do you mean by "handling ploidy correctly"? Are you guys planning to simulate bacteria with a diploid model?

fbaumdicker commented 2 years ago

There's also a different distribution of horizontally exchanged tract lengths (uniform instead of negative binomial).

But this would be super easy to change using a uniformly distributed first breakpoint and a geometrically distributed length to the next breakpoint. By changing only these two lines of code in the manual in 16.14

// draw two distinct locations; redraw if we get a duplicate
do breaks = rdunif(2, max=L-1);
while (breaks[0] == breaks[1]);

The manual even suggests to do so:

The details of the breakpoint generation here might need to be modified in a more realistic model. Here we draw the start and end positions of the transfer region independently, but perhaps it would be better to draw the start location randomly and then draw a transfer length from a geometric distribution or some other distribution

So here it would be possible to make it exactly as msprime, but honestly, I think that the negative binomial will not make a big difference compared to the geometrically distributed tract lengths. So for the moment it might not make a huge difference to simply use crossoverFraction=1.0, but in the long run it might be clever to have separate setups for bacterial_recombination True and False.

petrelharp commented 2 years ago

What do you mean by "handling ploidy correctly"? Are you guys planning to simulate bacteria with a diploid model?

That is, in fact, currently what happens. PRs welcome! Personally, dealing with sex chromosomes correctly (which we currently don't) is higher on my list of things to do.

in the long run it might be clever to have separate setups for bacterial_recombination True and False.

Agreed! And, this will set us up for that.

bhaller commented 2 years ago

What do you mean by "handling ploidy correctly"? Are you guys planning to simulate bacteria with a diploid model?

That is, in fact, currently what happens. PRs welcome! Personally, dealing with sex chromosomes correctly (which we currently don't) is higher on my list of things to do.

Do you mean simulating just the X, or just the Y? Or do you mean simulating the X and the Y and one or more autosomes?

petrelharp commented 2 years ago

Do you mean simulating just the X, or just the Y? Or do you mean simulating the X and the Y and one or more autosomes?

Right now if you ask to simulate the X or the Y (or the Z or W) you just get an autosomal simulation (and a warning saying that it's not right).

bhaller commented 2 years ago

Do you mean simulating just the X, or just the Y? Or do you mean simulating the X and the Y and one or more autosomes?

Right now if you ask to simulate the X or the Y (or the Z or W) you just get an autosomal simulation (and a warning saying that it's not right).

Gotcha. I was wondering because if you want to support multiple-chromosome simulations it would make life easier if you waited for SLiM to add support for that first. :-> But if it's just simulating the X or the Y by itself, that is already supported, as you know, so that would hopefully be easier to add in stdpopsim. :->

petrelharp commented 2 years ago

I'm going to go ahead with this proposal, but let me know if you're not happy with it!

bhaller commented 2 years ago

I'm fine with whatever choice you folks settle on, just giving input. :->

igronau commented 2 years ago

I'm making (final) edits to the adding species paper, and realized after discussion with @fbaumdicker that we need to incorporate the changes you mention here in the manuscript text. Currently, the section on Updates to stdpopsim and the species catalog includes a short paragraph on gene conversion specifying how the parameterization is set (using Franz's original setup). I'm not sure that we want to dive into the detail of the different cases because this will distract readers from the main point, and might also change in the future. Currently, the only concrete example we give is for bacteria (E. Coli). So, I thought that we'd mostly focus on parameterization for bacteria and archea, which is fairly simple -- with the bacterial_recombination flag set and the gene_conversion_tract_length set to the appropriate value. We could also mention that gene conversion may be used in other species together with crossover recombination, and parameterization there is more complex (without diving into the details). That leads to another question: do we have an example in the catalog for a species where we model gene conversion and crossover recombination?

popsim-consortium / stdpopsim

Support for gene conversion in SLiM engine #1332