Comparison with bacterial simulators

jeromekelleher commented 3 years ago

We would like to compare the performance of running msprime with gene conversion with some specialised bacterial simulators. See the original thread for details:

Originally posted by @fbaumdicker in https://github.com/tskit-dev/msprime-1.0-paper/issues/4#issuecomment-713574381

@fbaumdicker, could you sketch out what would be needed here please? We're looking for a single figure here, that won't take too long to run or involve too much work.

fbaumdicker commented 3 years ago

We need to make a few decisions here:

Do we compare only with other backward (coalescent) simulators? Or do we also include forward simulation tools?

If we limit ourselves to backward simulators, I think a comparison with ms and SimBac will suffice.

Otherwise, we would have to run the BacterialSlimulation tool based on slim, which will presumably take a long time just for the simulations. For the forward simulator CoreSimul I mentioned in #4, I just noticed that it is simulating a different process. Namely, gene conversion/recombination is only allowed between ancestral lineages at the time of the recombination event. So I would exclude it here.

I suggest we create a simple comparison only with ms and SimBac, where we fix all parameters except the sample size. SimBac is easy to install and available here: https://github.com/tbrown91/SimBac

Has someone created a script that we are already using to generate the other running time comparisons with ms? I could change the corresponding parameters in such a script and run the comparison with ms and SimBac (which is also executed from the command line) locally on my PC. Or do we want to simulate all comparisons of the manuscript on the same hardware? In that case, I could still adapt the script.

The figure would then plot the running time of msprime, ms and SimBac vs. sample size.

Other parameters could be chosen based on E coli? That would mean 4.6 mps = 4.6 10^6 positions and a per site mutation and gene conversion initiation rate of \mu N_e = \rho * N_e = 0.01602 and a mean tract length of 542 bp.

For simplicity, we could also just use 5 mps as genome length, a gene conversion and mutation rate of 0.015 and a mean tract length of 500.

For SimBac we have to decide which output we want to generate (where each output presumably takes some extra time) SimBac can output:

The clonal tree in nwk format.
All local trees in nwk format.
A fasta file of the generated sequences
A DOT file of the ARG
A log file of recombination break points and origins and recipient lineages of the recombinant material (Only recommended for small ARGs)

Depending on the choice of output the running time differs significantly. So the question is which output do we generate with SimBac and to which output of msprime (and also ms) do we compare this? How do we compare ms and msprime, in the other scenarios?

jeromekelleher commented 3 years ago

Thanks @fbaumdicker. I agree we should keep it simple and stick to backwards time simulations. It's not clear to me whether it's worth the bother of comparing with ms, since we pretty much know we'll be faster and it adds a lot of complexity (probably ms will be so slow it won't run in feasible time for interesting cases on other two).

So, I'd just compare with SimBac. In terms of output, I would just output the clonal tree probably, as that's likely the fastest. It's extremely difficult to do a fair comparison, so let's skew things as much in favour of SimBac as we can.

fbaumdicker commented 3 years ago

Yes, great. I will try to do this @jeromekelleher. Is there any script I can use as a template for the comparison?

Should I just use this one https://github.com/tskit-dev/msprime-1.0-paper/blob/8ca2874ae6f0d5e25f9bb1d568b89d8297389cd7/evaluation/generate_performance_data.py from @daniel-goldstein ?

jeromekelleher commented 3 years ago

I think that's a good place to start @fbaumdicker. I guess the first thing we'd want to do is to make sure we're really simulating the same thing, so is it worth updating msprime's verification.py to include a simple comparison with SimBac?

fbaumdicker commented 3 years ago

I don't think that is necessary. The main issue will be to find the scaling of parameters that corresponds to the same scenario. However, a comparison with SimBac in the verification.py file will not give us any further insides or checks into our own simulation procedures. It would just verify that we have chosen the correct scaling. Other aspects, like the total number of trees, etc. are already verified by the already existing verifications. I think it suffices to add a short check for the correct scaling into the generate_performance_data.py script.

fbaumdicker commented 3 years ago

simbac

This is the preliminary version of the graph comparing the performance of SimBac and msprime. The gene conversion rate needs to be four times larger in SImBac than in msprime to match each other. This figure is at the moment generated based on one simulation per data point and took roughly 24h to simulate. If we want to get rid of the random noise I will need to simulate more replicates. However, before doing so I would like to get some feedback regarding the choice of parameters, etc. Should I include other sample sizes or completely different parameters? How much time do I have to simulate more replicates?

Currently, I used: 4.5 mps as genome length, a gene conversion rate of 0.015 (and 0.015/4.0 in msprime) and a mean tract length of 500. sample sizes are 10, 20, 30, ... 100, 200, 300, ..., 1000, 2000, 3000, ..., 10000

fbaumdicker commented 3 years ago

While playing around with SimBac and msprime parameters, I noticed that SimBac is faster for smaller genome sizes for small sample sizes. Maybe this could be another aspect of the discussion in #28 ? The following figure is the same as above but with a genome length of 10000 instead of 4.5 mb and a log scale on the y axis.

small_genome

jeromekelleher commented 3 years ago

This is great, thanks @fbaumdicker ! Re parameters it would be good to simulate parameters for some organism, with the appropriate GC and genome size.

How realistic is a genome of size 10000?

jeromekelleher commented 3 years ago

Also, what are the absolute values of the times in the second plot? I'm guessing that all the cases where SimBac is faster than msprime are very fast anyway, so it's not particularly important (where would a "1 second" horizontal line go in this plot?)

fbaumdicker commented 3 years ago

The absolute values in both plots are given in seconds. So only SimBac with sample sizes 9000 and 10000 takes longer than 1 second in the second plot. The genome size of 10000 is not realistic at all. Only a virus or a plasmid could be that small. I used this just for testing the pipeline etc. However, some people might use such values if they are only interested in a part of the genome.

The parameters of the first plot correspond roughly to the E coli parameters we are using in stdpopsim. I used slightly different parameters to use values that are rounded and close or identical to the default values of SimBac. But parameters are undoubtedly in a realistic range. However, I could also use the exact values from stdpopsim and an additional reference for the gc rate?

jeromekelleher commented 3 years ago

That's excellent, thanks @fbaumdicker. I think it's reasonable to say "we simulated e-coli under the best current known parameters" for both simulators with varying sample size. We don't need to say msprime is faster than SimBac in every possible scenario, we're just illustrating that there's quite a large margin in the performance in this biologically interesting case.

Sound good?

Assuming we're agreed, then I think we need more replicates to get better data. We can reduce the maximum sample size (or, limit the ones we have to run SimBac for by putting a cutoff on the y axis at (say) 1 hour. Shouldn't take more than (say) a day to run the whole thing.

Also, if we're going to show the x-axis on a log-scale, it's good to use np.logspace so that the points are evenly space. I think about 20 or 30 points should be fine.

fbaumdicker commented 3 years ago

That sounds good, @jeromekelleher. I think we can limit the sample size to 1000, there are currently anyway not such large real datasets available.

I have no explicit preference for or against a log-scale at the x-axis. But I think it would be good to use the same scaling as in the other performance charts. Are you going to use a log-scale in the other performance charts?

jeromekelleher commented 3 years ago

Since it's a fairly small sample size anyway, let's stick to a linspace so. I don't think we'll be comparing to anything that can handle large sample sizes, so logspace won't be needed, I'd say.

jeromekelleher commented 3 years ago

@fbaumdicker, #32 added a "gene conversion" section and a placeholder figure. It would be great if you could either add some text to the section or get some data in for the comparison - do you think you'll be able to get to this some time soon?

fbaumdicker commented 3 years ago

Yes, I am putting this on my weekend list.

I have meanwhile simulated a few more datasets for the comparison figure. Here I am using the mean of 5 repetitions for each data point. However, it is still rather noisy for SimBac (also for msprime, but this is not so obvious for this scale). The simulations for sample size up to 1000 take approximately 3 days, if we limit ourselves to 500 it is a bit more than 24 hours.

SimBacVsmsprime_1000 SimBacVSmsprime_500

So it won't be possible to get rid of the noisy pattern without increasing the simulation time significantly. An alternative approach might be to plot the 5 simulated datapoints separately instead. (I will need to resimulate for this). As we are only comparing two tools here, I think the figure would still be clear and easy to interpret. I could e.g. go for sample size up to 1000 and three repetitions at each sample size?

jeromekelleher commented 3 years ago

I hadn't expected the difference to be quite so stark... Maybe we could simulate say 10% of the e coli genome so?

What do others think? I'd err on the side of getting smooth data and running smaller simulations, but it is good to run on the actual parameters for E-coli.

I guess what we could do is run sufficiently small simulations to make running replicates of SimBac practical, and then report in the text that simulating 1000 E-coli samples takes 3 days in SimBac vs x minutes in msprime?

petrelharp commented 3 years ago

I'd say there's no need for lots more computation just to get a smooth curve. It's obvious more or less what's going on. Maybe these plots, but on a log scale? It's more important to show the numbers that people care about than to get a well-balanced plot.

jeromekelleher commented 3 years ago

Yeah, that's fair. I'm not sure about log scales for CPU time plots though - it's nowhere near as visceral, and what can look like small differences on a log scale can be the difference between practical and not.

How about you upload the data you have for the max 500 case @fbaumdicker, and describe the simulations done? There's probably no point in doing anything else, as Peter says, it's pretty clear what's going on.

fbaumdicker commented 3 years ago

As mentioned in #43 I messed up the gene conversion rate and had to double it to match the estimated parameters for E. coli. Now SimBac takes already for a sample size of 50 more than 2 hours while msprime needs approx. 2 minutes. So maybe I will have to change the range of considered sample sizes or leave out SimBac for higher values. I will keep you updated here.

fbaumdicker commented 3 years ago

Another related issue is that I found fastSimBac, which I wasn't aware of yet. fastSimBac is based on MaCS and uses an SMC-like approach to simulate gene conversion faster. Maybe it would be a good idea to add it here. Especially since gene conversion does not suffer from something like "trapped" gene conversions. However, I am unsure which parameters to chose in fastSimBac to match those used in msprime and SimBac so far.

I used the number of different trees in a fast simulated scenario to check the scaling of the gene conversion rate in SimBac and msprime. SimBac uses gamma/2 as its base gene conversion rate and msprime uses a diploid populations size. Thus the number of different trees simulated matches when the parameter in msprime is set to four times the gene conversion rate in SimBac. But for fastSimBac the number of different trees does not match the number of trees in SimBac and msprime, even when I use a doubled or halved gene conversion rate. I know that no one here knows the details of fastSimBac, but maybe @jeromekelleher , @petrelharp , or @hyanwong knows whether there is a similar discrepancy for the number of different trees found for the SMC approach for recombination compared to Hudsons ms/msprime with recombination?

If that is the case I would need another parameter to check when the gene conversion rates are equal. Otherwise, I will have to understand fastSimBac in more detail.

jeromekelleher commented 3 years ago

whether there is a similar discrepancy for the number of different trees found for the SMC approach for recombination compared to Hudsons ms/msprime with recombination?

For the SMC(') the expected number of trees is identical to the full coalescent, but the variance is reduced a litte. These are the plots from msprime's evaluations:

mean var

jeromekelleher commented 3 years ago

You do need to be careful about the distinction between breakpoints and the number of trees, as output by msprime though. Programs like ms will output two adjacent trees even if they are identical (because of an ARG "diamond"), whereas msprime squashes them together. I guess that's pretty unlikely in GC simulations, though.

jeromekelleher commented 3 years ago

Regarding the parameterisation for msprime @fbaumdicker, it would be less confusing if you set ploidy=1 : https://tskit.dev/msprime/docs/latest/ancestry.html#ploidy

jeromekelleher commented 3 years ago

It looks like fastSimBac was published, so we probably should do a comparison (otherwise it'll look like we're making a straw man of SimBac). I find it surprising that the parameterisation of SimBac and fastSimBac would be so different, since they're written by the same person AFAIK.

I guess we could try to unearth the code used for the comparison between SimBac and FastSimBac in the FastSimBac paper (if it exists)?

fbaumdicker commented 3 years ago

Thanks for the plots and clarifications.

I was aware of the breakpoints vs. trees issue and made sure that for the gc parameters I am using the probability for two adjacent identical trees is close to zero. So I would exclude this for the moment.

I did not find a script for the SimBac/fastSimBac comparison in the repo, so I contacted Nicola DeMaio (an author of fastSimBac (and SimBac)), so let's see whether he knows what the issue might be. Maybe I need a burn-in part for the BSMC in fastSimBac?

jeromekelleher commented 3 years ago

Perfect, thanks @fbaumdicker

jeromekelleher commented 3 years ago

A question for you @fbaumdicker - if MaCS supports gene conversion, why did they bother making FastSimBac? Is there a paper where people do the theory for adding GC to the SMC, or is it a trivial change?

jeromekelleher commented 3 years ago

Running the full E-coli simulations @fbaumdicker it seems that fastSimBac is faster than simbac once we get to longer sequences all right, so I think it makes sense to run all three. I'll see how things are running and try to cut the parameter values off at something reasonable.

It's a little bit worrying that fastsimbac is producing systematically fewer trees than msprime or simbac, but all we can do is input the parameters and measure the run time. At least we can be sure that we're not being unfair to fastsimbac, since it's doing less work than msprime, if anything.

How does that sound?

fbaumdicker commented 3 years ago

A question for you @fbaumdicker - if MaCS supports gene conversion, why did they bother making FastSimBac? Is there a paper where people do the theory for adding GC to the SMC, or is it a trivial change?

fastSimBac supports other features that MaCs does not. E.g. fixing the global/clonal genealogy from user input, external recombination (which is only relevant for bacteria), and a crossover recombination rate of zero. In MaCS you can only provide a crossover to gene-conversion ratio.

The MaCS paper describes the rational behind the gc conversion quite nicely. Also the fastSimBac paper has a step by step algorithm in the manuscript. I am not aware of any theoretical work on this.

At the moment I am even more confused by the results from MaCS as compared to fastSimBac. I tried to run a few simulations in MaCS, setting the crossover rate to a small value and the gc ratio to a large value, such that almost only gene conversion should occur. But although using a gene tract length of 1, I wasn't able to create a single ARG with an obvious gc event that spans only one base pair. Instead the trees look likey still only crossover recombinations are happening. Otherwise, I would have suggested to include MaCS to the comparison.

jeromekelleher commented 3 years ago

Thanks @fbaumdicker, that's very helpful.

Let's not bother with MaCS - it's known to have a number of issues (see the SCRM paper, for example).

I think we're good now with comparing SimBac and fastSimBac, I'll update here when I have a draft figure.

fbaumdicker commented 3 years ago

Let's not bother with MaCS - it's known to have a number of issues (see the SCRM paper, for example).

Ah, I didn't know that, thanks @jeromekelleher .

I think we're good now with comparing SimBac and fastSimBac, I'll update here when I have a draft figure.

It's the least wrong approach right now. ;)

jeromekelleher commented 3 years ago

Here's the current version of the GC plot @fbaumdicker - my plan is to run a couple of more points for fastsimbac and to cut the y axis off at 3600 (1 hour). I think that makes the point pretty convincingly, right?

gc-perf

fbaumdicker commented 3 years ago

Yes, indeed. We could also consider using a log scale to make the slow increase of msprime more visible and reduce the diameter of the largest white circle in the figure?

jeromekelleher commented 3 years ago

It would be better to use a log-scale as you say, but this is more concrete. You lose the sense of quite how much faster msprime is on a log scale.

fbaumdicker commented 3 years ago

That is totally true, and probably the strongest argument for the normal scale.

jeromekelleher commented 3 years ago

I think we can close this now, the basic figure is done.

tskit-dev / msprime-1.0-paper

Comparison with bacterial simulators #26