tskit-dev / msprime

Simulate genealogical trees and genomic sequence data using population genetic models
GNU General Public License v3.0

Msprime slow for long genomes #1651

Open · jeromekelleher opened this issue 3 years ago

jeromekelleher commented 3 years ago

Msprime is much slower than MaCS when simulating long genomes/lots of recombination, as we've seen with the BosTau model in stdpopsim.

There may be some low-hanging fruit in making this faster as this regime hasn't really been optimised for.

gregorgorjanc commented 3 years ago

Here is a comparison of MaCS (as called from AlphaSimR in R) vs msprime (as called from stdpopsim in Python). Both cases implement the MacLeod et al. (2013) Bos taurus (cattle) model - see https://github.com/popsim-consortium/stdpopsim/tree/main/stdpopsim/catalog/BosTau. AlphaSimR simulates 10^8 bp chromosomes with a genetic length of 1 Morgan, so this is a reasonable chromosome. The AlphaSimR call to MaCS can be seen at https://github.com/gaynorr/AlphaSimR/blob/master/R/founderPop.R (lines 150-154) - there is a 10% difference in the most recent Ne between AlphaSimR and stdpopsim, but that should not cause such different timings.

# install.packages("AlphaSimR")
library(AlphaSimR)
nRep = 10
timeDiff = rep(0, times = nRep)
for (Rep in 1:nRep) {
  start = Sys.time()
  out = runMacs(nInd = 5, nChr = 1, species = "CATTLE") # diploid by default
  end = Sys.time()
  timeDiff[Rep] = end - start
  print(timeDiff[Rep])
}
mean(timeDiff)
# 1.970797

and

import stdpopsim
import time
import numpy as np
species = stdpopsim.get_species("BosTau")
model = species.get_demographic_model("HolsteinFriesian_1M13")
contig = species.get_contig("11")
samples = model.get_samples(10)
engine = stdpopsim.get_engine("msprime")
nRep = 1
diffTime = np.zeros(nRep)
for Rep in range(nRep):
    start = time.time()
    ts = engine.simulate(model, contig, samples)
    end = time.time()
    diffTime[Rep] = end - start
    print(diffTime[Rep])

diffTime.mean()
# 503.74193811416626

I am seeing even slower times with dtwf (~3900 sec); smc and smc_prime ran for even longer.

jeromekelleher commented 3 years ago

Just to clarify, you're simulating chr11 in the runMacs command above @gregorgorjanc? (It's not obvious from the nChr=1 bit).

jeromekelleher commented 3 years ago

(Ah, I see what you said above now about a 10^8bp chromosome - my bad)

gregorgorjanc commented 3 years ago

smc_prime: 9979.277400016785 sec; smc still running.

jeromekelleher commented 3 years ago

So, I'm still not totally convinced that we're running the same simulations @gregorgorjanc - do you have any way of getting the number of trees output by MaCS from your simulations? Either that, or maybe a bare MaCS command so I can run it independently? (The number of trees is a good diagnostic for these things in my experience)
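
(For reference, on the msprime side this diagnostic is available directly on the returned tree sequence, e.g. for the stdpopsim run above:)

print(ts.num_trees)  # number of marginal trees along the sequence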

gregorgorjanc commented 3 years ago

@jeromekelleher I do hope I am not comparing apples to oranges. Here is the equivalent MaCS command to what I am running via AlphaSimR::runMacs(..., species="CATTLE"):

./macs 10 1e8 -t 9e-6 -r 3.6e-6 -eN 0.011 1.33 -eN 0.019 2.78 -eN 0.036 3.89 -eN 0.053 11.11 -eN 0.069 16.67 -eN 0.431 22.22 -eN 1.264 27.78 -eN 1.819 38.89 -eN 4.875 77.78 -eN 6.542 111.11 -eN 9.319 188.89 -eN 92.097 688.89 -eN 2592.097 688.89 2>trees.txt 1> sites.txt

This is how I arrived at the scaled mutation and recombination rates (and similarly at the -eN flags):

Ne = 90
mu = 2.5e-8
rho = 1.0e-8
theta = 4 * Ne * mu # 9e-6
r = 4 * Ne * rho # 3.6e-6

The stdpopsim implementation has slightly different values, but all of the same order! When I implemented the cattle demography in runMacs I eyeballed the past Ne changes from MacLeod et al. (2013) - I didn't realise that they have a table in the supplement :(

Let me know if I can provide any additional information!

jeromekelleher commented 3 years ago

Thanks @gregorgorjanc - I'm sure you've done it all right, I'm just surprised there's such a stark difference. My previous experience with MaCS was that it was far slower than msprime --- see this figure: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004842

Maybe this was just in the boring human-ish parameter regime, though.

jeromekelleher commented 3 years ago

[figure: simulator runtime comparison, Fig 3 of the msprime paper linked above]

gregorgorjanc commented 3 years ago

Yeah, I had exactly this picture in mind when I decided to go with msprime/stdpopsim! I wonder if ag demography (decline in Ne over time) is favouring MaCS in any way?

gregorgorjanc commented 3 years ago

Here are some additional simple tests comparing MaCS and the msprime alpha. I wanted to see where msprime would be faster (I had the above figure in mind), but in the cases below it was always slower. Maybe because I am only sampling 10 haplotypes?

Where can I read about the new way to define demographic changes in msprime (so I can implement Ne changes in the past comparable to what I have done below with the MaCS/ms -eN flags)?

Ne = 100
mu = 1.0e-8
rho = 1.0e-8
theta = 4 * Ne * mu # 4e-6
r = 4 * Ne * rho # 4e-6

time ./macs 10 1e6 -t 4e-6 -r 4e-6 # 0.009
time ./macs 10 1e6 -t 4e-5 -r 4e-5 # 0.013
time ./macs 10 1e6 -t 4e-4 -r 4e-4 # 0.019
time ./macs 10 1e6 -t 4e-3 -r 4e-3 # 0.135

time ./macs 10 1e7 -t 4e-6 -r 4e-6 # 0.011
time ./macs 10 1e7 -t 4e-5 -r 4e-5 # 0.022
time ./macs 10 1e7 -t 4e-4 -r 4e-4 # 0.117
time ./macs 10 1e7 -t 4e-3 -r 4e-3 # 1.202

time ./macs 10 1e7 -t 4e-6 -r 4e-6 -eN 0.025 10 # 0.024
time ./macs 10 1e7 -t 4e-6 -r 4e-6 -eN 0.025 10 -eN 0.5 20 # 0.035
time ./macs 10 1e7 -t 4e-6 -r 4e-6 -eN 0.025 10 -eN 0.5 20 -eN 1 30 # 0.040
time ./macs 10 1e7 -t 4e-6 -r 4e-6 -eN 0.025 10 -eN 0.5 20 -eN 1 30 -eN 5 100 # 0.103
time ./macs 10 1e7 -t 4e-6 -r 4e-6 -eN 0.025 10 -eN 0.5 20 -eN 1 30 -eN 5 100 -eN 100 1000 # 0.201

The -eN flags in the last command are:

-eN 0.025 10 -eN 0.5 20 -eN 1 30 -eN 5 100 -eN 100 1000

where -eN t x sets:
t = time in the past / (4 * Ne)
x = scaling factor applied to the base Ne

so the times above correspond to:
10 / (4 * 100) = 0.025
200 / (4 * 100) = 0.5
400 / (4 * 100) = 1
2000 / (4 * 100) = 5
40000 / (4 * 100) = 100
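
In code, that conversion is simply (a quick sketch, assuming the base Ne = 100 used above):

# Convert times in generations to MaCS/ms -eN scaled times, t / (4 * Ne)
Ne = 100
for gens in [10, 200, 400, 2000, 40000]:
    print(gens, "generations ->", gens / (4 * Ne))
# 10 -> 0.025, 200 -> 0.5, 400 -> 1.0, 2000 -> 5.0, 40000 -> 100.0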

import msprime
import time

# 10 diploid samples, r = 1e-8, over a grid of sequence lengths and Ne
for sequence_length in [10e6, 10e7]:
    for population_size in [100, 1000, 10000, 100000]:
        start = time.time()
        ts = msprime.sim_ancestry(
            samples=10,
            recombination_rate=1e-8,
            sequence_length=sequence_length,
            population_size=population_size,
        )
        print(sequence_length, population_size, time.time() - start)

# Times (sec):
#                     Ne=100   Ne=1000   Ne=10000   Ne=100000
# length=10e6 (1e7)     0.03      0.02       0.38          26
# length=10e7 (1e8)     0.02      0.42         29        7654

jeromekelleher commented 3 years ago

Apologies for the delay in getting back to you on this @gregorgorjanc!

It looks like bad news, I'm afraid: this is actually a case where the SMC can be much faster than msprime's implementation. Let's take a look using some debugging tools:

import msprime
import daiquiri

daiquiri.setup(level="DEBUG")

sim = msprime.ancestry._parse_sim_ancestry(
    samples=10,
    recombination_rate=1e-8,
    sequence_length=10e7,
    population_size=100000,
    end_time=5000,
)

sim.run(event_chunk=1_000_000)

Giving

2021-06-04 14:24:11,271 [178477] INFO     msprime.ancestry: Sampling 10 individuals with ploidy 2 in population 0 (name='pop_0') at time 0
2021-06-04 14:24:11,274 [178477] INFO     msprime.ancestry: model[0] {'name': 'hudson'} started at time=0 nodes=20 edges=0
2021-06-04 14:24:11,274 [178477] INFO     msprime.ancestry: Running model {'name': 'hudson'} until max time: 5000.000000
2021-06-04 14:24:11,992 [178477] DEBUG    msprime.ancestry: time=1567.76 ancestors=16546 ret=ExitReason.MAX_EVENTS
2021-06-04 14:24:12,759 [178477] DEBUG    msprime.ancestry: time=2191.24 ancestors=19026 ret=ExitReason.MAX_EVENTS
2021-06-04 14:24:13,558 [178477] DEBUG    msprime.ancestry: time=2702.77 ancestors=20480 ret=ExitReason.MAX_EVENTS
2021-06-04 14:24:14,384 [178477] DEBUG    msprime.ancestry: time=3161.93 ancestors=21272 ret=ExitReason.MAX_EVENTS
2021-06-04 14:24:15,239 [178477] DEBUG    msprime.ancestry: time=3589.22 ancestors=21982 ret=ExitReason.MAX_EVENTS
2021-06-04 14:24:16,099 [178477] DEBUG    msprime.ancestry: time=3993.26 ancestors=22542 ret=ExitReason.MAX_EVENTS
2021-06-04 14:24:16,976 [178477] DEBUG    msprime.ancestry: time=4379.21 ancestors=23078 ret=ExitReason.MAX_EVENTS
2021-06-04 14:24:17,876 [178477] DEBUG    msprime.ancestry: time=4753.34 ancestors=23198 ret=ExitReason.MAX_EVENTS
2021-06-04 14:24:18,506 [178477] DEBUG    msprime.ancestry: time=5000 ancestors=23222 ret=ExitReason.MAX_TIME
2021-06-04 14:24:18,506 [178477] DEBUG    msprime.ancestry: Skipping remaining 0 models
2021-06-04 14:24:18,523 [178477] INFO     msprime.ancestry: Completed at time=5000 nodes=38441 edges=139253

This is a big simulation. Each of these log lines corresponds to one million events, which take about a second each to process. I've only simulated back 5000 generations here, but we can see that it's going to take a lot more simulating to complete the full coalescent.
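
To get a sense of how far 5000 generations is from full coalescence (a rough back-of-envelope; with recombination, the oldest marginal tree takes even longer to coalesce):

# Expected TMRCA for n lineages is roughly 4 * Ne * (1 - 1/n) generations
Ne = 100_000
n = 20  # 10 diploid samples
print(4 * Ne * (1 - 1 / n))  # ~380,000 generations, versus the 5000 simulated here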

OK, so let's try the SMC:

import msprime
import daiquiri

daiquiri.setup(level="DEBUG")

sim = msprime.ancestry._parse_sim_ancestry(
    samples=10,
    recombination_rate=1e-8,
    sequence_length=10e7,
    population_size=100000,
    end_time=5000,
    model="smc",
)

sim.run(event_chunk=1_000_000)
print("Rejected CA:", sim.num_rejected_common_ancestor_events)
print("Actual CA:", sim.num_common_ancestor_events)
print("Actual RE:", sim.num_recombination_events)

We also add some more debugging info here to tell us how much of an effect using the SMC has. Msprime's SMC implementation works by rejection-sampling common ancestor events, depending on whether they meet the SMC criteria. We get:

2021-06-04 14:27:19,150 [178637] INFO     msprime.ancestry: Sampling 10 individuals with ploidy 2 in population 0 (name='pop_0') at time 0
2021-06-04 14:27:19,152 [178637] INFO     msprime.ancestry: model[0] {'name': 'smc'} started at time=0 nodes=20 edges=0
2021-06-04 14:27:19,152 [178637] INFO     msprime.ancestry: Running model {'name': 'smc'} until max time: 5000.000000
2021-06-04 14:27:20,131 [178637] DEBUG    msprime.ancestry: time=1504.4 ancestors=27156 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:21,187 [178637] DEBUG    msprime.ancestry: time=1928.22 ancestors=33850 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:22,280 [178637] DEBUG    msprime.ancestry: time=2233.57 ancestors=38336 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:23,399 [178637] DEBUG    msprime.ancestry: time=2481.33 ancestors=41843 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:24,553 [178637] DEBUG    msprime.ancestry: time=2693.6 ancestors=44836 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:25,722 [178637] DEBUG    msprime.ancestry: time=2880.94 ancestors=47432 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:26,915 [178637] DEBUG    msprime.ancestry: time=3049.64 ancestors=49729 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:28,134 [178637] DEBUG    msprime.ancestry: time=3204.35 ancestors=51770 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:29,378 [178637] DEBUG    msprime.ancestry: time=3347.91 ancestors=53618 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:30,692 [178637] DEBUG    msprime.ancestry: time=3482.47 ancestors=55283 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:32,007 [178637] DEBUG    msprime.ancestry: time=3609.35 ancestors=56913 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:33,457 [178637] DEBUG    msprime.ancestry: time=3729.58 ancestors=58308 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:34,798 [178637] DEBUG    msprime.ancestry: time=3844.21 ancestors=59664 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:36,182 [178637] DEBUG    msprime.ancestry: time=3953.7 ancestors=60933 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:37,617 [178637] DEBUG    msprime.ancestry: time=4059.08 ancestors=62152 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:39,056 [178637] DEBUG    msprime.ancestry: time=4160.56 ancestors=63326 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:40,442 [178637] DEBUG    msprime.ancestry: time=4258.33 ancestors=64475 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:41,939 [178637] DEBUG    msprime.ancestry: time=4352.76 ancestors=65566 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:43,402 [178637] DEBUG    msprime.ancestry: time=4444.26 ancestors=66584 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:44,912 [178637] DEBUG    msprime.ancestry: time=4533.23 ancestors=67500 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:46,425 [178637] DEBUG    msprime.ancestry: time=4619.64 ancestors=68512 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:47,963 [178637] DEBUG    msprime.ancestry: time=4703.58 ancestors=69398 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:49,514 [178637] DEBUG    msprime.ancestry: time=4785.48 ancestors=70250 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:51,009 [178637] DEBUG    msprime.ancestry: time=4865.38 ancestors=71091 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:52,457 [178637] DEBUG    msprime.ancestry: time=4943.52 ancestors=71874 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:53,613 [178637] DEBUG    msprime.ancestry: time=5000 ancestors=72453 ret=ExitReason.MAX_TIME
2021-06-04 14:27:53,613 [178637] DEBUG    msprime.ancestry: Skipping remaining 0 models
2021-06-04 14:27:53,629 [178637] INFO     msprime.ancestry: Completed at time=5000 nodes=89294 edges=139522
Rejected CA: 25630488
Actual CA: 16821
Actual RE: 89254

So the SMC model requires far fewer actual events to simulate 5000 generations (~106k accepted events versus roughly 8 million under hudson), but we are rejection-sampling a massive number of candidates to get them.
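
A quick back-of-envelope from the counters above:

rejected_ca = 25_630_488
actual_ca = 16_821
actual_re = 89_254

# Only ~0.07% of candidate common ancestor events are accepted
print(actual_ca / (actual_ca + rejected_ca))  # ~0.00066
# Total accepted events, versus ~8 million events under hudson above
print(actual_ca + actual_re)  # 106,075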

So, unfortunately, I think we'd need a more efficient SMC implementation to make simulating these large models feasible. :frowning_face:

I'm sure this is possible, but it would certainly be significant work.

petrelharp commented 3 years ago

I suppose a lot of the work is because of distant chunks of ancestry coalescing together and then recombining apart again very quickly? We could avoid this by only letting 'nearby' chunks coalesce, thus getting an approximation that'd be kinda like the SMC (no long-range correlations beyond whatever 'nearby' means) and that scales linearly with chromosome length. But it sounds like a lot of fuss to implement.

jeromekelleher commented 3 years ago

We probably have the information needed lying around if we think about it hard enough, but yes, it's not going to be easy.

gregorgorjanc commented 3 years ago

Thanks for looking into this. So there does not seem to be a time-efficient way to simulate whole chromosomes with msprime? How did you then get the figure posted at https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004842#pcbi-1004842-g003 ? I am looking specifically at the left panel, where msprime needed at most 150 sec for 100 megabases = 100 * 10^6 bases = 10^8 bases, roughly an average mammalian chromosome. Reading that figure caption again, I see msprime was run in SMC' mode. I am just confused how your MaCS times were much larger, while my experience is the other way around. Has the SMC' algorithm changed?

jeromekelleher commented 3 years ago

The issue is the population size @gregorgorjanc - we did those simulations with Ne = 10^4, i.e. human-ish. The simulations were the full coalescent, not the SMC. We knew that the full coalescent is quadratic in sequence length, but the coefficient is small enough that it beats the linear growth of MaCS (which has a very large coefficient) in that regime. I didn't take into account how a much larger Ne would affect this, though: it pushes us much further along the quadratic curve. So, my bad really; I should have had another dimension in those plots for Ne.
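
Schematically (a back-of-envelope sketch; the constants a and b here are hypothetical, not measured):

# Work as a function of the scaled recombination rate rho = 4 * Ne * r * L
def full_coalescent_work(rho, a=1.0):
    return a * rho ** 2  # quadratic in rho, small constant

def macs_work(rho, b=1000.0):
    return b * rho  # linear in rho, large constant

# The quadratic overtakes the linear cost once rho > b / a, so a 10x
# larger Ne (hence 10x larger rho) can flip which simulator is faster.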

gregorgorjanc commented 3 years ago

Hmm, but I wanted to simulate Ne=100, albeit with increasing Ne in the past. Still, if I look at the right panel of https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004842#pcbi-1004842-g003 I see that simulating ~half a chromosome for 100K samples at Ne=10^4 takes <100 sec - that looks very doable. Would increasing the past Ne change running times a lot?

Let me run an additional test, but can you help me convert the second MaCS command below into an msprime call?

Thanks!

# MaCS with the base Ne=100
time ./macs 10 1e7 -t 4e-6 -r 4e-6 -eN 0.025 10 # 0.024

# equivalent msprime with the base Ne=100
start = time.time()
ts = msprime.sim_ancestry(samples=10, recombination_rate=1e-8, sequence_length=10e7, population_size=100)
end = time.time()
print(end - start)
# 0.02

# MaCS with the base Ne=100 with increasing Ne in the past
time ./macs 10 1e7 -t 4e-6 -r 4e-6 -eN 0.025 10 -eN 0.5 20 -eN 1 30 -eN 5 100 -eN 100 1000 # 0.201

# equivalent msprime with the base Ne=100 with increasing Ne in the past
# TODO
jeromekelleher commented 3 years ago

I'm not much good at reading these MaCS command lines @gregorgorjanc - when does the population size increase, and to what size? It should be easy to describe using the Demography API:

d = msprime.Demography()
d.add_population(name="cattle", initial_size=100)
d.add_population_parameters_change(time=X, population="cattle", initial_size=new_Ne)
# Check it makes sense:
print(d.debug())
ts = msprime.sim_ancestry(10, demography=d, ...)
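
For the specific -eN flags above, that would look something like this (a sketch, assuming base Ne = 100 and the generation-time conversions worked out earlier in the thread):

import msprime

# Each MaCS "-eN t x" becomes a size change at time t * 4 * Ne generations,
# to a new size of x * Ne (assuming the base Ne = 100)
Ne = 100
demography = msprime.Demography()
demography.add_population(name="cattle", initial_size=Ne)
for t, x in [(0.025, 10), (0.5, 20), (1, 30), (5, 100), (100, 1000)]:
    demography.add_population_parameters_change(
        time=t * 4 * Ne, initial_size=x * Ne, population="cattle"
    )
print(demography.debug())  # sanity-check the epochs
ts = msprime.sim_ancestry(
    samples=10,
    demography=demography,
    recombination_rate=1e-8,
    sequence_length=1e7,
)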

But to answer your question: msprime scales very well with sample size, but not so well with the scaled recombination rate. Sample size really has very little impact on the run time.