Open jeromekelleher opened 3 years ago
Here is a comparison of MaCS (as called from AlphaSimR in R) vs msprime (as called from stdpopsim in python). Both cases implement the Macleod et al (2013) Bos Taurus (cattle) model - see https://github.com/popsim-consortium/stdpopsim/tree/main/stdpopsim/catalog/BosTau. AlphaSimR is simulating 10^8 bp chromosomes with 1 Morgan length. So this is a reasonable chromosome. AlphaSimR call to MaCS can be seen at https://github.com/gaynorr/AlphaSimR/blob/master/R/founderPop.R (150-154) - there is 10% difference in the most recent Ne between AlphaSimR and stdpopsim, but that should not cause so different timings.
# install.packages("AlphaSimR")
library(AlphaSimR)
nRep = 10
timeDiff = rep(0, times = nRep)
for (Rep in 1:nRep) {
start = Sys.time()
out = runMacs(nInd = 5, nChr = 1, species = "CATTLE") # diploid by default
end = Sys.time()
timeDiff[Rep] = end - start
print(timeDiff[Rep])
}
mean(timeDiff)
# 1.970797
and
import stdpopsim
import time
import numpy as np
species = stdpopsim.get_species("BosTau")
model = species.get_demographic_model("HolsteinFriesian_1M13")
contig = species.get_contig("11")
samples = model.get_samples(10)
engine = stdpopsim.get_engine("msprime")
nRep = 1
diffTime = np.zeros(nRep)
for Rep in range(nRep):
start = time.time()
ts = engine.simulate(model, contig, samples)
end = time.time()
diffTime[Rep] = end - start
print(diffTime[Rep])
diffTime.mean()
# 503.74193811416626
I am seeing even slower times with dtwf (~3900. smc and smc_prime ran for even longer.
Just to clarify, you're simulating chr11 in the runMacs
command above @gregorgorjanc? (It's not obvious from the nChr=1
bit).
(Ah, I see what you said above now about a 10^8bp chromosome - my bad)
smc_prime 9979.277400016785 sec smc still running
So, I'm still not totally convinced that we're running the same simulations @gregorgorjanc - do you have any way of getting the number of trees output by MaCS from your simulations? Either that, or maybe a bare MaCS command so I can run it independently? (The number of trees is a good diagnostic for these things in my experience)
@jeromekelleher I do hope I am not comparing apples to oranges. Here is the equivalent MaCS command to what I am running via AlphaSimR::runMacs(..., species="CATTLE")
:
./macs 10 1e8 -t 9e-6 -r 3.6e-6 -eN 0.011 1.33 -eN 0.019 2.78 -eN 0.036 3.89 -eN 0.053 11.11 -eN 0.069 16.67 -eN 0.431 22.22 -eN 1.264 27.78 -eN 1.819 38.89 -eN 4.875 77.78 -eN 6.542 111.11 -eN 9.319 188.89 -eN 92.097 688.89 -eN 2592.097 688.89 2>trees.txt 1> sites.txt
This is how I came to scaled mutation and recombination rates and similarly for -eN flags
Ne = 90 mu = 2.5e-8 rho = 1.0e-8 theta = 4 Ne mu # 9e-6 r = 4 Ne rho # 3.6e-6
The stdpopsim implementation has slightly different values, but all of the same order! When I implemented the cattle demography in runMacs I eyeballed Ne change in the past MacLeod et al. (2013) - I didn't realise that they have a table in supplement :(
Let me know if I can provide any additional information!
Thanks @gregorgorjanc - I'm sure you've done it all right, I'm just surprised there's such a stark difference. My previous experience with MaCS was that it was far slower than msprime --- see this figure: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004842
Maybe this was just in the boring human-ish parameter regime, though.
Yeah, I had exactly this picture in mind when I decided to go with msprime/stdpopsim! I wonder if ag demography (decline in Ne over time) is favouring MaCS in any way?
Here are some additional simple tests comparing MaCS and msprime alpha. I wanted to see where msprime would be faster (I have the above picture in mind), but in the below cases, it was always slower. Maybe because I am only sampling 10 haplotypes?
Where can I read on the new ways to define demography changes in msprime (so I can implement the comparable Ne changes in the past as I have done below with MaCS/ms -eN flags)?
Ne = 100
mu = 1.0e-8
rho = 1.0e-8
theta = 4 * Ne * mu # 4e-6
r = 4 * Ne * rho # 4e-6
time ./macs 10 1e6 -t 4e-6 -r 4e-6 # 0.009
time ./macs 10 1e6 -t 4e-5 -r 4e-5 # 0.013
time ./macs 10 1e6 -t 4e-4 -r 4e-4 # 0.019
time ./macs 10 1e6 -t 4e-3 -r 4e-3 # 0.135
time ./macs 10 1e7 -t 4e-6 -r 4e-6 # 0.011
time ./macs 10 1e7 -t 4e-5 -r 4e-5 # 0.022
time ./macs 10 1e7 -t 4e-4 -r 4e-4 # 0.117
time ./macs 10 1e7 -t 4e-3 -r 4e-3 # 1.202
time ./macs 10 1e7 -t 4e-6 -r 4e-6 -eN 0.025 10 # 0.024
time ./macs 10 1e7 -t 4e-6 -r 4e-6 -eN 0.025 10 -eN 0.5 20 # 0.035
time ./macs 10 1e7 -t 4e-6 -r 4e-6 -eN 0.025 10 -eN 0.5 20 -eN 1 30 # 0.040
time ./macs 10 1e7 -t 4e-6 -r 4e-6 -eN 0.025 10 -eN 0.5 20 -eN 1 30 -eN 5 100 # 0.103
time ./macs 10 1e7 -t 4e-6 -r 4e-6 -eN 0.025 10 -eN 0.5 20 -eN 1 30 -eN 5 100 -eN 100 1000 # 0.201
-eN 0.025 10 -eN 0.5 20 -eN 1 30 -eN 5 100 -eN 100 1000
-eN t x
t = time in the past / 4Ne
x = scale change in Ne
10/(4*100)=0.025
200/(4*100)=0.5
400/(4*100)=1
2000/(4*100)=5
40000/(4*100)=100
import msprime
import time
start = time.time()
ts = msprime.sim_ancestry(samples=10, recombination_rate=1e-8, sequence_length=10e6, population_size=100)
end = time.time()
print(end - start)
# 0.03
start = time.time()
ts = msprime.sim_ancestry(samples=10, recombination_rate=1e-8, sequence_length=10e6, population_size=1000)
end = time.time()
print(end - start)
# 0.02
start = time.time()
ts = msprime.sim_ancestry(samples=10, recombination_rate=1e-8, sequence_length=10e6, population_size=10000)
end = time.time()
print(end - start)
# 0.38
start = time.time()
ts = msprime.sim_ancestry(samples=10, recombination_rate=1e-8, sequence_length=10e6, population_size=100000)
end = time.time()
print(end - start)
# 26
start = time.time()
ts = msprime.sim_ancestry(samples=10, recombination_rate=1e-8, sequence_length=10e7, population_size=100)
end = time.time()
print(end - start)
# 0.02
start = time.time()
ts = msprime.sim_ancestry(samples=10, recombination_rate=1e-8, sequence_length=10e7, population_size=1000)
end = time.time()
print(end - start)
# 0.42
start = time.time()
ts = msprime.sim_ancestry(samples=10, recombination_rate=1e-8, sequence_length=10e7, population_size=10000)
end = time.time()
print(end - start)
# 29
start = time.time()
ts = msprime.sim_ancestry(samples=10, recombination_rate=1e-8, sequence_length=10e7, population_size=100000)
end = time.time()
print(end - start)
# 7654
Apologies for the delay in getting back to you on this @gregorgorjanc!
It looks like bad news I'm afraid, this is actually a case where the SMC can be much faster than msprime's implementation. Let's take a look using some debugging tools
import daiquiri
daiquiri.setup(level="DEBUG")
sim = msprime.ancestry._parse_sim_ancestry(
samples=10,
recombination_rate=1e-8,
sequence_length=10e7,
population_size=100000,
end_time=5000,
)
sim.run(event_chunk=1_000_000)
Giving
2021-06-04 14:24:11,271 [178477] INFO msprime.ancestry: Sampling 10 individuals with ploidy 2 in population 0 (name='pop_0') at time 0
2021-06-04 14:24:11,274 [178477] INFO msprime.ancestry: model[0] {'name': 'hudson'} started at time=0 nodes=20 edges=0
2021-06-04 14:24:11,274 [178477] INFO msprime.ancestry: Running model {'name': 'hudson'} until max time: 5000.000000
2021-06-04 14:24:11,992 [178477] DEBUG msprime.ancestry: time=1567.76 ancestors=16546 ret=ExitReason.MAX_EVENTS
2021-06-04 14:24:12,759 [178477] DEBUG msprime.ancestry: time=2191.24 ancestors=19026 ret=ExitReason.MAX_EVENTS
2021-06-04 14:24:13,558 [178477] DEBUG msprime.ancestry: time=2702.77 ancestors=20480 ret=ExitReason.MAX_EVENTS
2021-06-04 14:24:14,384 [178477] DEBUG msprime.ancestry: time=3161.93 ancestors=21272 ret=ExitReason.MAX_EVENTS
2021-06-04 14:24:15,239 [178477] DEBUG msprime.ancestry: time=3589.22 ancestors=21982 ret=ExitReason.MAX_EVENTS
2021-06-04 14:24:16,099 [178477] DEBUG msprime.ancestry: time=3993.26 ancestors=22542 ret=ExitReason.MAX_EVENTS
2021-06-04 14:24:16,976 [178477] DEBUG msprime.ancestry: time=4379.21 ancestors=23078 ret=ExitReason.MAX_EVENTS
2021-06-04 14:24:17,876 [178477] DEBUG msprime.ancestry: time=4753.34 ancestors=23198 ret=ExitReason.MAX_EVENTS
2021-06-04 14:24:18,506 [178477] DEBUG msprime.ancestry: time=5000 ancestors=23222 ret=ExitReason.MAX_TIME
2021-06-04 14:24:18,506 [178477] DEBUG msprime.ancestry: Skipping remaining 0 models
2021-06-04 14:24:18,523 [178477] INFO msprime.ancestry: Completed at time=5000 nodes=38441 edges=139253
This is a big simulation. Each one of these log lines corresponds to a million events, which takes about a second each. I've only simulated back 5000 generations here, but we can see that it's going to take a lot of simulating to do this for the full coalescent.
OK, so let's try the SMC:
import msprime
import daiquiri
daiquiri.setup(level="DEBUG")
sim = msprime.ancestry._parse_sim_ancestry(
samples=10,
recombination_rate=1e-8,
sequence_length=10e7,
population_size=100000,
end_time=5000,
model="smc",
)
sim.run(event_chunk=1_000_000)
print("Rejected CA:", sim.num_rejected_common_ancestor_events)
print("Actual CA:", sim.num_common_ancestor_events)
print("Actual RE:", sim.num_recombination_events)
We also add some more debugging info here to tell us how much of an effect using the SMC has. Msprime's SMC implementation works by rejection sampling common ancestor events, depending on whether they meet the SMC criteria or not. We get:
2021-06-04 14:27:19,150 [178637] INFO msprime.ancestry: Sampling 10 individuals with ploidy 2 in population 0 (name='pop_0') at time 0
2021-06-04 14:27:19,152 [178637] INFO msprime.ancestry: model[0] {'name': 'smc'} started at time=0 nodes=20 edges=0
2021-06-04 14:27:19,152 [178637] INFO msprime.ancestry: Running model {'name': 'smc'} until max time: 5000.000000
2021-06-04 14:27:20,131 [178637] DEBUG msprime.ancestry: time=1504.4 ancestors=27156 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:21,187 [178637] DEBUG msprime.ancestry: time=1928.22 ancestors=33850 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:22,280 [178637] DEBUG msprime.ancestry: time=2233.57 ancestors=38336 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:23,399 [178637] DEBUG msprime.ancestry: time=2481.33 ancestors=41843 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:24,553 [178637] DEBUG msprime.ancestry: time=2693.6 ancestors=44836 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:25,722 [178637] DEBUG msprime.ancestry: time=2880.94 ancestors=47432 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:26,915 [178637] DEBUG msprime.ancestry: time=3049.64 ancestors=49729 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:28,134 [178637] DEBUG msprime.ancestry: time=3204.35 ancestors=51770 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:29,378 [178637] DEBUG msprime.ancestry: time=3347.91 ancestors=53618 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:30,692 [178637] DEBUG msprime.ancestry: time=3482.47 ancestors=55283 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:32,007 [178637] DEBUG msprime.ancestry: time=3609.35 ancestors=56913 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:33,457 [178637] DEBUG msprime.ancestry: time=3729.58 ancestors=58308 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:34,798 [178637] DEBUG msprime.ancestry: time=3844.21 ancestors=59664 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:36,182 [178637] DEBUG msprime.ancestry: time=3953.7 ancestors=60933 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:37,617 [178637] DEBUG msprime.ancestry: time=4059.08 ancestors=62152 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:39,056 [178637] DEBUG msprime.ancestry: time=4160.56 ancestors=63326 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:40,442 [178637] DEBUG msprime.ancestry: time=4258.33 ancestors=64475 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:41,939 [178637] DEBUG msprime.ancestry: time=4352.76 ancestors=65566 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:43,402 [178637] DEBUG msprime.ancestry: time=4444.26 ancestors=66584 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:44,912 [178637] DEBUG msprime.ancestry: time=4533.23 ancestors=67500 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:46,425 [178637] DEBUG msprime.ancestry: time=4619.64 ancestors=68512 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:47,963 [178637] DEBUG msprime.ancestry: time=4703.58 ancestors=69398 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:49,514 [178637] DEBUG msprime.ancestry: time=4785.48 ancestors=70250 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:51,009 [178637] DEBUG msprime.ancestry: time=4865.38 ancestors=71091 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:52,457 [178637] DEBUG msprime.ancestry: time=4943.52 ancestors=71874 ret=ExitReason.MAX_EVENTS
2021-06-04 14:27:53,613 [178637] DEBUG msprime.ancestry: time=5000 ancestors=72453 ret=ExitReason.MAX_TIME
2021-06-04 14:27:53,613 [178637] DEBUG msprime.ancestry: Skipping remaining 0 models
2021-06-04 14:27:53,629 [178637] INFO msprime.ancestry: Completed at time=5000 nodes=89294 edges=139522
Rejected CA: 25630488
Actual CA: 16821
Actual RE: 89254
So the SMC model requires a lot less work to simulate 5000 generations but we are rejection sampling a massive number of events.
So, unfortunately, I think we'd need a more efficient SMC implementation to make simulating these large models feasible. :frowning_face:
I'm sure this is possible, but it would certainly be significant work.
I suppose a lot of the work is because of distant chunks of ancestry coalescing together and then recombining apart again very quickly? We could avoid this by only letting 'nearby' chunks coalesce; thus getting an approximation that'd be kinda like SMC (no long-range correlations beyond whatever 'nearby' means), and scales linearly with chromosome length. But, it sounds like a lot of fuss to implement.
We probably have the information needed lying around if we think about it hard enough, but yes, it's not going to be easy.
Thanks for looking into this. So, there does not seem to be a useful (time-efficient) way to simulate whole-chromosomes with msprime? How did you then get the image posted at https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004842#pcbi-1004842-g003 - I am specifically looking at the left part where msprime at most needed 150sec for 100 Megabases = 100 * 10^6 bases = 10^8 bases ~ average mammalian chromosome? Reading that figure caption again I see msprime was run in SMC' mode. I am just confused how your MaCS times were much larger, while my experience is the other way around. Has SMC' algo changed?
The issue is the population size @gregorgorjanc - we did those simulations with Ne=10^4, ~humans. The simulations are the full coalescent, not the SMC. We knew that the full coalescent is quadratic in sequence length, but the coefficient is small enough to be smaller than the linear growth of MaCS (which has very large coefficients). I didn't take into account how having a much larger Ne would affect this though, which pushes us much further along the quadratic curve. So - my bad really, I should have had another dimension in those plots for Ne.
Hmm, but I wanted to simulate Ne=100, albeit with increasing past Ne. Still, If I look at the right panel of https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004842#pcbi-1004842-g003 I see that simulating ~half a chromosome for 100K samples at Ne=10^4 is <100 sec - looks very doable. Would increasing the past Ne change running times a lot?
Let me run an additional test, but can you help me convert the second MaCS below into the msprime call?
Thanks!
# MaCS with the base Ne=100
time ./macs 10 1e7 -t 4e-6 -r 4e-6 -eN 0.025 10 # 0.024
# equivalent msprime with the base Ne=100
start = time.time()
ts = msprime.sim_ancestry(samples=10, recombination_rate=1e-8, sequence_length=10e7, population_size=100)
end = time.time()
print(end - start)
# 0.02
# MaCS with the base Ne=100 with increasing Ne in the past
time ./macs 10 1e7 -t 4e-6 -r 4e-6 -eN 0.025 10 -eN 0.5 20 -eN 1 30 -eN 5 100 -eN 100 1000 # 0.201
# equivalent msprime with the base Ne=100 with increasing Ne in the past
# TODO
I'm not much good at reading these Macs command lines @gregorgorjanc, when does the population size increase and to what size? Should be easy to describe using the Demography API?
d = msprime.Demopaphy()
d.add_population("cattle", initial_size=100)
d.add_population_parameters_change(time=X, population="cattle", initial_size=new_Ne)
# Check it makes sense:
print(d.debug())
ts = msprime.sim_ancestry(10, demography=d,...)
But to answer your question, msprime scales very well with sample size, but not so well with scaled recombination rate. Sample size really has very little impact on the run time.
Msprime is much slower than MaCS when simulating long genomes/lots of recombination, as we've seen with the BosTau model in stdpopsim.
There may be some low-hanging fruit in making this faster as this regime hasn't really been optimised for.