tskit-dev / msprime

Simulate genealogical trees and genomic sequence data using population genetic models
GNU General Public License v3.0
170 stars 84 forks source link

issue with replicates using "num_replicates" #2249

Closed yzliu01 closed 6 months ago

yzliu01 commented 6 months ago

Dear GertjanBisschop and all,

I am writing to ask if you can advise me on this problem I got using num_replicates in the following codes. Error I got with mts_5rep = msprime.sim_mutations(ts_5rep, rate=8.9e-8, random_seed=1) is ValueError: First argument must be a TreeSequence instance.. I know there might be a problem in the previous step ts_5rep = msprime.sim_ancestry() but I could not figure it out since it did not report anything with my codes below.

Another question. Here

demography2 = msprime.Demography()
demography2.add_population(name="A", initial_size=10000)
demography2.add_population_parameters_change(time=0.1, growth_rate=0.0007, population="A")
demography2.add_population_parameters_change(time=1000, growth_rate=0, population="A")

How can I set the current time with a growth_rate as the beginning time? I tried time=0 and this did not work. I had to use a non-zero number (0.1 etc.).

Thank you very much in advance. Best,

Used codes

# load program
import tskit
import msprime
import numpy
import random
import math

# two events (positive - recent expansion)
demography2 = msprime.Demography()
demography2.add_population(name="A", initial_size=10000)
demography2.add_population_parameters_change(time=0.1, growth_rate=0.0007, population="A")
demography2.add_population_parameters_change(time=1000, growth_rate=0, population="A")
demography2.debug()

# 5 replicates
r_chrom = 8.9e-8
r_break = math.log(2)
chrom_positions = [0, 1.5e7, 3e7, 4.5e7, 6e7, 7.5e7, 9e7, 10.5e7, 12e7, 13.5e7, 15e7, 16.5e7, 18e7, 19.5e7, 21e7, 22.5e7, 24e7, 25.5e7, 27e7, 28.5e7, 30e7]
map_positions = [
    chrom_positions[0],
    chrom_positions[1],
    chrom_positions[1] + 1,
    chrom_positions[2],
    chrom_positions[2] + 1,
    chrom_positions[3],
    chrom_positions[3] + 1,
    chrom_positions[4],
    chrom_positions[4] + 1,
    chrom_positions[5],
    chrom_positions[5] + 1,
    chrom_positions[6],
    chrom_positions[6] + 1,
    chrom_positions[7],
    chrom_positions[7] + 1,
    chrom_positions[8],
    chrom_positions[8] + 1,
    chrom_positions[9],
    chrom_positions[9] + 1,
    chrom_positions[10],
    chrom_positions[10] + 1,
    chrom_positions[11],
    chrom_positions[11] + 1,
    chrom_positions[12],
    chrom_positions[12] + 1,
    chrom_positions[13],
    chrom_positions[13] + 1,
    chrom_positions[14],
    chrom_positions[14] + 1,
    chrom_positions[15],
    chrom_positions[15] + 1,
    chrom_positions[16],
    chrom_positions[16] + 1,
    chrom_positions[17],
    chrom_positions[17] + 1,
    chrom_positions[18],
    chrom_positions[18] + 1,
    chrom_positions[19],
    chrom_positions[19] + 1,
    chrom_positions[20]
]
rates = [r_chrom, r_break, r_chrom, r_break, r_chrom, r_break, r_chrom, r_break, r_chrom, r_break, r_chrom, r_break, r_chrom, r_break, r_chrom, r_break, r_chrom, r_break, r_chrom, r_break, r_chrom, r_break, r_chrom, r_break, r_chrom, r_break, r_chrom, r_break, r_chrom, r_break, r_chrom, r_break, r_chrom, r_break, r_chrom, r_break, r_chrom, r_break, r_chrom]
rate_map = msprime.RateMap(position=map_positions, rate=rates)

# set parameters
ts_5rep = msprime.sim_ancestry(
    samples=10,
    #population_size=10000, # specified in "demography.add_population"
    recombination_rate=rate_map,
    start_time=1,
    #sequence_length=1.5e7,
    demography=demography2,
    ploidy=2,
    num_replicates=5,
    model=[
        msprime.DiscreteTimeWrightFisher(duration=20),
        msprime.StandardCoalescent(),
    ],
    random_seed=1234,
)

# generate sim
mts_5rep = msprime.sim_mutations(ts_5rep, rate=8.9e-8, random_seed=1)

Error I got with mts_5rep = msprime.sim_mutations(ts_5rep, rate=8.9e-8, random_seed=1). ValueError: First argument must be a TreeSequence instance.

Other codes after the above steps.

# function
def write_vcf_dip(mts_5rep, output):
    with open(output, "w") as vcf_file:
        mts.write_vcf(vcf_file)

# save vcf file
write_vcf_dip(mts_5rep, "one_pop_msprime_sim_1_rep_20chr_mu1e_re89e.vcf")
molpopgen commented 6 months ago

When using num_replicates, you get an iterator over tree sequences rather than, say, a list of simulated outputs. You must consume the iterator in order to start adding mutations:

import msprime

for ts in msprime.sim_ancestry(10, num_replicates=5):
    mts = msprime.sim_mutations(ts, rate=1e-4)
molpopgen commented 6 months ago

Regarding the question about demography, it is probably easier to write the demographic model in demes format. To get a growth rate at time 0, you write the most recent epoch for a deme with start_size != end_size. There are examples here

yzliu01 commented 6 months ago

Regarding the question about demography, it is probably easier to write the demographic model in demes format. To get a growth rate at time 0, you write the most recent epoch for a deme with start_size != end_size. There are examples here

Hi, many thanks. It's great to know this method. It seems I need to create yaml file? I am not so clear how to incorporate this deme format in my msprime codes. Can you post a simple example?

GertjanBisschop commented 6 months ago

You can convert the demes yaml file like this:

graph = demes.load("model.yaml")
demography = msprime.Demography.from_demes(graph)

(see here) and then run your simulation as before: ts = msprime.sim_ancestry(*, demography=demography)