popsim-consortium / stdpopsim

A library of standard population genetic models
GNU General Public License v3.0
121 stars 86 forks source link

No variants are simulated in zero-recombination region of chr22 #1422

Open nspope opened 1 year ago

nspope commented 1 year ago

For example:

import stdpopsim
import numpy as np

# simulate first 20Mb of chr22

for mapname in ["HapMapII_GRCh37", "HapMapII_GRCh38"]:
    sp = stdpopsim.get_species("HomSap")
    chr22 = sp.get_contig("chr22", genetic_map=mapname, right=20e6)
    constant_popsize = stdpopsim.PiecewiseConstantSize(1000)
    breaks = np.linspace(0, chr22.length, 7)
    breaks_mb = np.round(breaks/1e6, 1)

    slim = stdpopsim.get_engine("slim")
    ts_slim = slim.simulate(contig=chr22, demographic_model=constant_popsize, samples={"pop_0" : 10}, slim_scaling_factor=4)
    div_slim = ts_slim.segregating_sites(windows=breaks)
    for left, right, div in zip(breaks_mb[:-1], breaks_mb[1:], div_slim):
        print(f"segsites slim/{mapname} [{left}-{right} Mb]: {div}")

    msp = stdpopsim.get_engine("msprime")
    ts_msp = msp.simulate(contig=chr22, demographic_model=constant_popsize, samples={"pop_0" : 10})
    div_msp = ts_msp.segregating_sites(windows=breaks)
    for left, right, div in zip(breaks_mb[:-1], breaks_mb[1:], div_msp):
        print(f"segsites msp/{mapname} [{left}-{right} Mb]: {div}")

gives

segsites slim/HapMapII_GRCh37 [0.0-3.3 Mb]: 0.0
segsites slim/HapMapII_GRCh37 [3.3-6.7 Mb]: 0.0
segsites slim/HapMapII_GRCh37 [6.7-10.0 Mb]: 0.0
segsites slim/HapMapII_GRCh37 [10.0-13.3 Mb]: 0.0
segsites slim/HapMapII_GRCh37 [13.3-16.7 Mb]: 3.059999999999999e-05
segsites slim/HapMapII_GRCh37 [16.7-20.0 Mb]: 0.00018810000000000007
segsites msp/HapMapII_GRCh37 [0.0-3.3 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [3.3-6.7 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [6.7-10.0 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [10.0-13.3 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [13.3-16.7 Mb]: 3.36e-05
segsites msp/HapMapII_GRCh37 [16.7-20.0 Mb]: 0.00020310000000000008
segsites slim/HapMapII_GRCh38 [0.0-3.3 Mb]: 0.0
segsites slim/HapMapII_GRCh38 [3.3-6.7 Mb]: 0.0
segsites slim/HapMapII_GRCh38 [6.7-10.0 Mb]: 0.0
segsites slim/HapMapII_GRCh38 [10.0-13.3 Mb]: 0.0
segsites slim/HapMapII_GRCh38 [13.3-16.7 Mb]: 8.189999999999998e-05
segsites slim/HapMapII_GRCh38 [16.7-20.0 Mb]: 0.00017460000000000007
segsites msp/HapMapII_GRCh38 [0.0-3.3 Mb]: 0.0
segsites msp/HapMapII_GRCh38 [3.3-6.7 Mb]: 0.0
segsites msp/HapMapII_GRCh38 [6.7-10.0 Mb]: 0.0
segsites msp/HapMapII_GRCh38 [10.0-13.3 Mb]: 0.0
segsites msp/HapMapII_GRCh38 [13.3-16.7 Mb]: 8.789999999999998e-05
segsites msp/HapMapII_GRCh38 [16.7-20.0 Mb]: 0.00019890000000000006
nspope commented 1 year ago

Looks like that's because this region is masked ("blank" trees) in the output tree sequence for both engines.

The hapmap files for both genetic maps start at ~15 Mb, so I assume that to fix this we'd need to add a line at the beginning specifying a 0 rate (so this region is not treated as "missing data")?

nspope commented 1 year ago

related to #1387

mufernando commented 1 year ago

To get to the bottom of this I decided to look at branch stats instead.

With the SLiM engine, we get non-zero segsites for the first few windows when _recap_and_recale=False, but these we go to 0 when _recap_and_rescale=True. With the msprime engine, they are always zero.

>>> mapname="HapMapII_GRCh37"
>>> sp = stdpopsim.get_species("HomSap")
>>> chr22 = sp.get_contig("chr22", genetic_map=mapname, right=20e6)
>>> constant_popsize = stdpopsim.PiecewiseConstantSize(1000)
>>> breaks = np.linspace(0, chr22.length, 7)
>>> breaks_mb = np.round(breaks/1e6, 1)
>>>
>>> eng = stdpopsim.get_engine("slim")
>>> ts_slim = eng.simulate(contig=chr22, demographic_model=constant_popsize, samples={"pop_0" : 10}, slim_scaling_factor=4, _recap_and_rescale=False)
/home/murillor/projects/anaconda3/envs/analysis2/lib/python3.10/site-packages/stdpopsim/slim_engine.py:1523: SLiMScalingFactorWarning: You're using a scaling factor (4). This should give similar results for many situations, but is not equivalent, especially in the presence of selection. When using rescaling, you should be careful---do checks and compare results across different values of the scaling factor.
  warnings.warn(

>>> div_slim = ts_slim.segregating_sites(windows=breaks, mode="branch")
>>> for left, right, div in zip(breaks_mb[:-1], breaks_mb[1:], div_slim):
...     print(f"segsites slim/{mapname} [{left}-{right} Mb]: {div}")
...
segsites slim/HapMapII_GRCh37 [0.0-3.3 Mb]: 8085.0
segsites slim/HapMapII_GRCh37 [3.3-6.7 Mb]: 8085.0
segsites slim/HapMapII_GRCh37 [6.7-10.0 Mb]: 8085.0
segsites slim/HapMapII_GRCh37 [10.0-13.3 Mb]: 8085.0
segsites slim/HapMapII_GRCh37 [13.3-16.7 Mb]: 7944.9128727
segsites slim/HapMapII_GRCh37 [16.7-20.0 Mb]: 7531.1644341
>>> msp = stdpopsim.get_engine("msprime")
>>> ts_msp = msp.simulate(contig=chr22, demographic_model=constant_popsize, samples={"pop_0" : 10})
>>> div_msp = ts_msp.segregating_sites(windows=breaks, mode="branch")
>>> for left, right, div in zip(breaks_mb[:-1], breaks_mb[1:], div_msp):
...     print(f"segsites msp/{mapname} [{left}-{right} Mb]: {div}")
...
segsites msp/HapMapII_GRCh37 [0.0-3.3 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [3.3-6.7 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [6.7-10.0 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [10.0-13.3 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [13.3-16.7 Mb]: 3081.1006886320965
segsites msp/HapMapII_GRCh37 [16.7-20.0 Mb]: 12736.400180715054
>>> mapname="HapMapII_GRCh37"
>>> sp = stdpopsim.get_species("HomSap")
>>> chr22 = sp.get_contig("chr22", genetic_map=mapname, right=20e6)
/home/murillor/projects/anaconda3/envs/analysis2/lib/python3.10/site-packages/stdpopsim/genetic_maps.py:127: UserWarning: Recombination map has length 51229805.0, which is longer than chromosome length 50818468. The latter will be used.
  warnings.warn(
>>> constant_popsize = stdpopsim.PiecewiseConstantSize(1000)
>>> breaks = np.linspace(0, chr22.length, 7)
>>> breaks_mb = np.round(breaks/1e6, 1)
>>>
>>> eng = stdpopsim.get_engine("slim")
>>> ts_slim = eng.simulate(contig=chr22, demographic_model=constant_popsize, samples={"pop_0" : 10}, slim_scaling_factor=4)
/home/murillor/projects/anaconda3/envs/analysis2/lib/python3.10/site-packages/stdpopsim/slim_engine.py:1523: SLiMScalingFactorWarning: You're using a scaling factor (4). This should give similar results for many situations, but is not equivalent, especially in the presence of selection. When using rescaling, you should be careful---do checks and compare results across different values of the scaling factor.
  warnings.warn(
/home/murillor/projects/anaconda3/envs/analysis2/lib/python3.10/site-packages/msprime/ancestry.py:831: TimeUnitsMismatchWarning: The initial_state has time_units=ticks but time is measured in generations in msprime. This may lead to significant discrepancies between the timescales. If you wish to suppress this warning, you can use, e.g., warnings.simplefilter('ignore', msprime.TimeUnitsMismatchWarning)
  warnings.warn(message, TimeUnitsMismatchWarning)
>>> div_slim = ts_slim.segregating_sites(windows=breaks, mode="branch")
>>> for left, right, div in zip(breaks_mb[:-1], breaks_mb[1:], div_slim):
...     print(f"segsites slim/{mapname} [{left}-{right} Mb]: {div}")
...
segsites slim/HapMapII_GRCh37 [0.0-3.3 Mb]: 0.0
segsites slim/HapMapII_GRCh37 [3.3-6.7 Mb]: 0.0
segsites slim/HapMapII_GRCh37 [6.7-10.0 Mb]: 0.0
segsites slim/HapMapII_GRCh37 [10.0-13.3 Mb]: 0.0
segsites slim/HapMapII_GRCh37 [13.3-16.7 Mb]: 2195.502231990335
segsites slim/HapMapII_GRCh37 [16.7-20.0 Mb]: 13879.990809453386
>>> msp = stdpopsim.get_engine("msprime")
>>> ts_msp = msp.simulate(contig=chr22, demographic_model=constant_popsize, samples={"pop_0" : 10})
>>> div_msp = ts_msp.segregating_sites(windows=breaks, mode="branch")
>>> for left, right, div in zip(breaks_mb[:-1], breaks_mb[1:], div_msp):
...     print(f"segsites msp/{mapname} [{left}-{right} Mb]: {div}")
...
segsites msp/HapMapII_GRCh37 [0.0-3.3 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [3.3-6.7 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [6.7-10.0 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [10.0-13.3 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [13.3-16.7 Mb]: 2172.3004433208343
segsites msp/HapMapII_GRCh37 [16.7-20.0 Mb]: 14758.390178399126
mufernando commented 1 year ago

So it looks like msprime is at fault here --- it is removing the trees where rec rate is 0.

nspope commented 1 year ago

wouldn't it be where the rec rate is nan, not 0?

mufernando commented 1 year ago

oops, yes, that is what I meant!

nspope commented 1 year ago

Missing data in RateMap is documented here ... I think the issue is with our chr22 maps (which should explicitly mark the recombination rate as 0).

jeromekelleher commented 1 year ago

Trimming out the trees with missing data on the flanks is definitely by design, as we've had a number of cases where these large zero-recombination rate bits flanking a chromosome have really messed up statistics.

Do we really know that the start of chr22 has a recombination rate of 0, or is it just that there's not much data there because of sequence weirdness?

nspope commented 1 year ago

Weirdness is probably the answer @jeromekelleher -- looks like the short arm of chr22 (0 to ~15Mb) contains many repetitive regions that are homologous to (and may recombine with) parts of other chromosomes.

One thing that confuses me with these maps is that they invariably contain missing data at the start, but 0 recombination rate regions at the end. For example,

import stdpopsim

homsap = stdpopsim.get_species("HomSap")
chr22 = homsap.get_contig("chr22", genetic_map="HapMapII_GRCh38")
engine = stdpopsim.get_engine("msprime")
model = stdpopsim.PiecewiseConstantSize(1000)
sim = engine.simulate(contig=chr22, demographic_model=model, samples={"pop_0":2})

print(chr22.length)
# 50818468.0

print(chr22.recombination_map.asdict())
# {'position': array([       0., 15287921., 15295878., ..., 50785209., 50791377., 50818468.]), 
#  'rate': array([        nan, 6.74364e-09, 4.31139e-09, ..., 6.48707e-09, 6.55239e-09, 0.00000e+00])}

print([sim.first().num_edges, sim.last().num_edges])
# [0, 6]

print(list(sim.breakpoints())[-5:])
# [50714355.0, 50756801.0, 50760002.0, 50779378.0, 50818468.0]

so, the last bin (from 50791377 to 50818468) is not masked but instead simulated with a zero recombination rate, which results in a single tree that spans from 50779378 to the chromosome end in this simulation. In contrast the first bin is missing so contains "empty" trees in the output.

All the chromosomes are like this-- just seems odd to me that the chromosome start/end aren't treated the same?

petrelharp commented 1 year ago

That does sound like we're not treating the ends consistently, and should mask the ends as well. A topic for the next stdpopsim meeting next week?

jeromekelleher commented 1 year ago

Agreed, sounds like the ends aren't being handled properly.

nspope commented 1 year ago

Here is where the problem lies:

https://github.com/popsim-consortium/stdpopsim/blob/a229273d4c30f08dba6ffb003cb2dbe023a7132f/stdpopsim/genetic_maps.py#L120-L123

jeromekelleher commented 1 year ago

That would be it - should just be np.nan I guess.

nspope commented 1 year ago

A small complication-- if the SLiM engine is used, then mutations w/ fitness effects may be placed on masked intervals (which seem to be simulated as having 0-recombination rate). Though these mutations will ultimately be removed by _recap_and_rescale, they could influence the genealogies in unmasked regions.

import stdpopsim                                                                                                                                                                              

homsap = stdpopsim.get_species("HomSap")
chr1_left_edge = homsap.get_contig("chr1", genetic_map="HapMapII_GRCh38", right=100000)

# deleterious DFE over entire contig
mt = stdpopsim.MutationType(
    distribution_type="f",
    dominance_coeff=1.0,
    distribution_args=[-0.01],
)
dfe = stdpopsim.DFE(
    id="new_mutation",
    mutation_types=[mt],
    proportions=[1.0],
    description="",
    long_description="",
)
chr1_left_edge.add_dfe(intervals=[[0, chr1_left_edge.length]], DFE=dfe)

engine = stdpopsim.get_engine("slim")
model = stdpopsim.PiecewiseConstantSize(1000)
sim = engine.simulate(
    contig=chr1_left_edge, 
    demographic_model=model, 
    samples={"pop_0":2}, 
    slim_scaling_factor=4, 
    _recap_and_rescale=False, 
    seed=1024,
)

print(chr1_left_edge.recombination_map.asdict()) # First 55Kb masked in recmap
# {'position': array([     0.,  55550.,  82571.,  88169., 100000.]), 
#  'rate': array([         nan, 2.981822e-08, 2.082414e-08, 2.484956e-08])}

print(sim.first().num_edges) # Without recap/rescale, there's a tree in masked region
# 913

print([i for i in sim.breakpoints()][:5]) # Looks like masked region is simulated as nonrecombining                                                                                                                                
# [0.0, 55826.0, 55850.0, 56041.0, 58092.0]                     

print(sim.first().num_mutations) # ... and there are deleterious mutations on that tree
# 5
print([v.position for v in sim.variants()])
# [19521.0, 19742.0, 22695.0, 36215.0, 50202.0, 77121.0, 78131.0, 83261.0] 

The thing to do might be to mask DFE intervals in add_dfe, according to where there's missing data in the recombination map?

andrewkern commented 1 year ago

yes this is a problem @nspope, and one that is pretty context specific. That said, in empirical data masked regions are generally highly repetitive and gene poor (e.g. heterochromatic regions of genomes). If we are masking low recombination regions, those are indeed likely to be gene poor, so perhaps this really is a small issue?

petrelharp commented 1 year ago

We should really be consistent here - is "recombination rate nan" how we are communicating "this region is no good, don't simulate anything here"? If so, we should propagate that choice through to DFEs as well (so, perhaps simulate would have a step where it masks any DFEs by the bad regions).

nspope commented 1 year ago

Yeah I agree. Calling it "masked regions" or "missing data" was putting it the wrong way, as these terms could be conflated with post-simulation masking.

Instead, we don't know the parameter values over an interval, and so can't simulate it. My inclination is to follow msprime:

Fixing #1404 should resolve this issue, I think -- when we extract a chunk of a chromosome, we'll use RateMap.slice to set the flanking part of the rate map to NaN, so that simulation isn't performed over the flanks and the coordinate system matches the original chromosome. The DFE intervals will be trimmed in Contig.add_dfe to exclude the flanks. I don't think there'd be any performance cost to simulating these in SLiM, because no mutations or recombination events would ever occur on the flanks. So we'd just need to add a test or two to ensure this is the case.