Open nspope opened 1 year ago
Looks like that's because this region is masked ("blank" trees) in the output tree sequence for both engines.
The hapmap files for both genetic maps start at ~15 Mb, so I assume that to fix this we'd need to add a line at the beginning specifying a 0 rate (so this region is not treated as "missing data")?
related to #1387
To get to the bottom of this I decided to look at branch stats instead.
With the SLiM engine, we get non-zero segsites for the first few windows when _recap_and_recale=False
, but these we go to 0 when _recap_and_rescale=True
.
With the msprime engine, they are always zero.
>>> mapname="HapMapII_GRCh37"
>>> sp = stdpopsim.get_species("HomSap")
>>> chr22 = sp.get_contig("chr22", genetic_map=mapname, right=20e6)
>>> constant_popsize = stdpopsim.PiecewiseConstantSize(1000)
>>> breaks = np.linspace(0, chr22.length, 7)
>>> breaks_mb = np.round(breaks/1e6, 1)
>>>
>>> eng = stdpopsim.get_engine("slim")
>>> ts_slim = eng.simulate(contig=chr22, demographic_model=constant_popsize, samples={"pop_0" : 10}, slim_scaling_factor=4, _recap_and_rescale=False)
/home/murillor/projects/anaconda3/envs/analysis2/lib/python3.10/site-packages/stdpopsim/slim_engine.py:1523: SLiMScalingFactorWarning: You're using a scaling factor (4). This should give similar results for many situations, but is not equivalent, especially in the presence of selection. When using rescaling, you should be careful---do checks and compare results across different values of the scaling factor.
warnings.warn(
>>> div_slim = ts_slim.segregating_sites(windows=breaks, mode="branch")
>>> for left, right, div in zip(breaks_mb[:-1], breaks_mb[1:], div_slim):
... print(f"segsites slim/{mapname} [{left}-{right} Mb]: {div}")
...
segsites slim/HapMapII_GRCh37 [0.0-3.3 Mb]: 8085.0
segsites slim/HapMapII_GRCh37 [3.3-6.7 Mb]: 8085.0
segsites slim/HapMapII_GRCh37 [6.7-10.0 Mb]: 8085.0
segsites slim/HapMapII_GRCh37 [10.0-13.3 Mb]: 8085.0
segsites slim/HapMapII_GRCh37 [13.3-16.7 Mb]: 7944.9128727
segsites slim/HapMapII_GRCh37 [16.7-20.0 Mb]: 7531.1644341
>>> msp = stdpopsim.get_engine("msprime")
>>> ts_msp = msp.simulate(contig=chr22, demographic_model=constant_popsize, samples={"pop_0" : 10})
>>> div_msp = ts_msp.segregating_sites(windows=breaks, mode="branch")
>>> for left, right, div in zip(breaks_mb[:-1], breaks_mb[1:], div_msp):
... print(f"segsites msp/{mapname} [{left}-{right} Mb]: {div}")
...
segsites msp/HapMapII_GRCh37 [0.0-3.3 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [3.3-6.7 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [6.7-10.0 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [10.0-13.3 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [13.3-16.7 Mb]: 3081.1006886320965
segsites msp/HapMapII_GRCh37 [16.7-20.0 Mb]: 12736.400180715054
>>> mapname="HapMapII_GRCh37"
>>> sp = stdpopsim.get_species("HomSap")
>>> chr22 = sp.get_contig("chr22", genetic_map=mapname, right=20e6)
/home/murillor/projects/anaconda3/envs/analysis2/lib/python3.10/site-packages/stdpopsim/genetic_maps.py:127: UserWarning: Recombination map has length 51229805.0, which is longer than chromosome length 50818468. The latter will be used.
warnings.warn(
>>> constant_popsize = stdpopsim.PiecewiseConstantSize(1000)
>>> breaks = np.linspace(0, chr22.length, 7)
>>> breaks_mb = np.round(breaks/1e6, 1)
>>>
>>> eng = stdpopsim.get_engine("slim")
>>> ts_slim = eng.simulate(contig=chr22, demographic_model=constant_popsize, samples={"pop_0" : 10}, slim_scaling_factor=4)
/home/murillor/projects/anaconda3/envs/analysis2/lib/python3.10/site-packages/stdpopsim/slim_engine.py:1523: SLiMScalingFactorWarning: You're using a scaling factor (4). This should give similar results for many situations, but is not equivalent, especially in the presence of selection. When using rescaling, you should be careful---do checks and compare results across different values of the scaling factor.
warnings.warn(
/home/murillor/projects/anaconda3/envs/analysis2/lib/python3.10/site-packages/msprime/ancestry.py:831: TimeUnitsMismatchWarning: The initial_state has time_units=ticks but time is measured in generations in msprime. This may lead to significant discrepancies between the timescales. If you wish to suppress this warning, you can use, e.g., warnings.simplefilter('ignore', msprime.TimeUnitsMismatchWarning)
warnings.warn(message, TimeUnitsMismatchWarning)
>>> div_slim = ts_slim.segregating_sites(windows=breaks, mode="branch")
>>> for left, right, div in zip(breaks_mb[:-1], breaks_mb[1:], div_slim):
... print(f"segsites slim/{mapname} [{left}-{right} Mb]: {div}")
...
segsites slim/HapMapII_GRCh37 [0.0-3.3 Mb]: 0.0
segsites slim/HapMapII_GRCh37 [3.3-6.7 Mb]: 0.0
segsites slim/HapMapII_GRCh37 [6.7-10.0 Mb]: 0.0
segsites slim/HapMapII_GRCh37 [10.0-13.3 Mb]: 0.0
segsites slim/HapMapII_GRCh37 [13.3-16.7 Mb]: 2195.502231990335
segsites slim/HapMapII_GRCh37 [16.7-20.0 Mb]: 13879.990809453386
>>> msp = stdpopsim.get_engine("msprime")
>>> ts_msp = msp.simulate(contig=chr22, demographic_model=constant_popsize, samples={"pop_0" : 10})
>>> div_msp = ts_msp.segregating_sites(windows=breaks, mode="branch")
>>> for left, right, div in zip(breaks_mb[:-1], breaks_mb[1:], div_msp):
... print(f"segsites msp/{mapname} [{left}-{right} Mb]: {div}")
...
segsites msp/HapMapII_GRCh37 [0.0-3.3 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [3.3-6.7 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [6.7-10.0 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [10.0-13.3 Mb]: 0.0
segsites msp/HapMapII_GRCh37 [13.3-16.7 Mb]: 2172.3004433208343
segsites msp/HapMapII_GRCh37 [16.7-20.0 Mb]: 14758.390178399126
So it looks like msprime
is at fault here --- it is removing the trees where rec rate is 0.
wouldn't it be where the rec rate is nan, not 0?
oops, yes, that is what I meant!
Missing data in RateMap
is documented here ... I think the issue is with our chr22 maps (which should explicitly mark the recombination rate as 0).
Trimming out the trees with missing data on the flanks is definitely by design, as we've had a number of cases where these large zero-recombination rate bits flanking a chromosome have really messed up statistics.
Do we really know that the start of chr22 has a recombination rate of 0, or is it just that there's not much data there because of sequence weirdness?
Weirdness is probably the answer @jeromekelleher -- looks like the short arm of chr22 (0 to ~15Mb) contains many repetitive regions that are homologous to (and may recombine with) parts of other chromosomes.
One thing that confuses me with these maps is that they invariably contain missing data at the start, but 0 recombination rate regions at the end. For example,
import stdpopsim
homsap = stdpopsim.get_species("HomSap")
chr22 = homsap.get_contig("chr22", genetic_map="HapMapII_GRCh38")
engine = stdpopsim.get_engine("msprime")
model = stdpopsim.PiecewiseConstantSize(1000)
sim = engine.simulate(contig=chr22, demographic_model=model, samples={"pop_0":2})
print(chr22.length)
# 50818468.0
print(chr22.recombination_map.asdict())
# {'position': array([ 0., 15287921., 15295878., ..., 50785209., 50791377., 50818468.]),
# 'rate': array([ nan, 6.74364e-09, 4.31139e-09, ..., 6.48707e-09, 6.55239e-09, 0.00000e+00])}
print([sim.first().num_edges, sim.last().num_edges])
# [0, 6]
print(list(sim.breakpoints())[-5:])
# [50714355.0, 50756801.0, 50760002.0, 50779378.0, 50818468.0]
so, the last bin (from 50791377 to 50818468) is not masked but instead simulated with a zero recombination rate, which results in a single tree that spans from 50779378 to the chromosome end in this simulation. In contrast the first bin is missing so contains "empty" trees in the output.
All the chromosomes are like this-- just seems odd to me that the chromosome start/end aren't treated the same?
That does sound like we're not treating the ends consistently, and should mask the ends as well. A topic for the next stdpopsim meeting next week?
Agreed, sounds like the ends aren't being handled properly.
Here is where the problem lies:
That would be it - should just be np.nan I guess.
A small complication-- if the SLiM engine is used, then mutations w/ fitness effects may be placed on masked intervals (which seem to be simulated as having 0-recombination rate). Though these mutations will ultimately be removed by _recap_and_rescale
, they could influence the genealogies in unmasked regions.
import stdpopsim
homsap = stdpopsim.get_species("HomSap")
chr1_left_edge = homsap.get_contig("chr1", genetic_map="HapMapII_GRCh38", right=100000)
# deleterious DFE over entire contig
mt = stdpopsim.MutationType(
distribution_type="f",
dominance_coeff=1.0,
distribution_args=[-0.01],
)
dfe = stdpopsim.DFE(
id="new_mutation",
mutation_types=[mt],
proportions=[1.0],
description="",
long_description="",
)
chr1_left_edge.add_dfe(intervals=[[0, chr1_left_edge.length]], DFE=dfe)
engine = stdpopsim.get_engine("slim")
model = stdpopsim.PiecewiseConstantSize(1000)
sim = engine.simulate(
contig=chr1_left_edge,
demographic_model=model,
samples={"pop_0":2},
slim_scaling_factor=4,
_recap_and_rescale=False,
seed=1024,
)
print(chr1_left_edge.recombination_map.asdict()) # First 55Kb masked in recmap
# {'position': array([ 0., 55550., 82571., 88169., 100000.]),
# 'rate': array([ nan, 2.981822e-08, 2.082414e-08, 2.484956e-08])}
print(sim.first().num_edges) # Without recap/rescale, there's a tree in masked region
# 913
print([i for i in sim.breakpoints()][:5]) # Looks like masked region is simulated as nonrecombining
# [0.0, 55826.0, 55850.0, 56041.0, 58092.0]
print(sim.first().num_mutations) # ... and there are deleterious mutations on that tree
# 5
print([v.position for v in sim.variants()])
# [19521.0, 19742.0, 22695.0, 36215.0, 50202.0, 77121.0, 78131.0, 83261.0]
The thing to do might be to mask DFE intervals in add_dfe
, according to where there's missing data in the recombination map?
yes this is a problem @nspope, and one that is pretty context specific. That said, in empirical data masked regions are generally highly repetitive and gene poor (e.g. heterochromatic regions of genomes). If we are masking low recombination regions, those are indeed likely to be gene poor, so perhaps this really is a small issue?
We should really be consistent here - is "recombination rate nan" how we are communicating "this region is no good, don't simulate anything here"? If so, we should propagate that choice through to DFEs as well (so, perhaps simulate
would have a step where it masks any DFEs by the bad regions).
Yeah I agree. Calling it "masked regions" or "missing data" was putting it the wrong way, as these terms could be conflated with post-simulation masking.
Instead, we don't know the parameter values over an interval, and so can't simulate it. My inclination is to follow msprime
:
Fixing #1404 should resolve this issue, I think -- when we extract a chunk of a chromosome, we'll use RateMap.slice
to set the flanking part of the rate map to NaN, so that simulation isn't performed over the flanks and the coordinate system matches the original chromosome. The DFE intervals will be trimmed in Contig.add_dfe
to exclude the flanks. I don't think there'd be any performance cost to simulating these in SLiM, because no mutations or recombination events would ever occur on the flanks. So we'd just need to add a test or two to ensure this is the case.
For example:
gives