popsim-consortium / analysis2

Analysis for the second consortium paper.
8 stars 14 forks source link

Lower diversity than expected when simulating tree sequence with gamma DFE #75

Closed lntran26 closed 2 years ago

lntran26 commented 2 years ago

When running the analysis pipeline on simulations generated with selection, I found that there were very few SNPs in the data. Upon closer inspection, this doesn't seem to be a problem with the masking code but rather with the simulation with DFE. Even before any mask is applied, the total SNPs in the tree sequence is much lower when simulating with DFE than without DFE, and most of the SNPs are inside the exon intervals, which is about 1.4% of the chromosome length. Below is some example code. Unless I'm not using it correctly we might need to double check the SLiM DFE implementation in stdpopsim and/or how we're using it in the pipeline.

import stdpopsim
import numpy as np

# shared settings
species = stdpopsim.get_species("HomSap")
model = species.get_demographic_model("OutOfAfrica_3G09")
samples = model.get_samples(10, 10, 10)
engine = stdpopsim.get_engine("slim")

# simulate contig without DFE
contig_neu = species.get_contig("chr21", genetic_map="HapMapII_GRCh38")
contig_neu.mutation_rate = 1.29e-08
ts_neu = engine.simulate(model, contig_neu, samples, slim_scaling_factor=30, slim_burn_in=10, seed=12345)
ts_neu.sequence_length, ts_neu.genotype_matrix().shape
# (46709983.0, (67370, 30))

# simulate contig with DFE
contig_non_neu = species.get_contig("chr21", genetic_map="HapMapII_GRCh38")
contig_non_neu.mutation_rate = 1.29e-08
annot = species.get_annotations("ensembl_havana_104_exons")
annot_chr = annot.get_chromosome_annotations("chr21")
contig_non_neu.clear_dfes()
dfe = species.get_dfe("Gamma_H17")
contig_non_neu.add_dfe(intervals=annot_chr, DFE=dfe)
ts_non_neu = engine.simulate(model, contig_non_neu, samples, slim_scaling_factor=30, slim_burn_in=10, seed=12345)
ts_non_neu.sequence_length, ts_non_neu.genotype_matrix().shape
# (46709983.0, (872, 30))

# how many of 872 SNPs are in exon intervals
positions = ts_non_neu.sites_position
index_site_in_exon = []
for interval in annot_chr:
    masked_idx = np.where(np.logical_and(positions>=interval[0], positions<interval[1]))[0]
    if len(masked_idx) != 0:
        index_site_in_exon += masked_idx.tolist()
len(index_site_in_exon)
# 776

exon_total = np.sum(annot_chr[:,1] - annot_chr[:,0])
exon_total / ts_neu.sequence_length
# 0.014866821938256754
petrelharp commented 2 years ago

Ah-ha - good sleuthing. And, the problem is this line:

contig_non_neu.clear_dfes()

When the contig is set up, it has by default a neutral DFE, everywhere. By removing it, you're asking that no neutral mutations be simulated in the remainder of the contig - so, you only get mutations in the exons. This is left over from how an early draft of DFEs worked.

Can someone please grep clear_dfes **/*.py to check if this function is used anywhere else in the analysis2 code? (it shouldn't be)

Removing that line, I get roughly the same number of segregating sites for both:

>>> ts_neu.num_sites
67370
>>> ts_non_neu.num_sites
63160
lntran26 commented 2 years ago

Thank you so much for the explanation! The clear_dfes() is used in the pipeline at least once in simulation.snake, which is where I got that line from. I'll make sure to update it for future runs and double check in other places.

xin-huang commented 2 years ago

Ah-ha - good sleuthing. And, the problem is this line:

contig_non_neu.clear_dfes()

When the contig is set up, it has by default a neutral DFE, everywhere. By removing it, you're asking that no neutral mutations be simulated in the remainder of the contig - so, you only get mutations in the exons. This is left over from how an early draft of DFEs worked.

Can someone please grep clear_dfes **/*.py to check if this function is used anywhere else in the analysis2 code? (it shouldn't be)

Removing that line, I get roughly the same number of segregating sites for both:

>>> ts_neu.num_sites
67370
>>> ts_non_neu.num_sites
63160

It is used in https://github.com/popsim-consortium/analysis2/blob/main/workflows/simulation.snake#L76 So we should remove this line?

petrelharp commented 2 years ago

So we should remove this line?

Yes. (but perhaps you could double-check )

RyanGutenkunst commented 2 years ago

We don't need the non-exonic SNPs for DFE inference, which explains why this issue wasn't affecting @xin-huang . But if want to use the same simulations for all analyses, it should go there too.