Closed lntran26 closed 2 years ago
Ah-ha - good sleuthing. And, the problem is this line:
contig_non_neu.clear_dfes()
When the contig is set up, it has by default a neutral DFE, everywhere. By removing it, you're asking that no neutral mutations be simulated in the remainder of the contig - so, you only get mutations in the exons. This is left over from how an early draft of DFEs worked.
Can someone please run
grep clear_dfes **/*.py
to check if this function is used anywhere else in the analysis2 code? (It shouldn't be.)
Removing that line, I get roughly the same number of segregating sites for both:
>>> ts_neu.num_sites
67370
>>> ts_non_neu.num_sites
63160
Thank you so much for the explanation! clear_dfes() is used in the pipeline at least once, in simulation.snake, which is where I got that line from. I'll make sure to update it for future runs and double-check other places.
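Since the call turned up in simulation.snake, a search limited to *.py files would not have found it. Here is a minimal sketch of a wider search, using only the Python standard library. The function name find_calls and the list of suffixes (.py, .snake, .smk, plus files named Snakefile) are assumptions for illustration, not part of analysis2.

```python
from pathlib import Path


def find_calls(root, needle="clear_dfes", suffixes=(".py", ".snake", ".smk")):
    """Return (path, line_number, stripped_line) for each line containing needle.

    find_calls is a hypothetical helper; the suffix list is an assumption
    about which files in the repository may contain Python code.
    """
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        # Also accept files named "Snakefile", which have no suffix.
        if path.suffix not in suffixes and path.name != "Snakefile":
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((str(path), lineno, line.strip()))
    return hits


if __name__ == "__main__":
    # Search from the current directory and print grep-style results.
    for path, lineno, line in find_calls("."):
        print(f"{path}:{lineno}: {line}")
```

Run from the repository root, this prints one line per match in the style of grep -n, so any remaining clear_dfes() calls in workflow files show up alongside those in plain Python modules.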
It is used in https://github.com/popsim-consortium/analysis2/blob/main/workflows/simulation.snake#L76
So we should remove this line?
Yes. (But perhaps you could double-check.)
We don't need the non-exonic SNPs for DFE inference, which explains why this issue wasn't affecting @xin-huang . But if we want to use the same simulations for all analyses, it should go there too.
When running the analysis pipeline on simulations generated with selection, I found that there were very few SNPs in the data. Upon closer inspection, this doesn't seem to be a problem with the masking code but rather with the simulation with a DFE. Even before any mask is applied, the total number of SNPs in the tree sequence is much lower when simulating with a DFE than without one, and most of the SNPs are inside the exon intervals, which make up about 1.4% of the chromosome length. Below is some example code. Unless I'm using it incorrectly, we might need to double-check the SLiM DFE implementation in stdpopsim and/or how we're using it in the pipeline.
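The code from the original report is not reproduced here. Separately, as a minimal sketch of the check described above, the helpers below compare the fraction of sites that fall inside a set of intervals with the fraction of the sequence those intervals cover. The function names and the toy numbers are assumptions for illustration. With a tskit tree sequence, the positions could plausibly come from ts.tables.sites.position and the intervals from the exon annotation used in the simulation.

```python
from bisect import bisect_right


def fraction_in_intervals(positions, intervals):
    """Fraction of positions inside sorted, non-overlapping [start, end) intervals.

    Hypothetical helper: `positions` is any sequence of site coordinates and
    `intervals` is a list of (start, end) pairs sorted by start.
    """
    starts = [start for start, _ in intervals]
    inside = 0
    for pos in positions:
        # Index of the last interval whose start is <= pos, or -1 if none.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < intervals[i][1]:
            inside += 1
    n = len(positions)
    return inside / n if n else 0.0


def covered_fraction(intervals, sequence_length):
    """Fraction of the sequence length covered by the intervals."""
    return sum(end - start for start, end in intervals) / sequence_length


if __name__ == "__main__":
    # Toy data, not taken from any real simulation.
    exons = [(100, 200), (500, 600)]
    sites = [50, 100, 150, 199, 200, 550, 700]
    print(fraction_in_intervals(sites, exons))
    print(covered_fraction(exons, 10_000))
```

If mutations are placed along the whole contig, the two fractions should be of broadly similar size (selection in exons will pull the first one down somewhat). A site fraction close to 1 when the intervals cover only about 1.4% of the sequence points to mutations being generated in the exons only.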