Simulations and N(t) analysis without selection

lntran26 commented 2 years ago

As discussed in some of our coding meetings, to establish a neutral baseline for the current analysis 2 pipeline, I tried running neutral simulation and N(t) analysis, using the current config.yamlsetting with SLiM scaling factor 15 and two demographic models. DFE and annotation are set to none. Below is what the results look like so far. I included the mask file for low recombination regions of the corresponding genetic map (HapmapII_GRCh37.mask.bed) and I believe that at the time of running only msmc has a working masking function. There are some issues with the stairway plots as mentioned in #71. To me the msmc results here seem more reasonable than the results with selection so far, but I'm not sure about GONE. Maybe we need to rerun with masking for GONE. I can rerun this analysis as we make updates to the code and post the updated results here.

neutral_constant_all_estimated_Ne neutral_OOF_all_estimated_Ne

andrewkern commented 2 years ago

one thing about GONE is that it should only really be reliable for the most recent history. @chriscrsmith do you have a sense of the relevant timescale?

chriscrsmith commented 2 years ago

I think you're right. Supposedly "Within the range of the most recent 200 generations, GONE outperforms the other methods" (https://doi.org/10.1093/molbev/msaa169)

lntran26 commented 2 years ago

Since @mufernando and @petrelharp asked on the call this morning I try to describe the bug I currently have here: Example error for run_stairwayplot :

RuleException:
ValueError in line 208 of /groups/rgutenk/lnt/projects/stdpopsim/analysis2/workflows/n_t.snake:
Annotations 'HomSap/none' not in catalog (ensembl_havana_104_exons, ensembl_havana_104_CDS)

Example error for ts_to_multihep for running msmc:

RuleException:
ValueError in line 362 of /groups/rgutenk/lnt/projects/stdpopsim/analysis2/workflows/n_t.snake:
Annotations 'HomSap/none' not in catalog (ensembl_havana_104_exons, ensembl_havana_104_CDS)

In n_t.snake this is the new code for masking intervals. I think the issue is chrom_annotation=wildcards.annots expects an annotation name that is in the catalog so the string "none" is not an option. I currently have "annotation_list": ["none"] instead of "annotation_list": ["ensembl_havana_104_exons"] in the config.yml file, which can be handled by simulation.snake (line 81-82):

elif wildcards.annots == "none":
    contig = species.get_contig(wildcards.chrms, genetic_map=genetic_map_id)

but not currently handled in n_t.snake. This could potentially be fixed by adding a few lines of code in n_t.snake to handle this "none" case, which I plan to do. If anyone has other suggestions please let me know!

lntran26 commented 2 years ago

Update: After some minor changes, the neutral analysis seems to be working as expected now, except for our GONE implementation, which might require some further testing/tweaking. Below is result for slim scaling factor 10 and sample size 10. chr1: all_estimated_Ne_constant_chr1 all_estimated_Ne_OOF_chr1

chr21: all_estimated_Ne_constant all_estimated_Ne_OOF

lntran26 commented 2 years ago

Neutral results for chr1 with slim scaling factor 1 (with perhaps some improvement in GONE at least for the Constant model):

all_estimated_Ne_constant all_estimated_Ne_OOF

chriscrsmith commented 2 years ago

Does GONE have a generation_time parameter in one of the config files that should be changed? [edit: or other params in the GONE config]

dschride commented 2 years ago

Is seems that there is no timescale at which GONE is getting things right here, so the idea that it is only accurate within the last X generations doesn't seem to be the explanation. Maybe a scaling issue could be contributing somewhat though? The constant-sized case is looking less bad for super-recent time (<10 gen) now maybe?

andrewkern commented 2 years ago

tagging @stsmall here as he as some experience with GONE

stsmall commented 1 year ago

Tagging in to update on our discussion today. I think the issue is that there is not enough data for GONE to make a precise inference. @lntran26 and I will try to figure out a solution, but it is possible that GONE will not be useful or may take a long time to run. E.g., the authors state a run time of 9 hours for the human data from the paper.

stsmall commented 1 year ago

I think this paper may offer an alternative, but my initial tests were frustrated by the hard-coded values. https://www.biorxiv.org/content/10.1101/2022.08.03.501074v1

popsim-consortium / analysis2

Simulations and N(t) analysis without selection #72