Closed lntran26 closed 3 months ago
one thing about GONE
is that it should only really be reliable for the most recent history. @chriscrsmith do you have a sense of the relevant timescale?
I think you're right. Supposedly "Within the range of the most recent 200 generations, GONE outperforms the other methods" (https://doi.org/10.1093/molbev/msaa169)
Since @mufernando and @petrelharp asked on the call this morning I try to describe the bug I currently have here:
Example error for run_stairwayplot
:
RuleException:
ValueError in line 208 of /groups/rgutenk/lnt/projects/stdpopsim/analysis2/workflows/n_t.snake:
Annotations 'HomSap/none' not in catalog (ensembl_havana_104_exons, ensembl_havana_104_CDS)
Example error for ts_to_multihep
for running msmc:
RuleException:
ValueError in line 362 of /groups/rgutenk/lnt/projects/stdpopsim/analysis2/workflows/n_t.snake:
Annotations 'HomSap/none' not in catalog (ensembl_havana_104_exons, ensembl_havana_104_CDS)
In n_t.snake
this is the new code for masking intervals. I think the issue is chrom_annotation=wildcards.annots
expects an annotation name that is in the catalog so the string "none" is not an option. I currently have "annotation_list": ["none"]
instead of "annotation_list": ["ensembl_havana_104_exons"]
in the config.yml
file, which can be handled by simulation.snake
(line 81-82):
elif wildcards.annots == "none":
contig = species.get_contig(wildcards.chrms, genetic_map=genetic_map_id)
but not currently handled in n_t.snake
. This could potentially be fixed by adding a few lines of code in n_t.snake
to handle this "none" case, which I plan to do. If anyone has other suggestions please let me know!
Update: After some minor changes, the neutral analysis seems to be working as expected now, except for our GONE
implementation, which might require some further testing/tweaking. Below is result for slim scaling factor 10 and sample size 10.
chr1:
chr21:
Neutral results for chr1 with slim scaling factor 1 (with perhaps some improvement in GONE at least for the Constant model):
Does GONE have a generation_time parameter in one of the config files that should be changed? [edit: or other params in the GONE config]
Is seems that there is no timescale at which GONE is getting things right here, so the idea that it is only accurate within the last X generations doesn't seem to be the explanation. Maybe a scaling issue could be contributing somewhat though? The constant-sized case is looking less bad for super-recent time (<10 gen) now maybe?
tagging @stsmall here as he as some experience with GONE
Tagging in to update on our discussion today. I think the issue is that there is not enough data for GONE to make a precise inference. @lntran26 and I will try to figure out a solution, but it is possible that GONE will not be useful or may take a long time to run. E.g., the authors state a run time of 9 hours for the human data from the paper.
I think this paper may offer an alternative, but my initial tests were frustrated by the hard-coded values. https://www.biorxiv.org/content/10.1101/2022.08.03.501074v1
As discussed in some of our coding meetings, to establish a neutral baseline for the current analysis 2 pipeline, I tried running neutral simulation and N(t) analysis, using the current
config.yaml
setting with SLiM scaling factor 15 and two demographic models. DFE and annotation are set to none. Below is what the results look like so far. I included the mask file for low recombination regions of the corresponding genetic map (HapmapII_GRCh37.mask.bed
) and I believe that at the time of running only msmc has a working masking function. There are some issues with the stairway plots as mentioned in #71. To me the msmc results here seem more reasonable than the results with selection so far, but I'm not sure about GONE. Maybe we need to rerun with masking for GONE. I can rerun this analysis as we make updates to the code and post the updated results here.