proposed production-level config for humans

chriscrsmith commented 1 year ago

Soliciting feedback on the choice of settings in the proposed config. This will be what @lntran26 and I use for the hopefully final simulations in humans without scaling.

Notes:

I made a separate config for human. Thought it would be cleaner to break up sims for different species?
I think 100 samples is an increase from 20 which I've been using up to this point.
add demographic model OutOfAfricaArchaicAdmixture_5R19

stsmall commented 1 year ago

@chriscrsmith, I don't recall how to line comment on the commit. For the demofix, you need to follow the tiny_config.yaml format. Specifically:

"mask_file": "workflows/masks/HapmapII_GRCh37.mask.bed"
# set any of the below to 'none' to skip annot masking
"stairway_annot_mask" : ""
"msmc_annot_mask" : ""
"gone_annot_mask" : ""
"smcpp_annot_mask" : ""
"methods" : ["stairwayplot", "gone", "smcpp", "msmc"]

stsmall commented 1 year ago

also you want the "num_msmc_iterations" to be at least 20

stsmall commented 1 year ago

dfe and annotations list need to be the same length

"dfe_list": ["Gamma_H17", "Gamma_H17"]
"annotation_list": ["all_sites", "ensembl_havana_104_exons"]
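A quick pre-flight check along these lines could catch a length mismatch before a run starts. This is only a sketch: `check_config` is a made-up helper, not something in the workflow, and it assumes the config has already been loaded into a plain dict and that the two lists are meant to be paired entry by entry.

```python
def check_config(config):
    """Return a list of problems found in a config dict (empty if none).

    Hypothetical helper: only checks that dfe_list and annotation_list
    have the same number of entries.
    """
    problems = []
    dfes = config.get("dfe_list", [])
    annots = config.get("annotation_list", [])
    if len(dfes) != len(annots):
        problems.append(
            f"dfe_list has {len(dfes)} entries but "
            f"annotation_list has {len(annots)}"
        )
    return problems


# The two lists from the comment above, as they would look once loaded.
config = {
    "dfe_list": ["Gamma_H17", "Gamma_H17"],
    "annotation_list": ["all_sites", "ensembl_havana_104_exons"],
}
print(check_config(config))  # prints [] since both lists have two entries
```

Returning a list of messages rather than raising on the first problem makes it easy to add further checks later and report them all at once.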

stsmall commented 1 year ago

For 'replicates' I think it should be more than 3. Not sure what an upper limit is ... maybe 10 or at least 20?

chriscrsmith commented 1 year ago

Thanks @stsmall. OK, what do you think of the new version? I left reps=3 until we get input from others. 20 sounds like too many to me.

stsmall commented 1 year ago

> Thanks @stsmall. Ok what do you think of the new version? I left reps=3 until we get input from others. 20 sounds like too many to me

The plots use seeds (reps) to create CI ribbons. 3 reps will just be noisy. IDK if 20 is too many or not enough, I just picked a number. Since it runs in parallel w/ the reps, it shouldn't be too much of a slowdown to do more, right? We could always add more later, but then we would have to rerun the n_t, dfe pipelines on the full dataset.
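To make the ribbon point concrete, here is a toy sketch. It is not the actual plotting code: it assumes the ribbon is a per-time-point min / median / max envelope across replicate curves, and both the `ribbon` helper and the numbers are invented for illustration.

```python
import statistics


def ribbon(curves):
    """Return (low, mid, high) per time point across replicate curves.

    Uses min / median / max as a stand-in for whatever interval the
    real plots draw (an assumption).
    """
    return [
        (min(vals), statistics.median(vals), max(vals))
        for vals in zip(*curves)
    ]


# Three made-up replicate curves, each with three time points.
curves = [
    [100, 200, 300],
    [110, 190, 320],
    [90, 210, 310],
]
print(ribbon(curves))
# prints [(90, 100, 110), (190, 200, 210), (300, 310, 320)]
```

With only three replicates the ribbon edges at each time point are simply the two replicates that are not the median, so a single odd run moves the whole band.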

stsmall commented 1 year ago

Otherwise it looks good. :) Do we want to do more variations for msmc2? Right now it is just 6. More haps (maybe the limit is 16?) will provide better resolution of <1000 gens, which is where msmc2 really seems to go awry. The run time will get way longer and even though it is parallelized, it is still the last thing to finish.

chriscrsmith commented 1 year ago

gotcha, the ribbons. 20 reps sounds good

andrewkern commented 1 year ago

i'm a bit concerned about the compute cost of 20 reps up front. The way these runs go, we almost always have to rerun it. i think we should start with 3 reps -- if that completes in a reasonable time we can generate more reps if we want to. One way we could do this would be to have two seeds -- 1 for the first 3 reps, then a second seed for the next 17 (or whatever number..)
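The two-seed idea could look roughly like this. It is a sketch only: `replicate_seeds` and the master seed values are hypothetical, and the workflow may derive its per-replicate seeds differently.

```python
import random


def replicate_seeds(master_seed, n_reps):
    """Draw n_reps per-replicate seeds from one master seed.

    Hypothetical helper. Collisions between batches are unlikely but
    not ruled out; a real version might check for duplicates.
    """
    rng = random.Random(master_seed)
    return [rng.randrange(1, 2**31) for _ in range(n_reps)]


# First pass: 3 reps from the first master seed (placeholder value).
first_batch = replicate_seeds(master_seed=12345, n_reps=3)

# Later, if the first pass finishes in reasonable time: 17 more reps
# from a second master seed (placeholder value).
second_batch = replicate_seeds(master_seed=67890, n_reps=17)

all_seeds = first_batch + second_batch
print(len(all_seeds))  # prints 20
```

Because each batch depends only on its own master seed, adding the second batch leaves the seeds of the first three reps unchanged, so the finished runs would not need to be repeated.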

chriscrsmith commented 1 year ago

> Otherwise it looks good. :) Do we want to do more variations for msmc2? Right now it is just 6. More haps (maybe the limit is 16?) will provide better resolution of <1000 gens, which is where msmc2 really seems to go awry. The run time will get way longer and even though it is parallelized, it is still the last thing to finish.

If it's already the longest running part of the analysis, I think let's leave for now, update later as needed?

chriscrsmith commented 1 year ago

see new commit: changed genetic map, deleted some unused parameters

I have not done a full run yet, but if I turn on scaling it seems to get off the ground ok.

chriscrsmith commented 1 year ago

There was some talk in the Tuesday meeting about potentially doing the Papuan demographic model. What does everyone think?

RyanGutenkunst commented 1 year ago

That would be a flex. :-) I guess we'd assume the DFE was the same in Denisova and Neanderthal as modern humans. We'd lose the easy comparison with the previous paper, but if we run and include the neutral analysis here, that's no problem.

petrelharp commented 1 year ago

Say, @chriscrsmith - could you clarify what exactly is being proposed? Like, is there going to be just one demographic model? Or, more than one? And, what DFE(s)?

chriscrsmith commented 1 year ago

Demog.

I imagined at least running the same demographic model from the previous paper, for comparison. So, OutOfAfricaArchaicAdmixture_5R19
However we have been using the OutOfAfrica_3G09. Is there something special about this one? Do we leave this model in the analysis?
Based on Ryan's feedback I'd lean towards skipping the Papuan model. But was wondering if we should include it alongside the other one(s).

DFEs

Gamma_K17: Kim et al. (2017), https://doi.org/10.1534/genetics.116.197145
Gamma_H17: Huber et al. (2017), https://doi.org/10.1073/pnas.1619508114

petrelharp commented 1 year ago

I'm still a bit fuzzy here - are we deciding which single model to run, or are we deciding between having 1 or 2 models? Or what? And, concretely, what goes in the paper - is this the demographic model(s) that'll be used for both (a) inferring DFEs and (b) the effect of selection on demographic inference? The same one(s) for both?

chriscrsmith commented 1 year ago

Demog.

I imagined at least running the same demographic model from the previous paper, for comparison. So, OutOfAfricaArchaicAdmixture_5R19
However we have been using the OutOfAfrica_3G09. Is there something special about this one? Here are options: 1. Do we leave this model in the analysis? 2. Take it out?
Based on Ryan's feedback I'd lean towards skipping the Papuan model. But was wondering if we should include it alongside the other one(s). Here are options: 1. Do we use this model? 2. Skip this model?

DFEs

Gamma_K17: Kim et al. (2017), https://doi.org/10.1534/genetics.116.197145
Gamma_H17: Huber et al. (2017), https://doi.org/10.1073/pnas.1619508114

chriscrsmith commented 1 year ago

> I'm still a bit fuzzy here - are we deciding which single model to run, or are we deciding between having 1 or 2 models?

I imagined at least running the same demographic model from the previous paper.

> And, concretely, what goes in the paper - is this the demographic model(s) that'll be used for both (a) inferring DFEs and (b) the effect of selection on demographic inference? The same one(s) for both?

I think that makes sense.

petrelharp commented 1 year ago

I agree about using the same model as in the last paper. There is nothing special (besides being an early model and thus jumping to our minds more easily?) about OutOfAfrica_3G09.

I don't have a good sense about whether we've got room for results about more than one model - that depends on what figures we want?

chriscrsmith commented 1 year ago

Updated the PR to delete the human model we've been using, so it's now replaced with the model from the previous paper.

Here's the relevant post about our plan for the paper: #8

petrelharp commented 1 year ago

Thanks for finding the outline! =) So, current proposal is to just have one model? That seems fine to me, really - unless there's a reason to think that methods might behave differently under some models than others? But, I guess if we're going to look at different scenarios I'd much rather look at different species than just different human models. So: I agree!

petrelharp commented 1 year ago

In the meeting just now we decided we can merge this.

chriscrsmith commented 11 months ago

In meeting just now agreed this looks good, minus the Gamma_H17 dfe.

petrelharp commented 11 months ago

@chriscrsmith says merge!

popsim-consortium / analysis2

proposed production-level config for humans #97