stsmall closed this pull request 1 year ago
This failed because the Snakefile is looking for simulation.smk and n_t.smk. I was using those file extensions during testing, but reverted for the PR since main was using the .snake extension. I was running the modules directly, so I didn't catch this name mismatch.
@stsmall do you want to push another commit moving the filenames to .smk? Or hold off for another, separate and small PR?
I left the file names as .snake so they triggered a file comparison when merging against main. Once I changed the Snakefile directly (772b0b6), the PR passed all checks. So let's push the .smk extension change after the PR is good to go.
A related issue I'm seeing: it looks like the pipeline throws errors if I run it with snakemake -c 10 all, but if I specify individual workflow files (e.g. snakemake -c 10 --snakefile workflows/simulation.snake) it works. I'm guessing this has to do with the expansions? Have you seen this behavior?
I ran the workflows as separate modules to avoid the DFE calculations, so no, I haven't seen that error:
snakemake -c 10 --snakefile workflows/simulation.snake
snakemake -c 10 --snakefile workflows/n_t.snake
Testing now, and it throws an error when running snakemake -c 10 all ... n_t seems to be looking for the simulated_data before it exists when running all.
So the issue is that there is no dependency on the simulation output in either stairwayplot or gone in these rules:
gone_prep_inputs
run_stairwayplot
so adding this line
expand(output_dir + "/simulated_data/{{demog}}/{{dfes}}/{{annots}}/{{seeds}}/sim_{chrms}.trees", chrms=chrm_list)
to gone_prep_inputs and run_stairwayplot seems to have fixed that error. How do I submit the patch?
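For reference, here is a standalone sketch (plain Python, not run inside Snakemake) of what that expand(...) line produces: {chrms} is filled from chrm_list, while the double-braced wildcards ({{demog}}, {{dfes}}, ...) survive as single-braced wildcards for Snakemake to resolve per job. The output_dir value and chrm_list contents below are assumptions for illustration.

```python
# Mimic expand() for a single wildcard with str.format: '{{' collapses to '{'
output_dir = "results"          # assumed value
chrm_list = ["chr1", "chr3"]    # assumed example contents

pattern = (
    output_dir
    + "/simulated_data/{{demog}}/{{dfes}}/{{annots}}/{{seeds}}/sim_{chrms}.trees"
)

expanded = [pattern.format(chrms=c) for c in chrm_list]
for path in expanded:
    print(path)
```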
all you do is push commits to this same branch. looks like you've done that.
I'm still seeing errors, though, when trying to run the all rule on the Snakefile:
ERROR:snakemake.logging:RuleException:
UnboundLocalError in file /home/adkern/analysis2/workflows/simulation.snake, line 88:
local variable 'dfe' referenced before assignment
File "/home/adkern/analysis2/workflows/simulation.snake", line 88, in __rule_simulation
File "/home/adkern/miniconda3/envs/analysis2/lib/python3.10/concurrent/futures/thread.py", line 58, in run
Right ... what is that error? I don't get it if I comment out dfe.snake, but I can't figure out why it is there ... for some reason it is not evaluating:
if wildcards.dfes != "none":
# Load dfe only if provided
dfe = species.get_dfe(wildcards.dfes)
I checked the wildcards and even eval'd in pdb ... but it keeps skipping that if statement, which then leads to no value of dfe in:
if wildcards.annots == "all_sites":
# Adding selection to the whole contig
contig.add_dfe(intervals=np.array([[0, int(contig.length)]]), DFE=dfe)
elif wildcards.annots == "none":
contig = species.get_contig(wildcards.chrms, genetic_map=genetic_map_id)
else:
## Adding annotation-only selection on exon regions
annot = species.get_annotations(wildcards.annots)
annot_intervals = annot.get_chromosome_annotations(wildcards.chrms)
contig.add_dfe(intervals=annot_intervals, DFE=dfe)
I didn't alter this code in simulation.snake; it is exactly the same as in main. I am testing a fix right now, will update in a few.
So the error is associated with dfe.snake not expecting the zip(dfe, annot) addition I made to simulation.snake. I can revert my code to not need the zip(), or adjust dfe.snake to use the zip().
Not sure it makes sense to have none/ensembl_havana_104_exons (which is what dfe.snake is expecting) ... unless I am missing something, what does no DFE but still annotations mean for the analysis?
MissingInputException in rule dfe_generate_dadi_fs in file /home/ssmall/Documents/projects/stdpopsim/analysis2/workflows/dfe.snake, line 81:
Missing input files for rule dfe_generate_dadi_fs:
output: /home/ssmall/Documents/projects/stdpopsim/analysis2/results/inference/OutOfAfrica_3G09/dadi/none/ensembl_havana_104_exons/1845187043/chr3/pop0/pop0.dadi.neu.fs, /home/ssmall/Documents/projects/stdpopsim/analysis2/results/inference/OutOfAfrica_3G09/dadi/none/ensembl_havana_104_exons/1845187043/chr3/pop0/pop0.dadi.nonneu.fs
wildcards: demog=OutOfAfrica_3G09, dfes=none, annots=ensembl_havana_104_exons, seeds=1845187043, chrms=chr3, ids=0
affected files:
/home/ssmall/Documents/projects/stdpopsim/analysis2/results/simulated_data/OutOfAfrica_3G09/none/ensembl_havana_104_exons/1845187043/sim_chr3.trees
Based on the if/else evaluation ... I don't know how dfes=none, annots=exons would not throw an error. You would drop to the else statement in the code below, which doesn't have dfe defined since dfes = none:
if wildcards.dfes != "none":
# Load dfe only if provided
dfe = species.get_dfe(wildcards.dfes)
if wildcards.annots == "all_sites":
# Adding selection to the whole contig
contig.add_dfe(intervals=np.array([[0, int(contig.length)]]), DFE=dfe)
elif wildcards.annots == "none":
contig = species.get_contig(wildcards.chrms, genetic_map=genetic_map_id)
else:
## Adding annotation-only selection on exon regions
annot = species.get_annotations(wildcards.annots)
annot_intervals = annot.get_chromosome_annotations(wildcards.chrms)
contig.add_dfe(intervals=annot_intervals, DFE=dfe)
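A stripped-down mimic of that branching (stdpopsim calls replaced by stubs; function and value names here are illustrative, not from the workflow) shows why dfes="none" with annots="exons" fails: dfe is only assigned inside the first if, so the final else references an unassigned local and Python raises UnboundLocalError.

```python
def add_selection(dfes, annots):
    if dfes != "none":
        dfe = "loaded-dfe"  # stand-in for species.get_dfe(wildcards.dfes)
    if annots == "all_sites":
        return ("whole-contig", dfe)
    elif annots == "none":
        return ("neutral-contig", None)
    else:
        # annotation-only selection; `dfe` is unbound when dfes == "none"
        return ("exon-intervals", dfe)

try:
    add_selection("none", "exons")
    reproduced = False
except UnboundLocalError:
    reproduced = True
print("reproduced UnboundLocalError:", reproduced)
```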
Looking back through the configs ... I see that it is only ever dfe = ["Gamma_H17"] with annot = ["exons", "all-sites"], OR dfe = ["none"], so my config file is passing illegal values to dfe.snake.
The zip() change I made allows multiple dfe/annot combinations to be run in the same snakemake invocation, comparing results of neutral sims and selection sims in the same plot; this was not possible before in the same snakefile. dfe = [none, gamma_h17, gamma_h17] with annot = [none, exons, all-sites] produces the pairs (none, none), (gamma_h17, exons), (gamma_h17, all-sites), without the rule-breaking (none, exons), (none, all-sites).
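The pairing behavior can be checked in plain Python (list contents assumed to match the example above):

```python
# zip() pairs the lists element-wise instead of taking their cross product
dfe_list = ["none", "gamma_h17", "gamma_h17"]
annotation_list = ["none", "exons", "all-sites"]

pairs = list(zip(dfe_list, annotation_list))
print(pairs)
# Only aligned pairs are generated; the rule-breaking combinations
# ("none", "exons") and ("none", "all-sites") never appear.
```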
OK ... I think I fixed it. Really very simple: I just had to remove "none" from the dfe and annot lists passed as input to dfe.snake. Initially I tried to pass different configs, but couldn't get that to work; I thought that local configs overrode the global config params, but couldn't figure it out. So I just added a few lines to remove "none" from dfe_list and annotation_list before running dfe.snake.
So the only changes to dfe.snake are: 1) change get_mask_file to get_mask_file_dfe; 2) remove "none" from dfe_list and annotation_list at the top of the file.
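A minimal sketch of change 2), filtering "none" out of the lists before they feed dfe.snake; the actual lines in the workflow may differ, and the list contents here are examples taken from the config discussion above.

```python
# Example config values; "none" must not reach the DFE inference rules
dfe_list = ["none", "Gamma_H17"]
annotation_list = ["none", "exons", "all-sites"]

# Drop the neutral placeholder before running dfe.snake
dfe_list = [d for d in dfe_list if d != "none"]
annotation_list = [a for a in annotation_list if a != "none"]
print(dfe_list, annotation_list)
```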
Okay, I just committed a change which fixes the bugs on the dadi side. I think we are there?
okay @stsmall this is looking ready to me. I wonder if anyone else wants to lay eyes on this PR before we merge? @chriscrsmith @lntran26?
If there is time I would look at it again tomorrow—otherwise go ahead and merge
Confirming that everything seems to be running smoothly for me. Not all jobs have finished running yet, but the pipeline submits everything to the cluster and is running without errors on the simulation and N(t) side so far. I ran into some errors on the DFE side (dfe_download_dfe_alpha and dfe_dadi_infer_dm: dadi-cli: error: unrecognized arguments: --cpus 8). I will inspect the results more closely once the run completes, but I think we can merge.
@lntran26: and is this with the latest update on the dfe side?
@lntran26 I think this error is because of a dadi version mismatch in your environment. The new environment requires dadi==2.2.0, so you'll need to update that (or recreate the python env).
@chriscrsmith Yes, I pulled the latest version from main then switched to a new branch to test out this PR. @andrewkern I did update the env using the newest environment.yaml file, and it definitely pulled dadi-cli from the repo as well as updating dadi to dadi==2.2.0. It seems like I'm still on the older version of dadi-cli somehow, hence the dadi-cli: error: unrecognized arguments: --cpus 8 (the old code used --threads). Maybe I need to recreate the env fresh, since updating didn't fully do it.
Yes I'd guess something weird is happening in your env.
if there are no further updates from reviewers here I vote we merge-- any objections?
Do you want me to look through it today?
sure Chris!
Overall, I confirm that I was able to produce a plot that looks like what @stsmall showed! Perhaps my GONE run for YRI somehow looks a bit better? Except for the errorbar error, I don't have any other issues.
What's the status on snakemake -c M all versus running the individual workflows?
I am getting a bunch of red with snakemake -c M all, but maybe we were dealing with that later?
Running all is working for a bunch of us. What sort of errors are you getting?
I don't have time right now to go through all of @chriscrsmith's changes, but many of his suggestions will break the n_t.snake code. I will attempt to get to them tonight.
First, I'm getting some files like analysis2/hs_err_pid109804.log due to insufficient memory. They don't help narrow down the specific rule.
@chriscrsmith can you please post your conda env, e.g., the output of conda list?
nuked conda and it got past the smcpp install :grimacing:
Getting OOMs with msmc on the Oregon cluster; asking for more memory in the config helps.
Did anything change in the past week that would affect that?
perhaps, and I haven't tried to run the pipeline on the cluster since this PR
Yep, I've confirmed this OOM error on our system. I'll push a commit to n_t.snake that specifies the memory requirements for the run_msmc rule.
I also noticed jobs running past the 1 hr wall time. I increased the wall time, and my test run is still going from earlier today...
those same run_msmc jobs?
yeah. The snakemake command sort of idled. The logs showed msmc had finished 8/20 iterations or so
Roger, I can tack wall time onto that rule too.
mine just finished. I had set wall time to 300 as a test (observed time was 3hr 40min, including multiple batches of jobs)
300 hours?
Minutes, I assume; I didn't specify the units. Oops, it was actually 600 minutes:
default-resources:
- time=600
Okay, I pushed a commit that tweaks the Oregon cluster profile, adds a mem_mb=24000 resource to the run_msmc rule, and moves localrules to a higher level in the workflow.
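For reference, a hedged sketch (not the actual commit) of how such resources can be attached to a rule in n_t.snake; the rule body is elided, and the values follow the numbers discussed above:

```
rule run_msmc:
    ...
    resources:
        mem_mb=24000,  # memory bump for the OOMs discussed above
        time=600       # wall time in minutes, matching default-resources
    ...
```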
I think at this point I'd like to merge this PR and harmonize the sweep side of things with what we have here.
sounds good!
directory structure
results/simulated_data  # NO CHANGE
results/inference/{gone,msmc,smcpp,stairway}/demog/dfe/annotations/seed/{wildcards.pops}
results/plots/demog/estimated_Ne_t.final.{csv, pdf}
results/plots/demog/{gone, msmc, smcpp, stairway}/{method}_estimated_Ne_t.{csv, pdf}
- All data are output in a CSV table in long format, meaning easy plotting in ggplot2 or seaborn
- Added a pop wildcard to n_t.snake; this allows the parallel calculation of all pops in the simulated data
- Separated the coal/census size references into a table located at results/plots/demog/coal_estimated_Net.csv
- stairwayplot uses all chromosomes along with a list of masked intervals for each chromosome
- gone uses all chromosomes along with a list of masked intervals for each chromosome
- msmc now uses the -I flag to select the number of haplotypes
- smcpp now includes the -d flag to use 2 random haplotypes to aid the inference. From the SMC++ documentation: "SMC++ relies crucially on the notion of a pair of distinguished lineages (see paper for details on this terminology). The identity of the distinguished lineages is set using the -d option, which specifies the sample(s) which will form the distinguished pair."
- Plotting functions for n_t updated to seaborn facet plots for readability
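To illustrate the long-format point: each row holds one (method, pop, year, Ne) observation, so seaborn or ggplot2 can facet or color by any column directly. The column names and values below are assumptions for illustration, not the pipeline's actual schema.

```python
import pandas as pd

# Long format: one observation per row, categorical columns identify it
df = pd.DataFrame({
    "method": ["msmc", "msmc", "stairway", "stairway"],
    "pop": ["pop0", "pop1", "pop0", "pop1"],
    "year": [1000, 1000, 1000, 1000],
    "Ne": [12000.0, 9000.0, 11500.0, 8700.0],
})

# A facet plot is then a single call, e.g.:
#   sns.relplot(data=df, x="year", y="Ne", col="method", hue="pop")
print(df.groupby("method")["Ne"].mean())
```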