popsim-consortium / analysis2

Analysis for the second consortium paper.

Demo multpop fix #85

Closed · stsmall closed 1 year ago

stsmall commented 1 year ago

directory structure

    results/simulated_data  # NO CHANGE
    results/inference/{gone,msmc,smcpp,stairway}/demog/dfe/annotations/seed/{wildcards.pops}
    results/plots/demog/estimated_Ne_t.final.{csv,pdf}
    results/plots/demog/{gone,msmc,smcpp,stairway}/{method}_estimated_Ne_t.{csv,pdf}

- All data are output in a CSV table in long format, making plotting easy in ggplot2 or seaborn (see the sketch below).
- Added a pop wildcard to n_t.snake. This allows the parallel calculation of all pops in the simulated data.
- Separated the coal/census size references into a table located at results/plots/demog/coal_estimated_Net.csv.
- stairwayplot uses all chromosomes along with a list of masked intervals for each chromosome.
- gone uses all chromosomes along with a list of masked intervals for each chromosome.
- msmc now uses the -I flag to select the number of haplotypes.
- smcpp now includes the -d flag, using 2 random haplotypes to aid the inference. From the SMC++ docs on -d: "SMC++ relies crucially on the notion of a pair of distinguished lineages (see paper for details on this terminology). The identity of the distinguished lineages is set using the -d option, which specifies the sample(s) which will form the distinguished pair."
- Plotting functions for n_t updated to seaborn facet plots for readability.
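As a reference for the long-format plotting, here is a minimal seaborn sketch (the column names years, Ne, method, and pop are illustrative assumptions; the real schema is whatever n_t.snake writes):

    import pandas as pd
    import seaborn as sns

    # long format: one row per (method, population, time point)
    df = pd.read_csv("results/plots/demog/estimated_Ne_t.final.csv")

    # one facet per population, one line per inference method
    g = sns.relplot(data=df, x="years", y="Ne", hue="method", col="pop", kind="line")
    g.set(xscale="log", yscale="log")
    g.savefig("results/plots/demog/estimated_Ne_t.final.pdf")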

stsmall commented 1 year ago

This failed because the Snakefile is looking for simulation.smk and n_t.smk. I was using those file extensions during testing but reverted for the PR, since main was using the .snake extension. I was running the modules directly, so I didn't catch this name mismatch.

andrewkern commented 1 year ago

@stsmall do you want to push another commit moving the filenames to .smk? or hold off until a separate, small PR?

stsmall commented 1 year ago

I left the file names as .snake so they triggered a file comparison in the merge against main. Once I changed the Snakefile directly (772b0b6), the PR passed all checks. So let's push the .smk extension change after this PR is good to go.

andrewkern commented 1 year ago

a related issue I'm seeing -- it looks like the pipeline throws errors if I run it with snakemake -c 10 all, but if I specify individual workflow files (e.g. snakemake -c 10 --snakefile workflows/simulation.snake) it works.

i'm guessing this has to do with the expansions? have you seen this behavior?

stsmall commented 1 year ago

I ran the workflows as separate modules to avoid the dfe calculations, so no, I haven't seen that error:

    snakemake -c 10 --snakefile workflows/simulation.snake
    snakemake -c 10 --snakefile workflows/n_t.snake

Testing now, and it throws an error when running snakemake -c 10 all ... n_t seems to be looking for the simulated_data before it exists in all

so the issue is that there is no dependency on the simulation output in either stairwayplot or gone, specifically in the rules gone_prep_inputs and run_stairwayplot

stsmall commented 1 year ago

so adding this line to gone_prep_inputs and run_stairwayplot seems to have fixed that error (a sketch of the resulting rule is below):

    expand(output_dir + "/simulated_data/{{demog}}/{{dfes}}/{{annots}}/{{seeds}}/sim_{chrms}.trees", chrms=chrm_list)

How do I submit the patch?
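Roughly, the rule ends up looking like this (the output path and shell command here are placeholders for illustration, not the real rule body):

    rule gone_prep_inputs:
        input:
            # depend on the simulated tree sequences so snakemake runs the
            # simulation before preparing gone's inputs
            expand(
                output_dir + "/simulated_data/{{demog}}/{{dfes}}/{{annots}}/{{seeds}}/sim_{chrms}.trees",
                chrms=chrm_list,
            ),
        output:
            output_dir + "/inference/gone/{demog}/{dfes}/{annots}/{seeds}/inputs.done",  # hypothetical
        shell:
            "touch {output}"  # placeholder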

andrewkern commented 1 year ago

all you do is push commits to this same branch. looks like you've done that.

I'm still seeing errors though in trying to run the all rule on the Snakefile

ERROR:snakemake.logging:RuleException:
UnboundLocalError in file /home/adkern/analysis2/workflows/simulation.snake, line 88:
local variable 'dfe' referenced before assignment
  File "/home/adkern/analysis2/workflows/simulation.snake", line 88, in __rule_simulation
  File "/home/adkern/miniconda3/envs/analysis2/lib/python3.10/concurrent/futures/thread.py", line 58, in run
stsmall commented 1 year ago

right ... what is that error? I don't get it if I comment out dfe.snake, but I can't figure out why it is there ... for some reason it is not evaluating:

        if wildcards.dfes != "none":
            # Load dfe only if provided
            dfe = species.get_dfe(wildcards.dfes)

I checked the wildcards and even evaluated them in pdb ... but it keeps skipping that if statement, which then leads to no value of dfe in:

        if wildcards.annots == "all_sites":
            # Adding selection to the whole contig
            contig.add_dfe(intervals=np.array([[0, int(contig.length)]]), DFE=dfe)
        elif wildcards.annots == "none":
            contig = species.get_contig(wildcards.chrms, genetic_map=genetic_map_id)
        else:
            # Adding selection only on annotated (exon) regions
            annot = species.get_annotations(wildcards.annots)
            annot_intervals = annot.get_chromosome_annotations(wildcards.chrms)
            contig.add_dfe(intervals=annot_intervals, DFE=dfe)

I didn't alter this code in simulation.snake; it is exactly the same as in main. I am testing a fix right now, will update in a few
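For what it's worth, one way to make the failure mode explicit instead of an UnboundLocalError (a sketch, not the fix that actually landed):

        dfe = None
        if wildcards.dfes != "none":
            # Load dfe only if provided
            dfe = species.get_dfe(wildcards.dfes)
        if wildcards.annots == "all_sites":
            if dfe is None:
                raise ValueError("annots=all_sites requires a DFE, but dfes=none")
            # Adding selection to the whole contig
            contig.add_dfe(intervals=np.array([[0, int(contig.length)]]), DFE=dfe)
        elif wildcards.annots == "none":
            contig = species.get_contig(wildcards.chrms, genetic_map=genetic_map_id)
        else:
            if dfe is None:
                raise ValueError(f"annots={wildcards.annots} requires a DFE, but dfes=none")
            # Adding selection only on annotated (exon) regions
            annot = species.get_annotations(wildcards.annots)
            annot_intervals = annot.get_chromosome_annotations(wildcards.chrms)
            contig.add_dfe(intervals=annot_intervals, DFE=dfe)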

stsmall commented 1 year ago

so the error is associated with dfe.snake not expecting the zip(dfe, annot) addition I made to simulation.snake. I can either revert my code to not need the zip() or adjust dfe.snake to use the zip()

I'm not sure it makes sense to have none/ensembl_havana_104_exons (which is what dfe.snake is expecting) ... unless I am missing something, what does no DFE but still annotations mean for the analysis?

MissingInputException in rule dfe_generate_dadi_fs in file /home/ssmall/Documents/projects/stdpopsim/analysis2/workflows/dfe.snake, line 81:
Missing input files for rule dfe_generate_dadi_fs:
    output: /home/ssmall/Documents/projects/stdpopsim/analysis2/results/inference/OutOfAfrica_3G09/dadi/none/ensembl_havana_104_exons/1845187043/chr3/pop0/pop0.dadi.neu.fs, /home/ssmall/Documents/projects/stdpopsim/analysis2/results/inference/OutOfAfrica_3G09/dadi/none/ensembl_havana_104_exons/1845187043/chr3/pop0/pop0.dadi.nonneu.fs
    wildcards: demog=OutOfAfrica_3G09, dfes=none, annots=ensembl_havana_104_exons, seeds=1845187043, chrms=chr3, ids=0
    affected files:
        /home/ssmall/Documents/projects/stdpopsim/analysis2/results/simulated_data/OutOfAfrica_3G09/none/ensembl_havana_104_exons/1845187043/sim_chr3.trees
stsmall commented 1 year ago

based on the if/else evaluation ... I don't know how dfe=none, annot=exons would not throw an error. You would drop to the else statement in the code below, which doesn't have dfe defined, since dfes = none.

        if wildcards.dfes != "none":
            # Load dfe only if provided
            dfe = species.get_dfe(wildcards.dfes)
        if wildcards.annots == "all_sites":
            # Adding selection to the whole contig
            contig.add_dfe(intervals=np.array([[0, int(contig.length)]]), DFE=dfe)
        elif wildcards.annots == "none":
            contig = species.get_contig(wildcards.chrms, genetic_map=genetic_map_id)
        else:
            # Adding selection only on annotated (exon) regions
            annot = species.get_annotations(wildcards.annots)
            annot_intervals = annot.get_chromosome_annotations(wildcards.chrms)
            contig.add_dfe(intervals=annot_intervals, DFE=dfe)
stsmall commented 1 year ago

looking back through the configs ... I see that it is only ever:

dfe = ["Gamma_H17"] w/ annot =["exons", "all-sites"] OR dfe=["none"] so my config file is passing illegal values to dfe.snake

The zip() change I made allows multiple dfe/annot pairs to be run in the same snakemake invocation, enabling comparison of neutral sims and selection sims in the same plot ... this was not possible before in the same Snakefile. dfe = [none, gamma_h17, gamma_h17] with annot = [none, exons, all-sites] produces the pairs (none, none), (gamma_h17, exons), (gamma_h17, all-sites) ... without the rule-breaking pairs (none, exons) and (none, all-sites). See the sketch below.
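In Snakemake terms this is expand() with the zip combinator instead of the default cartesian product; a minimal sketch (the list values are taken from this thread, the real ones come from the config):

    # the default expand() takes the product of the lists; passing zip pairs
    # them elementwise, so illegal (none, exons)-style combos never appear
    dfe_list = ["none", "Gamma_H17", "Gamma_H17"]
    annotation_list = ["none", "ensembl_havana_104_exons", "all_sites"]
    targets = expand(
        "results/simulated_data/{{demog}}/{dfes}/{annots}/{{seeds}}/sim_{{chrms}}.trees",
        zip,
        dfes=dfe_list,
        annots=annotation_list,
    )
    # -> (none, none), (Gamma_H17, ensembl_havana_104_exons), (Gamma_H17, all_sites)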

stsmall commented 1 year ago

OK ... I think I fixed it, and it's really very simple: I just had to remove "none" from the dfe and annot lists for input into dfe.snake. Initially I tried to pass different configs but couldn't get that to work; I thought local configs overwrote the global config params ... couldn't figure it out. So I just added a few lines to remove "none" from dfe_list and annotation_list before running dfe.snake.

so the only changes to dfe.snake are: 1) change get_mask_file to get_mask_file_dfe, and 2) remove "none" from dfe_list and annotation_list at the top of the file (roughly as sketched below).
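Something along these lines at the top of dfe.snake (a sketch; the config key names are assumptions):

    # drop the neutral-only entries before building dfe.snake's targets,
    # since "no DFE but annotations" is not a meaningful combination there
    dfe_list = [d for d in config["dfe_list"] if d != "none"]
    annotation_list = [a for a in config["annotation_list"] if a != "none"]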

andrewkern commented 1 year ago

okay i just committed a change which fixes the bugs on the dadi side. I think we are there?

andrewkern commented 1 year ago

okay @stsmall this is looking ready to me. I wonder if anyone else wants to lay eyes on this PR before we merge? @chriscrsmith @lntran26?

chriscrsmith commented 1 year ago

If there is time I would look at it again tomorrow—otherwise go ahead and merge

lntran26 commented 1 year ago

Confirming that everything seems to be running smoothly for me. Not all jobs have finished running yet, but the pipeline submits everything to the cluster and is running without errors for the simulation and N(t) side so far. I ran into some errors on the DFE side (dfe_download_dfe_alpha and dfe_dadi_infer_dm: dadi-cli: error: unrecognized arguments: --cpus 8). I will inspect the results more closely once the run completes, but I think we can merge.

chriscrsmith commented 1 year ago

@lntran26: and is this with the latest update on the dfe side?

andrewkern commented 1 year ago

@lntran26 I think this error is because of a dadi version mismatch in your environment. the new environment requires dadi==2.2.0, so you'll need to update that (or recreate the python env)

lntran26 commented 1 year ago

@chriscrsmith Yes. I pulled the latest version from main then switched to a new branch to test out this PR. @andrewkern I did update the env using the newest environment.yaml file, and it definitely pulled dadi-cli from the repo as well as updating dadi to dadi==2.2.0. It seems like I'm somehow still on the older version of dadi-cli, hence the dadi-cli: error: unrecognized arguments: --cpus 8 (the old code used --threads). Maybe I need to recreate the env from scratch since updating didn't fully do it.

andrewkern commented 1 year ago

Yes I'd guess something weird is happening in your env.

andrewkern commented 1 year ago

if there are no further updates from reviewers here I vote we merge-- any objections?

chriscrsmith commented 1 year ago

Do you want me to look through it today?

andrewkern commented 1 year ago

sure Chris!

lntran26 commented 1 year ago

Overall I can confirm that I was able to produce a plot that looks like what @stsmall showed! Perhaps my GONE run for YRI somehow looks a bit better? Except for the errorbar error I don't have any other issues.

[attached image: estimated_Ne_t_final plot]

chriscrsmith commented 1 year ago

What's the status on snakemake -c M all versus running the individual workflows?

I am getting a bunch of red with snakemake -c M all, but maybe we were dealing with that later?

andrewkern commented 1 year ago

running all is working for a bunch of us. what sort of errors are you getting?

stsmall commented 1 year ago

i don't have time right now to go through all of @chriscrsmith's changes, but many of his suggestions will break the n_t.snake code. I will attempt to get to them tonight

chriscrsmith commented 1 year ago

first, I'm getting some files like analysis2/hs_err_pid109804.log due to insufficient memory. They don't help narrow down the specific rule

andrewkern commented 1 year ago

@chriscrsmith can you please post your conda env e.g., conda list

chriscrsmith commented 1 year ago

nuked conda and it got past the smcpp install :grimacing:

chriscrsmith commented 1 year ago

getting OOMs with msmc on the Oregon cluster -- asking for more mem in the config helps.

Did anything change in the past week that would affect that?

andrewkern commented 1 year ago

perhaps, but I haven't tried to run the pipeline on the cluster since this PR

andrewkern commented 1 year ago

yep-- I've confirmed this OOM error on our system. I'll push a commit to n_t.snake that specifies the mem requirements for the run_msmc rule

chriscrsmith commented 1 year ago

I also noticed jobs were running past the 1 hr wall time. Increased the wall time, and my test run is still going from earlier today...

andrewkern commented 1 year ago

those same run_msmc jobs?

chriscrsmith commented 1 year ago

yeah. The snakemake command sort of idled. The logs showed msmc had finished 8/20 iterations or so

andrewkern commented 1 year ago

roger-- i can tack wall time on that rule too

chriscrsmith commented 1 year ago

mine just finished. I had set wall time to 300 as a test (observed time was 3hr 40min, including multiple batches of jobs)

andrewkern commented 1 year ago

300 hours?

chriscrsmith commented 1 year ago

minutes, I assume. I didn't specify the units. ... oops, it was 600 minutes!

default-resources:
        - time=600
andrewkern commented 1 year ago

okay i pushed a commit that tweaks the Oregon cluster profile, adds a mem_mb=24000 resource to the run_msmc rule (see the sketch below), and moves localrules to a higher level in the workflow.
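For reference, the resource part would look roughly like this (a sketch: only mem_mb=24000 is stated above, time=600 mirrors the 600-minute test, and input/output/shell are placeholders):

    rule run_msmc:
        input:
            "msmc_input.txt"  # placeholder
        output:
            "msmc_output.final.txt"  # placeholder
        resources:
            mem_mb=24000,  # per the commit described above
            time=600,      # assumption: minutes, matching the test run
        shell:
            "touch {output}"  # placeholder for the real msmc invocation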

I think at this point I'd like to merge this PR and harmonize the sweep side of things with what we have here.

chriscrsmith commented 1 year ago

sounds good!