harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.

Error in rule create_db_intervals #114

Closed · Erythroxylum closed this issue 10 months ago

Erythroxylum commented 10 months ago

Hello, I am running snakemake on the Harvard FASRC shared partition and getting an error at this step:

    [Fri Aug 4 20:33:02 2023]
    checkpoint create_db_intervals:
        input: results/GCA_026413385.1/data/genome/GCA_026413385.1.fna, results/GCA_026413385.1/data/genome/GCA_026413385.1.fna.fai, results/GCA_026413385.1/data/genome/GCA_026413385.1.dict, results/GCA_026413385.1/intervals/master_interval_list.list
        output: results/GCA_026413385.1/intervals/db_intervals/intervals.txt, results/GCA_026413385.1/intervals/db_intervals
        log: logs/GCA_026413385.1/db_intervals/log.txt
        jobid: 2
        benchmark: benchmarks/GCA_026413385.1/db_intervals/benchmark.txt
        reason: Missing output files: results/GCA_026413385.1/intervals/db_intervals/intervals.txt; Input files updated by another job: results/GCA_026413385.1/intervals/master_interval_list.list, results/GCA_026413385.1/data/genome/GCA_026413385.1.fna.fai, results/GCA_026413385.1/data/genome/GCA_026413385.1.fna, results/GCA_026413385.1/data/genome/GCA_026413385.1.dict
        wildcards: refGenome=GCA_026413385.1
        resources: mem_mb=1197, mem_mib=1142, disk_mb=1197, disk_mib=1142, tmpdir=
    DAG of jobs will be updated after completion.

    Submitted job 2 with external jobid '65339256'.

    [Fri Aug 4 20:33:42 2023]
    Error in rule create_db_intervals:
        jobid: 2
        input: results/GCA_026413385.1/data/genome/GCA_026413385.1.fna, results/GCA_026413385.1/data/genome/GCA_026413385.1.fna.fai, results/GCA_026413385.1/data/genome/GCA_026413385.1.dict, results/GCA_026413385.1/intervals/master_interval_list.list
        output: results/GCA_026413385.1/intervals/db_intervals/intervals.txt, results/GCA_026413385.1/intervals/db_intervals
        log: logs/GCA_026413385.1/db_intervals/log.txt (check log file(s) for error details)
        conda-env: /n/holyscratch01/davislab/dwhite/snpArcher/.snakemake/conda/abb557a3ad4a64770d7de92755c7727c
        shell:

    gatk SplitIntervals -L results/GCA_026413385.1/intervals/master_interval_list.list         -O results/GCA_026413385.1/intervals/db_intervals -R results/GCA_026413385.1/data/genome/GCA_026413385.1.fna -scatter 270         -mode INTERVAL_SUBDIVISION         --interval-merging-rule OVERLAPPING_ONLY &> logs/GCA_026413385.1/db_intervals/log.txt
    ls -l results/GCA_026413385.1/intervals/db_intervals/*scattered.interval_list > results/GCA_026413385.1/intervals/db_intervals/intervals.txt

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    cluster_jobid: 65339256

    Error executing rule create_db_intervals on cluster (jobid: 2, external: 65339256, jobscript: /n/holyscratch01/davis_lab/dwhite/snpArcher/.snakemake/tmp.viry2rtf/snakejob.create_db_intervals.2.sh). For error details see the cluster log and the log files of the involved rule(s).
    checkpoint create_db_intervals.


There is no output in the folder results/GCA_026413385.1/intervals/db_intervals, although the master_interval_list.list has been created. The log file is attached: the slurm parameters and the command are printed, and then it is blank. When I run the java command exactly as it is printed in the log file, all of the scattered.interval_list files are created. log.txt

cademirch commented 10 months ago

Hey, sorry you're having trouble with the workflow. I'm not sure what's going on, but because the GATK tool isn't printing any progress to the log, I'm inclined to think the job is being killed by SLURM for overusing resources. Unfortunately, I'm not super familiar with SLURM.

That said, I do have access to a SLURM cluster now, so if you share your config.yaml and sample sheet, I can try to reproduce this.

Erythroxylum commented 10 months ago

Hi Cade, thanks for your help. I agree it seems likely the job was killed somehow. Unfortunately, the SLURM commands are pretty new to me as well, so I am attaching the sample sheet, config.yaml (as .txt), and run_pipeline.sh, as well as the SLURM config files.

samples_test10.csv config.yaml.txt resources.yaml.txt run_pipeline.sh.txt slurm-cluster_config.yml.txt slurm-config.yaml.txt

cademirch commented 10 months ago

Okay thanks, I'll take a look when I'm back to my computer in a few hours.

In the meantime, you can check the SLURM logs for the job in the directory you specified in the SLURM config YAML file.
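For example, assuming your cluster writes logs with the default slurm-<jobid>.out naming and has job accounting enabled (a sketch, not specific to FASRC), something like this should show the log and whether SLURM killed the job:

    # Print the SLURM log for the failed job (external jobid from the Snakemake output)
    cat slurm-65339256.out

    # Ask SLURM accounting for the job's state, exit code, and peak memory use
    sacct -j 65339256 --format=JobID,JobName,State,ExitCode,ReqMem,MaxRSS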

Erythroxylum commented 10 months ago

Here is the SLURM log:

    $ cat slurm-65339256.out
    Building DAG of jobs...
    Using shell: /usr/bin/bash
    Provided cores: 1 (use --cores to define parallelism)
    Rules claiming more threads will be scaled down.
    Provided resources: mem_mb=1197, mem_mib=1142, disk_mb=1197, disk_mib=1142
    Select jobs to execute...

    [Fri Aug 4 20:33:24 2023]
    checkpoint create_db_intervals:
        input: results/GCA_026413385.1/data/genome/GCA_026413385.1.fna, results/GCA_026413385.1/data/genome/GCA_026413385.1.fna.fai, results/GCA_026413385.1/data/genome/GCA_026413385.1.dict, results/GCA_026413385.1/intervals/master_interval_list.list
        output: results/GCA_026413385.1/intervals/db_intervals/intervals.txt, results/GCA_026413385.1/intervals/db_intervals
        log: logs/GCA_026413385.1/db_intervals/log.txt
        jobid: 0
        benchmark: benchmarks/GCA_026413385.1/db_intervals/benchmark.txt
        reason: Missing output files: benchmarks/GCA_026413385.1/db_intervals/benchmark.txt, results/GCA_026413385.1/intervals/db_intervals/intervals.txt, results/GCA_026413385.1/intervals/db_intervals
        wildcards: refGenome=GCA_026413385.1
        resources: mem_mb=1197, mem_mib=1142, disk_mb=1197, disk_mib=1142, tmpdir=/tmp
    DAG of jobs will be updated after completion.

    Activating conda environment: .snakemake/conda/abb557a3ad4a64770d7de92755c7727c_

    [Fri Aug 4 20:33:39 2023]
    Error in rule create_db_intervals:
        jobid: 0
        input: results/GCA_026413385.1/data/genome/GCA_026413385.1.fna, results/GCA_026413385.1/data/genome/GCA_026413385.1.fna.fai, results/GCA_026413385.1/data/genome/GCA_026413385.1.dict, results/GCA_026413385.1/intervals/master_interval_list.list
        output: results/GCA_026413385.1/intervals/db_intervals/intervals.txt, results/GCA_026413385.1/intervals/db_intervals
        log: logs/GCA_026413385.1/db_intervals/log.txt (check log file(s) for error details)
        conda-env: /n/holyscratch01/davislab/dwhite/snpArcher/.snakemake/conda/abb557a3ad4a64770d7de92755c7727c
        shell:

    gatk SplitIntervals -L results/GCA_026413385.1/intervals/master_interval_list.list         -O results/GCA_026413385.1/intervals/db_intervals -R results/GCA_026413385.1/data/genome/GCA_026413385.1.fna -scatter 270         -mode INTERVAL_SUBDIVISION         --interval-merging-rule OVERLAPPING_ONLY &> logs/GCA_026413385.1/db_intervals/log.txt
    ls -l results/GCA_026413385.1/intervals/db_intervals/*scattered.interval_list > results/GCA_026413385.1/intervals/db_intervals/intervals.txt

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

    Removing output files of failed job create_db_intervals since they might be corrupted:
    results/GCA_026413385.1/intervals/db_intervals
    Shutting down, this might take some time.
    Exiting because a job execution failed. Look above for error message
    slurmstepd: error: Detected 1 oom-kill event(s) in StepId=65339256.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

cademirch commented 10 months ago

Ah, there ya go: looks like SLURM killed the job for exceeding its memory limit (note the oom-kill event at the end of the log). You can increase the memory for this rule in the resources YAML file.

Erythroxylum commented 10 months ago

Yes. So the resources file allocates mem = 5000. If the job only needs 1197, why is it being killed? Or I guess the question is: how much do I need? This is a test of 10 samples out of 140, if that matters.

tsackton commented 10 months ago

It looks like the resources.yaml file has the rule misnamed: the rule is called create_db_intervals, not create_intervals. Because the names don't match, your mem = 5000 entry is never applied, and Snakemake falls back to its default estimate, the mem_mb=1197 you see in the log.

Can you try changing create_intervals in resources.yaml to create_db_intervals? If it still fails, try increasing mem to 10000.
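Something like this sketch, assuming resources.yaml maps each rule name to its memory request in MB (mirroring your existing create_intervals entry):

    # resources.yaml (sketch): per-rule memory request in MB
    create_db_intervals:
      mem: 10000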

Erythroxylum commented 10 months ago

OK, preparing to do that. Should I now run run_pipeline_update.sh?

tsackton commented 10 months ago

No, just run the pipeline again; it will pick up from the failed job.
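For example, assuming run_pipeline.sh is the same submission script you used before (a sketch of the resubmission, not a snpArcher-specific command):

    # Resubmit the same run script; Snakemake reuses existing outputs
    # and resumes from the failed create_db_intervals job.
    sbatch run_pipeline.sh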

Erythroxylum commented 10 months ago

I reran with the revised resources.yaml, and that step appears to have completed successfully: the db_intervals/intervals.txt file has been created. However, the job was then killed for exceeding memory during create_gvcf_intervals. I added a similar line for that rule to resources.yaml with mem: 5000, but that failed; I also tried the original "create_intervals: mem: 5000" entry, but this failed at rule compute_d4 (err and log files attached). The output files for compute_d4 are missing; the only file in the callable_sites folder is results/GCA_026413385.1/callable_sites/test10_callable_sites_map.bed.

err.txt PREP0329_DWhit15901A_A02v1_grac_1465_S18_L004.txt

cademirch commented 10 months ago

@Erythroxylum I've fixed the resources issue; the fix will be merged soon. I can replicate the mosdepth issue and am looking into it now. Thanks for bringing it to our attention.

tsackton commented 10 months ago

@Erythroxylum These should now be fixed. If you can pull the latest version of snpArcher and let us know whether you are still having problems, that would be great! Thanks for using snpArcher and for reporting problems.

#117 resolves the mosdepth issue; #116 resolves the resources issue.