harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.

Unknown errors with big datasets #229

Open · max-hence opened this issue 2 weeks ago

max-hence commented 2 weeks ago

Hi,

I managed to get snpArcher working on datasets with medium-sized genomes (400 Mb), but I get errors with bigger genomes (2 Gb), when jobs take much more time and resources. I think I set slurm/config.yaml properly to request large resources, and the cluster I'm using is supposed to handle such settings, but I still get errors like this one, for instance at the bwa_map rule:

Error in rule bwa_map:
    message: SLURM-job '13562883' failed, SLURM status is: 'NODE_FAIL'. For further error details see the cluster/cloud log and the log files of the involved rule(s).
    jobid: 252
    input: results/GCA_902167145.1/data/genome/GCA_902167145.1.fna, results/GCA_902167145.1/filtered_fastqs/SAMN15515513/SRR12460375_1.fastq.gz, results/GCA_902167145.1/filtered_fastqs/SAMN15515513/SRR12460375_2.fastq.gz, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.sa, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.pac, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.bwt, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.ann, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.amb, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.fai
    output: results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam, results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam.bai
    log: logs/GCA_902167145.1/bwa_mem/SAMN15515513/SRR12460375.txt, /scratch/mbrault/snpcalling/zmays_parviglumis_PRJNA641889/.snakemake/slurm_logs/rule_bwa_map/GCA_902167145.1_SAMN15515513_SRR12460375/13562883.log (check log file(s) for error details)
    conda-env: /scratch/mbrault/snpcalling/zmays_parviglumis_PRJNA641889/.snakemake/conda/8ca636c300f965c6ac864e051945e276_
    shell:
        bwa mem -M -t 8 -R '@RG\tID:6E8\tSM:SAMN15515513\tLB:6E8\tPL:ILLUMINA' results/GCA_902167145.1/data/genome/GCA_902167145.1.fna results/GCA_902167145.1/filtered_fastqs/SAMN15515513/SRR12460375_1.fastq.gz results/GCA_902167145.1/filtered_fastqs/SAMN15515513/SRR12460375_2.fastq.gz 2> logs/GCA_902167145.1/bwa_mem/SAMN15515513/SRR12460375.txt | samtools sort -o results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam - && samtools index results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam.bai
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    external_jobid: 13562883

And in the .snakemake/slurm_logs/rule_bwa_map/GCA_902167145.1_SAMN15515513_SRR12460375/13562883.log:

localrule bwa_map:
    input: results/GCA_902167145.1/data/genome/GCA_902167145.1.fna, results/GCA_902167145.1/filtered_fastqs/SAMN15515513/SRR12460375_1.fastq.gz, results/GCA_902167145.1/filtered_fastqs/SAMN15515513/SRR12460375_2.fastq.gz, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.sa, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.pac, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.bwt, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.ann, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.amb, results/GCA_902167145.1/data/genome/GCA_902167145.1.fna.fai
    output: results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam, results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam.bai
    log: logs/GCA_902167145.1/bwa_mem/SAMN15515513/SRR12460375.txt
    jobid: 0
    benchmark: benchmarks/GCA_902167145.1/bwa_mem/SAMN15515513_SRR12460375.txt
    reason: Forced execution
    wildcards: refGenome=GCA_902167145.1, sample=SAMN15515513, run=SRR12460375
    threads: 32
    resources: mem_mb=100000, mem_mib=95368, disk_mb=43245, disk_mib=41242, tmpdir=/tmp, mem_mb_reduced=90000, slurm_partition=ecobio,genouest, slurm_account=mbrault, runtime=11520, cpus_per_task=32

Activating conda environment: .snakemake/conda/8ca636c300f965c6ac864e051945e276_

[E::hts_open_format] Failed to open file "results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam.tmp.0000.bam" : File exists
[E::hts_open_format] Failed to open file "results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam.tmp.0001.bam" : File exists
[E::hts_open_format] Failed to open file "results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam.tmp.0002.bam" : File exists
[E::hts_open_format] Failed to open file "results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam.tmp.0003.bam" : File exists
[E::hts_open_format] Failed to open file "results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam.tmp.0004.bam" : File exists
etc...

But when I look at that particular job on the SLURM cluster, I find no errors:

JobID           JobName      State    Elapsed     ReqMem     MaxRSS  MaxVMSize  AllocCPUS 
------------ ---------- ---------- ---------- ---------- ---------- ---------- ---------- 
13562883     e5af4995-+  COMPLETED   08:26:43    100000M                               32 
13562883.ba+      batch  COMPLETED   08:26:43               129620K   5013688K         32 
13562883.ex+     extern  COMPLETED   08:26:43                  912K    144572K         32 
13562883.0   python3.11  COMPLETED   08:26:05             26104628K  33043144K         32 
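
For reference, the table above is SLURM accounting output, pulled with a query roughly like the following (the column list is inferred from the table headers):

    sacct -j 13562883 \
        --format=JobID,JobName,State,Elapsed,ReqMem,MaxRSS,MaxVMSize,AllocCPUS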

Do you have any clue what could cause such an error? I've attached the slurm/config.yaml in case it's needed: config.yaml.txt

Thank you very much,

Max Brault

tsackton commented 2 weeks ago

It looks like the particular bwa_map job whose log you posted is failing because there is an existing set of temp files from the samtools sort command, likely left over from a previous run that crashed before cleanup could finish. I would start by deleting the SRR12460375.bam.tmp.*.bam files and rerunning, along the lines sketched below.
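
Something like this should do it, using the paths from your log (the snakemake invocation is just a sketch; substitute your usual profile and arguments):

    # remove the leftover samtools sort temp files from the crashed attempt
    rm results/GCA_902167145.1/bams/preMerge/SAMN15515513/SRR12460375.bam.tmp.*.bam
    # then resubmit; --rerun-incomplete tells Snakemake to redo jobs that were
    # interrupted before their outputs were finalized
    snakemake --profile <your-slurm-profile> --rerun-incomplete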

This doesn't look like a SLURM/resources error, although I'm not entirely sure why the error from the command is not being propagated to SLURM.
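
For what it's worth, the "bash strict mode" note in the Snakemake error means the job script runs under set -euo pipefail, so a failure anywhere in the bwa mem | samtools sort pipeline should abort the script with a nonzero exit code. A minimal illustration of that behavior (not snpArcher code):

    set -euo pipefail
    false | sort          # under pipefail, the pipeline's exit status is 1 (from `false`)
    echo "never reached"  # -e aborts the script before this line runs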