snakemake-workflows / dna-seq-gatk-variant-calling

This Snakemake pipeline implements the GATK best-practices workflow
MIT License
242 stars 147 forks source link

Haplotype caller with intervals runs slow #26

Closed vetsuisse-unibe closed 4 years ago

vetsuisse-unibe commented 4 years ago

Hi, I've recently started using Snakemake for my variant calling pipeline. We use Slurm at our facility. I observed that the jobs launched via Snakemake are running much slower than when I run them manually. Is there any reason? My Snakemake file content is here:

SAMPLES=["K350_", "K443_", "K450_", "K452_", "K469_", "K471_", "K472_", "K473_", "K484_", "K485_"]
intervals=["0000", "0001", "0002", "0003", "0004", "0005", "0006", "0007", "0008", "0009", "0010", "0011", "0012", "0013", "0014", "0015", "0016", "0017", "0018", "0019", "0020", "0021", "0022", "0023"]
REF = '/data/references_deprecated/cat/felCat9/bwaIndex/felCat9.fa'

def get_intervals(wildcards):
    interval=wildcards.interval
    intervalFile="interval-files-folder/" + interval + "-scattered.interval_list"
    return intervalFile

rule all:
    input:
        expand('data/IntervalGvcf/{sample}.{interval}.g.vcf.gz',sample=SAMPLES,interval=intervals)

rule GVCF:
    input:
        ref = REF,
        bam = 'data/dedup_bams/{sample}.dedup.bam'
    output: 
        vcf='data/IntervalGvcf/{sample}.{interval}.g.vcf.gz',
        index='data/IntervalGvcf/{sample}.{interval}.g.vcf.gz.tbi'
    log:
        'logs/intervalGVCF/{sample}_{interval}.log'
    params:
         intr=get_intervals
    threads: 8
    shell:
        '''
        module load vital-it;
        module add UHTS/Analysis/GenomeAnalysisTK/4.1.3.0;
        java -Xmx4G -Djava.io.tmpdir=/scratch/tmp_vj/  -XX:+UseParallelGC -XX:ParallelGCThreads=1 -jar $GATK_PATH/bin/GenomeAnalysisTK.jar HaplotypeCaller  -R {REF}  -I {input.bam} -pairHMM AVX_LOGLESS_CACHING --native-pair-hmm-threads 8 -L {params.intr} -O {output.vcf} -ERC GVCF -stand-call-conf 10 2> {log}
        '''
vetsuisse-unibe commented 4 years ago

I have run the above pipeline on SLURM HPC cluster in the following manner snakemake -s bqsrSnakefile --printshellcmds --drmaa " --ntasks=1 --mem=5000 --mincpus=8 --time=24:00:00" --latency-wait 300 --jobs 100 --jobname pXXX_{jobid}

johanneskoester commented 4 years ago

Such a questions is best asked to the slurm experts on stack overflow. Tag is with the snakemake tag: https://stackoverflow.com/questions/tagged/snakemake

I can only say here that snakemake will simply run your shell command inside of a cluster job. Maybe the parametrization of the cluster job is not optimal?

Apart from that, this rule has various other issues. Portability is not there (because it uses modules) scalability is hampered (because you hardcode threads in the shell command instead of using {threads} to get whatever is defined in the rule itself. Have a look at this repo, which provides a gold standard GATK best practice workflow as it makes sense in snakemake.