nf-core / sarek

Analysis pipeline to detect germline or somatic variants (pre-processing, variant calling and annotation) from WGS / targeted sequencing
https://nf-co.re/sarek
MIT License

GATK4_HAPLOTYPECALLER not running in parallel #1617

Open sitems opened 3 months ago

sitems commented 3 months ago

Description of the bug

As a minimal example, I am running joint germline calling locally (on a system with 24 cores and 128 GB RAM) with just two WES samples and this nextflow.config file:

process {
    withName: 'FASTP' {
        cpus = 16
    }
    withName: 'BWAMEM1_MEM|BWAMEM2_MEM' {
        cpus = { cpus = 22 }
        memory = 100.GB
    }
    withName: 'GATK4_HAPLOTYPECALLER' {
        cpus = 20
        memory = 120.GB
    }
}

For simplicity, as --intervals, I am using the igenomes WGS bed file wgs_calling_regions_noseconds.hg38.bed (the same file sarek uses). I know that I should use dedicated exon intervals, but as I understand it, these WGS intervals should also provide some kind of parallelisation (and when I tried exon intervals before, the parallelisation problem was the same).

When checking htop, FASTP and BWA utilize multiple cores, but HaplotypeCaller does not (just 1-2 cores). Why?

Command used and terminal output

nextflow run nf-core/sarek -r 3.4.3 -profile docker -config $INPUT/nextflow.config -work-dir $OUT/workdir --joint_germline --wes --intervals $INPUT/wgs_calling_regions_noseconds.hg38.bed --trim_fastq --genome GATK.GRCh38 --input $INPUT/test_samplesheet.csv --outdir $OUT/output --tools haplotypecaller --max_memory 130.GB --max_cpus 23 --skip_tools bcftools,fastqc,haplotypecaller_filter,haplotyper_filter,markduplicates,markduplicates_report,mosdepth,multiqc --aligner bwa-mem2

Relevant files

nextflow.zip

System information

Nextflow version: 24.04.4.5917
Hardware: Desktop PC, 24 cores, 128GB RAM
Executor: local
Container engine: Docker
OS: Ubuntu 22.04
Version of nf-core/sarek: 3.4.{2,3}

FriederikeHanssen commented 3 months ago

task.cpus is not set here: https://github.com/nf-core/modules/blob/3e403b703c04d4af6bddb4f0b03b772b7365ffc0/modules/nf-core/gatk4/haplotypecaller/main.nf#L42

Do you know which Haplotypecaller tool parameter would enable that?

sitems commented 3 months ago

I have also tried not setting those config parameters, and many other things, but the problem remains the same. I cannot achieve haplotypecaller parallelisation in any way. If I understand it correctly, providing intervals should cause haplotypecaller to run on those intervals in parallel. Or am I wrong?

FriederikeHanssen commented 3 months ago

Depends on what you mean here. The intervals will allow sarek to spin up a bunch of independent haplotypecaller jobs. Then each of those could use one or more threads.

From your description I assumed the latter is not working as you expect. In general, each tool has a parameter that lets you specify the number of CPUs for that particular job. I can see that this is not set in the Haplotypecaller module, and I am not sure which of the Haplotypecaller parameters would correspond to it: https://gatk.broadinstitute.org/hc/en-us/articles/27007962724507-HaplotypeCaller

sitems commented 3 months ago

Thank you for the response. I meant the first thing, to "spin up a bunch of independent haplotypecaller jobs", but in htop I do not see any parallelisation. I first ran the pipeline on 40 samples without any --intervals. It took 5 days, and for most of that time it was running haplotypecaller on 1-2 cores. That is why I decided to experiment with many settings and alternatives, but I still cannot achieve any parallelisation. So how can I speed up the haplotypecaller part of the pipeline if I have 24 cores?

FriederikeHanssen commented 3 months ago

These process-level resource requests you showed are done on a per job basis.

If you request 20 CPUs for one job, Nextflow reserves them for that single job, and another job requesting the same resources won't fit alongside it, so one Haplotypecaller job is submitted after the other. Have you tried requesting fewer?
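To make the arithmetic concrete, here is a rough Python sketch. It assumes the local executor starts a job only when both its CPU and memory requests fit into what is still free on the machine; the function name is made up for illustration, and the second request (2 CPUs, 2 GB) is just a deliberately small example.

```python
def max_concurrent_jobs(total_cpus, total_mem_gb, job_cpus, job_mem_gb):
    """Upper bound on identical jobs that fit on one machine at the same time.

    Assumption: a job is admitted only if both its CPU request and its
    memory request fit into the remaining capacity.
    """
    by_cpu = total_cpus // job_cpus        # how many jobs the cores allow
    by_mem = total_mem_gb // job_mem_gb    # how many jobs the RAM allows
    return min(by_cpu, by_mem)             # the tighter limit wins


# Machine from the report: 24 cores, 128 GB RAM
print(max_concurrent_jobs(24, 128, 20, 120))  # 1
print(max_concurrent_jobs(24, 128, 2, 2))     # 12
```

With 20 CPUs and 120 GB per job, only one job fits at a time, however many intervals there are.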

sitems commented 3 months ago

Yes, I have also tried using no custom nextflow.config at all (so only the defaults, plus the '--max_cpus 23' parameter from the CLI), but the problem is the same: no parallelisation.

FriederikeHanssen commented 3 months ago

How much memory have you been requesting for the jobs? If you want to do a small test, you could set

withName: 'GATK4_HAPLOTYPECALLER' {
cpus = 2
memory = 2.GB
}

This will likely fail with OOM, but that is not the point here. Also, for testing it might make sense to remove all the other tools, which should reduce the overall number of jobs the pipeline submits. If other jobs are submitted and use up resources, it will also look as if things run one after the other. You can also check the produced timeline to see when each job ran.
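If you would rather have a number than eyeball the timeline, a small Python sketch like the following could help. It assumes you have already pulled a start and end time for each Haplotypecaller job out of the report yourself; the function name and the example times are hypothetical and do not come from a real run.

```python
def peak_concurrency(intervals):
    """Return the largest number of jobs running at the same moment.

    intervals: iterable of (start, end) times in any consistent unit.
    """
    events = []
    for start, end in intervals:
        events.append((start, 1))    # a job starts
        events.append((end, -1))     # a job ends
    # Sort by time; at equal times an end (-1) sorts before a start (+1),
    # so back-to-back jobs are not counted as overlapping.
    events.sort()
    running = peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak


# Hypothetical start/end times in minutes
serial = [(0, 10), (10, 20), (20, 30)]
overlapping = [(0, 10), (2, 12), (4, 14)]
print(peak_concurrency(serial))       # 1
print(peak_concurrency(overlapping))  # 3
```

A peak of 1 would mean the jobs really ran one after the other; anything higher means at least some of them overlapped.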

FriederikeHanssen commented 3 months ago

Hey! Has this been resolved?

sitems commented 3 months ago

Hi Friederike, not yet, but I'm working on it, so I will let you know.

sitems commented 1 month ago

Finally, these parameters work best for me. Haplotypecaller is using multiple cores now:

withName: 'GATK4_HAPLOTYPECALLER' {
cpus = 1
memory = 20.GB
time = 30.h
ext.args = { "--native-pair-hmm-threads 1 -ERC GVCF" }
}

We can close the issue.