Eliminate bottlenecking of markdups

SPPearce commented 1 month ago

Description of feature

The pipeline currently seems to have a bottleneck at the alignment -> markdups step, where all alignment tasks have to complete before any markdups process begins. The pipeline already uses `groupKey` to determine how many files to expect from the splitting process, but this happens after the bwamem2 mapping step.

scwatts commented 1 month ago

I haven't been able to replicate the bottleneck as I understand it from your description.

For some additional context, each MarkDups task must receive all BAMs for a given sample before starting to process and merge into a single output BAM. So blocking in that sense on a per-sample basis is intended and required. However, there should not be blocking/bottlenecking where all alignments must complete before any MarkDups process begins.
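As a rough illustration of that per-sample behaviour, here is a toy shell model (not oncoanalyser's actual channel logic). The input format, the function name `emit_complete_groups`, and the BAM file names are all hypothetical. Each line carries a sample ID, the number of BAMs expected for that sample, and one BAM; a sample's group is emitted as soon as its expected count is reached, regardless of whether other samples are still incomplete.

```bash
#!/usr/bin/env bash
# Toy model of size-aware grouping (hypothetical, for illustration only).
# Input lines, in arrival order: "<sample_id> <expected_count> <bam>"
# A sample's group is printed as soon as its expected number of BAMs has
# arrived; it does not wait for any other sample.
emit_complete_groups() {
  awk '
    {
      sample = $1; expected = $2; bam = $3
      seen[sample]++
      files[sample] = (files[sample] == "" ? bam : files[sample] "," bam)
      if (seen[sample] == expected) {
        print sample ": " files[sample]
        delete seen[sample]
        delete files[sample]
      }
    }
  '
}

# Example: sb.tumor completes first, so it is emitted first even though
# sa.tumor is still waiting for its second BAM.
printf '%s\n' \
  'sb.tumor 2 sb.tumor.1.bam' \
  'sa.tumor 2 sa.tumor.1.bam' \
  'sb.tumor 2 sb.tumor.2.bam' \
  'sa.tumor 2 sa.tumor.2.bam' | emit_complete_groups
```

In this model the only waiting is within a sample, which matches the intended blocking described above.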

I've run oncoanalyser in stub mode and added an artificial 60 second delay to one sample in the bwa-mem2 process to evaluate flow through the NF channels. As expected, each MarkDups task runs as soon as its set of sample BAMs becomes available (see the attached timeline and the expandable section below to replicate).

If you're seeing different behaviour, could you please provide some additional details of your observations and how you're running oncoanalyser?

Attachment: execution_timeline_2024-08-05_12-36-17.html.gz

oncoanalyser bwa-mem2/MarkDups data flow check (click to show)

Get and patch oncoanalyser with an artificial 60 second delay in bwa-mem2 for the 'sa.tumor' sample

```bash
git clone https://github.com/nf-core/oncoanalyser
(cd oncoanalyser/ && git checkout 41010dd)

cat <<EOF > alignment-delay.patch
--- a/oncoanalyser/modules/local/bwa-mem2/mem/main.nf
+++ b/oncoanalyser/modules/local/bwa-mem2/mem/main.nf
@@ -64,6 +64,10 @@ process BWAMEM2_ALIGN {
     """
+    if [[ \${meta.sample_id} == 'sa.tumor' ]]; then
+        sleep 60;
+    fi
+
     touch \${output_fn}
     touch \${output_fn}.bai
EOF

patch -lp1 < alignment-delay.patch
```

Create samplesheet

```bash
cat <<EOF > samplesheet.csv
group_id,subject_id,sample_id,sample_type,sequence_type,filetype,info,filepath
sa_debug,sa,sa.normal,normal,dna,fastq,library_id:sa.normal.lb;lane:1,$(pwd)/temp/sa.normal.R1.fastq.gz;$(pwd)/temp/sa.normal.R2.fastq.gz
sa_debug,sa,sa.tumor,tumor,dna,fastq,library_id:sa.tumor.lb;lane:1,$(pwd)/temp/sa.tumor.R1.fastq.gz;$(pwd)/temp/sa.tumor.R2.fastq.gz
sb_debug,sb,sb.normal,normal,dna,fastq,library_id:sb.normal.lb;lane:1,$(pwd)/temp/sb.normal.R1.fastq.gz;$(pwd)/temp/sb.normal.R2.fastq.gz
sb_debug,sb,sb.tumor,tumor,dna,fastq,library_id:sb.tumor.lb;lane:1,$(pwd)/temp/sb.tumor.R1.fastq.gz;$(pwd)/temp/sb.tumor.R2.fastq.gz
EOF
```

Create local configuration

```bash
cat <<EOF > stub.config
params {
    genomes {
        'GRCh38_hmf' {
            fasta         = "$(pwd)/temp/GRCh38.fasta"
            fai           = "$(pwd)/temp/GRCh38.fai"
            dict          = "$(pwd)/temp/GRCh38.dict"
            bwamem2_index = "$(pwd)/temp/GRCh38_bwa-mem2_index/"
            gridss_index  = "$(pwd)/temp/GRCh38_gridss_index/"
            star_index    = "$(pwd)/temp/GRCh38_star_index/"
        }
    }

    ref_data_virusbreakenddb_path = '$(pwd)/temp/virusbreakenddb_20210401/'
    ref_data_hmf_data_path        = '$(pwd)/temp/hmf_bundle_38/'
    ref_data_panel_data_path      = '$(pwd)/temp/panel_bundle/tso500_38/'
}
EOF
```

Run oncoanalyser

```bash
nextflow run -config stub.config oncoanalyser/main.nf \
  \
  -stub \
  --create_stub_placeholders \
  \
  --max_cpus 1 \
  --max_memory 1.GB \
  \
  --mode wgts \
  --genome GRCh38_hmf \
  --input samplesheet.csv \
  --outdir output_stub/
```

scwatts commented 3 days ago

Closing the issue but please re-open if you'd like to discuss further!

SPPearce commented 2 days ago

> Closing the issue but please re-open if you'd like to discuss further!

Ah, completely forgot about this one, been busy with other bits ATM.

nf-core / oncoanalyser

Eliminate bottlenecking of markdups #74