nf-core / rnaseq

RNA sequencing analysis pipeline using STAR, RSEM, HISAT2 or Salmon with gene/isoform counts and extensive quality control.
https://nf-co.re/rnaseq
MIT License
911 stars 705 forks source link

Save transcriptome BAM files when using --save_umi_intermeds / --save_align_intermeds #931

Closed pcantalupo closed 1 year ago

pcantalupo commented 1 year ago

Description of the bug

I’m using v3.9 with --save_umi_intermeds and --save_align_intermeds to save all BAM files. After pipeline finishes, I find these BAM files in results

$ find results -type f -name '*bam' | cut -f 3 -d/ | sort
22-0283.Aligned.out.bam
22-0283.Aligned.toTranscriptome.out.bam
22-0283.markdup.sorted.bam
22-0283.sorted.bam
22-0283.umi_dedup.sorted.bam

However, SALMON_QUANT uses the BAM file 22-0283.umi_dedup.transcriptome.filtered.bam (see below). Shouldn’t it be copied to results/ when using those intermeds parameters?

$ cat work/86/65912548cbefe7d020967895bb8837/.command.sh
#!/bin/bash -ue
salmon quant \
    --geneMap Mus_musculus.GRCm39.105.gtf \
    --threads 6 \
    --libType=A \
    -t genome.transcripts.fa \
    -a 22-0283.umi_dedup.transcriptome.filtered.bam \   <========= BAM file not saved to `results/` 
    -o 22-0283

Additionally, I found 4 bam files in work/ that are not copied to results. Is this a design choice even when using the intermeds parameters?

$ find work -type f -name '*bam' | cut -f 4 -d/ | sort
22-0283.Aligned.out.bam
22-0283.Aligned.toTranscriptome.out.bam
22-0283.markdup.sorted.bam
22-0283.sorted.bam
22-0283.transcriptome.sorted.bam                    <=== NEW
22-0283.umi_dedup.sorted.bam
22-0283.umi_dedup.transcriptome.bam              <=== NEW
22-0283.umi_dedup.transcriptome.filtered.bam   <=== NEW
22-0283.umi_dedup.transcriptome.sorted.bam   <=== NEW

Thank you

Command used and terminal output

nextflow run /ihome/crc/install/genomics_nextflow/nf-core-rnaseq-3.9/workflow \
 --fasta /bgfs/projects/refs/ENSEMBL/mouse/Mus_musculus.GRCm39.dna.primary_assembly.fa \
 --gtf /bgfs/projects/refs/ENSEMBL/mouse/Mus_musculus.GRCm39.105.gtf \
 --save_reference \
 --save_trimmed --save_umi_intermeds --save_align_intermeds \
 --pseudo_aligner salmon --salmon_quant_libtype A \
 --with_umi \
 --umitools_extract_method "regex" \
 --umitools_bc_pattern2 "^(?P<umi_1>.{8})(?P<discard_1>.{6}).*" \
 --input samplesheet.csv --outdir results \
 -profile htc \
 -resume

Relevant files

No response

System information

pipeline version: 3.9 nextflow 22.04.3 singularity 3.9.6 slurm on HPC Linux

drpatelh commented 1 year ago

Fixed in https://github.com/nf-core/rnaseq/commit/07ddd71401b951eb71f90be8915ccca76dfcd0b1

Hi @pcantalupo ! This was intentional behaviour as I didn't see the point in publishing "too" many BAM files, especially since this is configurable now via a custom config. However, I have changed the default behaviour now to publish everything just in case these files are actually useful to debug stuff. I think I got all of them so please have a check after the next release and lemme know if I missed anything. (when using --save_umi_intermeds --save_align_intermeds --with_umi)

$ find ./ -type f -name "*.bam" | sort | grep "WT_REP1"
./star_salmon/WT_REP1.Aligned.out.bam
./star_salmon/WT_REP1.Aligned.toTranscriptome.out.bam
./star_salmon/WT_REP1.sorted.bam
./star_salmon/WT_REP1.transcriptome.sorted.bam
./star_salmon/WT_REP1.umi_dedup.sorted.bam
./star_salmon/WT_REP1.umi_dedup.transcriptome.bam
./star_salmon/WT_REP1.umi_dedup.transcriptome.filtered.bam
./star_salmon/WT_REP1.umi_dedup.transcriptome.sorted.bam

$ find ./ -type f -name "*.bai" | sort | grep "WT_REP1"
./star_salmon/WT_REP1.sorted.bam.bai
./star_salmon/WT_REP1.transcriptome.sorted.bam.bai
./star_salmon/WT_REP1.umi_dedup.sorted.bam.bai
./star_salmon/WT_REP1.umi_dedup.transcriptome.sorted.bam.bai
pcantalupo commented 1 year ago

I checked release 3.11.1 and it looks good. Thank you

drpatelh commented 1 year ago

Great! Thanks for confirming!