Separate process Align_Fastq2Bam into two separate processes (bwa-mem2 and samtools)

uclahs-cds / pipeline-align-DNA

Nextflow pipeline to align paired DNA FASTQs and sort, mark duplicates, and index resulting alignment

https://uclahs-cds.github.io/pipeline-align-DNA/

GNU General Public License v2.0

4 stars 1 forks source link

Separate process Align_Fastq2Bam into two separate processes (bwa-mem2 and samtools) #111

Closed aakarsh-anand closed 3 years ago

aakarsh-anand commented 3 years ago

Process AlignFastq2Bam currently runs bwa-mem2 alignment and follows this with SAM conversion. These processes should be independent to be consistent with the 'best practices' for pipelines where each process does one thing.

See these lines: https://github.com/uclahs-cds/pipeline-align-DNA/blob/8fa311e95ff1ae55959db0df061f50f397f52f3e/pipeline/modules/align-dna/fastq2bam.nf#L33-L51

zhuchcn commented 3 years ago

You are right that in general as best practice each process should only handle one thing. But this is a special case, because BWA-MEM2 is very efficient in parallalization, and the plain text SAM output is too large. If we don't compress it along the way, the hard disk IO is going to limit its performance. So piping aligner output of SAM directly to samtools is a very commonly accepted stratagy.

pboutros commented 3 years ago

@zhuchcn Absolutely correct this is the right way to handle this case. But I think worth updating the documentation to make this explicit to readers how/why we made that decision (unless it's already documented of course!)

tyamaguchi-ucla commented 3 years ago

@zhuchcn @aakarsh-anand do we have the best practice doc on Confluence? We can modify the doc to add the exception. I agree. In general, we should do one thing per process but piping here will save us storage, time and cost so that's more important here.

I think Aakarsh is implementing HISAT2 and wanted to make sure how HISAT2 should run in align-DNA?

zhuchcn commented 3 years ago

@pboutros I think the best way to document this might be inline documentation? Because this isn't something that general users would run into, unless some one reads the code and wonder why it was implemented in this way. We don't have a standard for inline documentation yet but it is on the to-discuss list of the software development best practice working group @aholmes

pboutros commented 3 years ago

Yup I'm comfortable with whereever you might document. I would think exceptions should generally get documented in both code (inline) and overall docs with each referring to the other.

zhuchcn commented 3 years ago

@tyamaguchi-ucla Just updated our nextflow guide.

aakarsh-anand commented 3 years ago

@zhuchcn Thanks for the explanation. I'll open a request on the hisat2 docker for samtools to be added. Should I add comments in the pipeline explaining why this is done this way?

zhuchcn commented 3 years ago

@aakarsh-anand That sounds good!

jmlivingstone commented 3 years ago

I have created a new Dockerfile (https://github.com/uclahs-cds/docker-HISAT2/blob/initial_branch/align-DNA/Dockerfile) and DockerHub repo (https://hub.docker.com/repository/docker/blcdsdockerregistry/hisat2_samtools-1.11) for the HISAT2_SAMtools container. This is because adding SAMtools adds ~ 100 MB and it will save bandwidth for people running the align-RNA pipeline to not use this larger version, in which SAMtools is not needed.