Closed aakarsh-anand closed 3 years ago
You are right that in general as best practice each process should only handle one thing. But this is a special case, because BWA-MEM2 is very efficient in parallalization, and the plain text SAM output is too large. If we don't compress it along the way, the hard disk IO is going to limit its performance. So piping aligner output of SAM directly to samtools is a very commonly accepted stratagy.
@zhuchcn Absolutely correct this is the right way to handle this case. But I think worth updating the documentation to make this explicit to readers how/why we made that decision (unless it's already documented of course!)
@zhuchcn @aakarsh-anand do we have the best practice doc on Confluence? We can modify the doc to add the exception. I agree. In general, we should do one thing per process but piping here will save us storage, time and cost so that's more important here.
I think Aakarsh is implementing HISAT2 and wanted to make sure how HISAT2 should run in align-DNA?
@pboutros I think the best way to document this might be inline documentation? Because this isn't something that general users would run into, unless some one reads the code and wonder why it was implemented in this way. We don't have a standard for inline documentation yet but it is on the to-discuss list of the software development best practice working group @aholmes
Yup I'm comfortable with whereever you might document. I would think exceptions should generally get documented in both code (inline) and overall docs with each referring to the other.
@tyamaguchi-ucla Just updated our nextflow guide.
@zhuchcn Thanks for the explanation. I'll open a request on the hisat2 docker for samtools to be added. Should I add comments in the pipeline explaining why this is done this way?
@aakarsh-anand That sounds good!
I have created a new Dockerfile (https://github.com/uclahs-cds/docker-HISAT2/blob/initial_branch/align-DNA/Dockerfile) and DockerHub repo (https://hub.docker.com/repository/docker/blcdsdockerregistry/hisat2_samtools-1.11) for the HISAT2_SAMtools container. This is because adding SAMtools adds ~ 100 MB and it will save bandwidth for people running the align-RNA pipeline to not use this larger version, in which SAMtools is not needed.
Process AlignFastq2Bam currently runs bwa-mem2 alignment and follows this with SAM conversion. These processes should be independent to be consistent with the 'best practices' for pipelines where each process does one thing.
See these lines: https://github.com/uclahs-cds/pipeline-align-DNA/blob/8fa311e95ff1ae55959db0df061f50f397f52f3e/pipeline/modules/align-dna/fastq2bam.nf#L33-L51