uclahs-cds / pipeline-align-DNA

Nextflow pipeline to align paired DNA FASTQs and sort, mark duplicates, and index resulting alignment
https://uclahs-cds.github.io/pipeline-align-DNA/
GNU General Public License v2.0

Adding fastp as a configurable option #272

Open raagagrawal opened 1 year ago

raagagrawal commented 1 year ago

I believe that fastp would improve the current align-DNA pipeline and workflow.

fastp is an all-in-one FASTQ preprocessor. It performs read filtering, base correction, quality control, and adapter trimming. It also produces a variety of QC plots that can be used to make decisions around sample inclusion/exclusion in further analysis.

Currently, fastp is only offered in the align-RNA pipeline, where I find it very useful because it removes the need to run the software separately. Offering fastp as a configurable option in align-DNA would create feature parity between the pipelines and also save users significant time and storage.

Today, I run fastp before align-DNA and store a separate set of FASTQ files on top of the ones already registered. Multiplied across many projects, this extra storage becomes non-negligible. If adapter trimming were done as part of the pipeline and the trimmed FASTQs were deleted when each run concluded, the lab would save that storage space.
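To make the manual step concrete, here is a minimal Python sketch of the trim-then-clean-up workflow described above. It is not part of align-DNA: the function names, the `trimmed_` file prefix, and the report file names are hypothetical, and the only fastp options used are the paired-end input/output flags (`-i`, `-I`, `-o`, `-O`), `--detect_adapter_for_pe`, `--json`, `--html`, and `--thread`.

```python
from pathlib import Path


def build_fastp_command(r1, r2, out_dir, threads=4):
    """Build the fastp argument list for one paired-end sample.

    Returns the command plus the two trimmed FASTQ paths so the caller
    can delete them once alignment has finished.
    """
    out_dir = Path(out_dir)
    # Hypothetical naming scheme: prefix the original file names.
    trimmed_r1 = out_dir / ("trimmed_" + Path(r1).name)
    trimmed_r2 = out_dir / ("trimmed_" + Path(r2).name)
    command = [
        "fastp",
        "-i", str(r1), "-I", str(r2),                  # paired-end inputs
        "-o", str(trimmed_r1), "-O", str(trimmed_r2),  # trimmed outputs
        "--detect_adapter_for_pe",                     # adapter auto-detection for paired-end reads
        "--json", str(out_dir / "fastp.json"),         # machine-readable QC report
        "--html", str(out_dir / "fastp.html"),         # QC plots
        "--thread", str(threads),
    ]
    return command, [trimmed_r1, trimmed_r2]


def remove_trimmed_fastqs(paths):
    """Delete the intermediate trimmed FASTQs; paths that do not exist are skipped."""
    for path in paths:
        Path(path).unlink(missing_ok=True)
```

In use, the command would be passed to something like `subprocess.run(command, check=True)`, the trimmed files handed to align-DNA, and `remove_trimmed_fastqs` called after the run concludes so that only the QC reports remain.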

tyamaguchi-ucla commented 1 year ago

For QC, yes, we will be developing sample- and cohort-level QC pipelines.

As for hard-clipping, aligners typically soft-clip reads contaminated by adapters. Given the potential compute and storage costs, I don't think we would need this option for most of our datasets, although it would be helpful to see benchmarking results in the context of compute costs and downstream data accuracy.