nhoffman / dada2-nf

A Nextflow pipeline for processing 16S rRNA sequences using dada2
0 stars 2 forks source link

Preserve reads filtered out at dada2 filterAndTrim? #87

Open marykstewart opened 3 months ago

marykstewart commented 3 months ago

I think we need to revisit our trimming parameters for ThruPlex NGS16S libraries (more an issue for the clampi pipeline than this one). If it's straightforward to do, it might be helpful to preserve the reads filtered out at filterAndTrim for troubleshooting. I thought there was a flag in dada2 that would do this, but I didn't see one on re-reading the docs so this would not be nearly as simple as I was thinking, and depending upon the effort involved might not be worth it (ie, watching what remains may serve just as well). I talked with Chris and Noah about this last week, and said I'd follow up in github, but from my perspective no need to jump on this right away.

Chris, the lines I'd commented out to preserve the work/ directory are 742-746 https://gitlab.labmed.uw.edu/molmicro/clampi-ngs/-/blob/dev/SConstruct. Doing so still preserves the directory, but it looks to be empty now.

The Background:
We're observing significant read loss at the dada2 filter with ThruPlex NGS16S library preps, compare this ThruPlex/standard pair: https://share.labmed.uw.edu/molmicro/markergene/24N0316_SRS16S/report/
https://share.labmed.uw.edu/molmicro/markergene/24N0300_NGS16S/report/

This is likely due to minimum length requirements we've set in the dada2 parameters. When I spoke with Chris and Noah last week, I was somewhat mystified because I'd done an analysis where I reduced minLen from 100 to 20, but the filtering numbers didn't budge https://gitlab.labmed.uw.edu/molmicro/ops/-/issues/1912#note_127852.

I read the dada2 docs again and realized that while we do have a minimum length of 100 set in the pipeline, it's not actually doing anything b/c it operates after truncLen. We have truncLen set to R1=240 and R2=220 (https://gitlab.labmed.uw.edu/molmicro/clampi-ngs/-/blob/dev/data/dada_params_16S.json), and reads shorter than truncLen-trimLeft(=20) will be discarded (https://www.bioconductor.org/packages/release/bioc/manuals/dada2/man/dada2.pdf). So heavy losses at filterAndTrim are not surprising, given the adapter positions in reads of ThruPlex NGS16S libraries, see examples https://share.labmed.uw.edu/molmicro/markergene/24N0316_SRS16S/24R215_fastqc/multiqc_report.html#fastqc_adapter_content
https://share.labmed.uw.edu/molmicro/markergene/24N0315_SRS16S/fastqc_24R213/multiqc_report.html#fastqc_adapter_content
https://share.labmed.uw.edu/molmicro/markergene/24N0329_SRS16S/fastqc_24R224/multiqc_report.html#fastqc_adapter_content

crosenth commented 2 months ago

Hi @marykstewart , I have a PR to include filter_and_trim filtered and dropped sequences in the dada2-nf output. I will coordinate with @dhoogest to make sure they are be available for you in 16S pipeline to analyze

crosenth commented 2 months ago

New 2.0.3 release - https://github.com/nhoffman/dada2-nf/compare/2.0.2...2.0.3