nhoffman / dada2-nf

A Nextflow pipeline for processing 16S rRNA sequences using dada2

Orient reads after cutadapt #63

Closed crosenth closed 1 year ago

crosenth commented 1 year ago

@dhoogest @nhoffman

Based on the discussion (and sorry, I got distracted towards the end), the plan is to add vsearch --orient after cutadapt. Beyond that, it sounds like there was a concern about the unmerged reads coming out of dada2. Is that right?

Dan, do you have a test set of reads that we can use to reproduce the adapter issue?

dhoogest commented 1 year ago

Here's my take on what we'd like to try:

Obviously, this would have some implications for how counts are tallied, which we didn't really discuss.

As for test data, the test-its data set should have some suitable reads. If not, I can add a couple of files to that set as needed.

No need to do anything with the unmerged read files for now. The expectation is that the content of these files will get more accurate if we re-orient upstream of dada2, which is probably as much as we should expect from this pipeline. How we handle them is more of a consideration for the BLAST/classification steps downstream...

@nhoffman feel free to add/augment as needed

dhoogest commented 1 year ago

@crosenth @nhoffman One issue I uncovered today with some rudimentary testing: in practice, the 'off target' reads may not match exactly between R1 and R2 (for example, when R2 quality prevents alignment for a read pair but R1 was over threshold). I'm seeing this behavior in some preliminary tests of vsearch --orient, and it results in cardinality issues for downstream pipeline steps. There doesn't appear to be any handling for paired-end reads, so I think we'll need to roll our own logic to ensure that we retain the same seqs in the respective files (i.e. use the R1 file to test orientation, then split tags from R1 and R2 into files on the basis of the R1 orientation only).
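To make the proposal concrete, here is a minimal sketch of the "orient on R1 only" idea. It is not the pipeline's code: the function name, the in-memory `(read_id, sequence)` tuples, and the `r1_orientation` dict of `+`/`-`/`?` calls are all assumptions made for illustration. The point is only that both mates follow the R1 call, so each output bin holds the same read ids for R1 and R2.

```python
from collections import defaultdict


def split_pairs_by_r1_orientation(r1_reads, r2_reads, r1_orientation):
    """Partition read pairs into bins keyed by the orientation call for R1.

    r1_reads, r2_reads: lists of (read_id, sequence) tuples in the same order.
    r1_orientation: dict of read_id -> '+', '-' or '?' (hypothetical format).
    Reads with no call are placed in the '?' bin.
    """
    if len(r1_reads) != len(r2_reads):
        raise ValueError("R1 and R2 must contain the same number of reads")

    bins = defaultdict(lambda: {"R1": [], "R2": []})
    for (id1, seq1), (id2, seq2) in zip(r1_reads, r2_reads):
        if id1 != id2:
            raise ValueError(f"read pair out of sync: {id1} vs {id2}")
        # Only the R1 call is consulted; R2 always follows its mate.
        strand = r1_orientation.get(id1, "?")
        bins[strand]["R1"].append((id1, seq1))
        bins[strand]["R2"].append((id2, seq2))
    return dict(bins)


r1 = [("a", "ACGT"), ("b", "TTGA"), ("c", "GGCC")]
r2 = [("a", "TGCA"), ("b", "CCAT"), ("c", "AATT")]
calls = {"a": "+", "b": "-"}  # "c" has no call and lands in "?"
bins = split_pairs_by_r1_orientation(r1, r2, calls)
```

Because R2 is never tested on its own, a pair cannot end up with one mate retained and the other dropped, which is what causes the cardinality mismatch described above.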

crosenth commented 1 year ago

Pushed new branch 'plus_only':

https://github.com/nhoffman/dada2-nf/commit/a543dc7b2986f38998b29b39719c4540f6010be7

Basically, I am taking only plus-oriented sequences above 0.75 pident. Feel free to play with that threshold. Let me know if there are any questions or if I missed anything.
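For anyone following along without reading the commit, the filter described here amounts to something like the sketch below. This is an illustration, not the code in the branch: the `(read_id, strand, identity)` tuple layout, the function name, identity expressed as a fraction between 0 and 1, and the inclusive comparison at the threshold are all assumptions.

```python
def keep_plus_hits(hits, min_identity=0.75):
    """Return read ids whose best hit is plus-strand and meets the identity cutoff.

    hits: iterable of (read_id, strand, identity) tuples (hypothetical layout),
    with identity given as a fraction in [0, 1].
    """
    return [
        read_id
        for read_id, strand, identity in hits
        # Treating the threshold as inclusive is an assumption.
        if strand == "+" and identity >= min_identity
    ]


hits = [
    ("a", "+", 0.98),  # kept: plus strand, high identity
    ("b", "-", 0.99),  # dropped: minus strand
    ("c", "+", 0.60),  # dropped: below threshold
    ("d", "+", 0.75),  # kept: exactly at threshold
]
kept = keep_plus_hits(hits)
```

Raising `min_identity` trades sensitivity for fewer off-target reads, so it is worth checking against the test-its data before settling on a value.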