Rework vsearch_svs process to ensure that clustering only occurs within a sampleid rather than across all sequences

nhoffman / dada2-nf

A Nextflow pipeline for processing 16S rRNA sequences using dada2


Rework vsearch_svs process to ensure that clustering only occurs within a sampleid rather than across all sequences #65

Closed: dhoogest closed this issue 1 year ago

dhoogest commented 1 year ago

@nhoffman @crosenth this approach appears to address our intended use of the current vsearch_svs step, which I think is meant to combine 'reverse'-oriented SVs with the complementary forward 'passed' seqs. The change should also fix the problem described here, where inaccurate SV clustering made an SV appear in samples where no association was expected, or seen, prior to clustering.

We should probably spell out the desired logic fully as part of implementing this change.

dhoogest commented 1 year ago

@nhoffman recommends changing the logic to iterate over specimens and run vsearch with global pairwise alignment, with identity computed on the shorter of the pair.
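To spell out the per-specimen part of that logic, here is a rough sketch rather than the pipeline's actual code. The record layout (specimen, orientation, SV id, sequence) and both function names are assumptions made for illustration. The point is only that candidate reverse/forward pairs are generated inside one specimen and never across specimens.

```python
from collections import defaultdict
from itertools import product


def group_by_specimen(records):
    """Bucket SVs by specimen, keeping forward and reverse SVs apart.

    `records` is a hypothetical iterable of
    (specimen, orientation, sv_id, sequence) tuples.
    """
    groups = defaultdict(lambda: {"forward": [], "reverse": []})
    for specimen, orientation, sv_id, sequence in records:
        groups[specimen][orientation].append((sv_id, sequence))
    return dict(groups)


def candidate_pairs(groups):
    """Yield (specimen, reverse_id, forward_id) without crossing specimens."""
    for specimen, by_orientation in sorted(groups.items()):
        pairs = product(by_orientation["reverse"], by_orientation["forward"])
        for (rev_id, _), (fwd_id, _) in pairs:
            yield specimen, rev_id, fwd_id


records = [
    ("s1", "forward", "sv-1", "ACGTACGT"),
    ("s1", "reverse", "sv-2", "ACGTAC"),
    ("s2", "forward", "sv-3", "ACGTACGT"),  # no reverse SV in s2
]
pairs = list(candidate_pairs(group_by_specimen(records)))
```

With this toy input, sv-2 is only ever compared against sv-1. The identical forward sequence in s2 is never a candidate target, and that cross-sample association is what we want to rule out.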

crosenth commented 1 year ago

Reverse and forward reads are now grouped by specimen before clustering. Note that I also made the --iddef 0 update.
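For reference, vsearch documents --iddef 0 as the CD-HIT definition of identity: matching columns divided by the length of the shorter sequence. The sketch below shows that arithmetic and one way a per-specimen command line might be assembled. The choice of --usearch_global, the 0.97 threshold, the file names, and both function names are assumptions for illustration, not what the linked commit does.

```python
def identity_iddef0(matching_columns, len_a, len_b):
    """CD-HIT style identity: matching columns / shorter sequence length."""
    return matching_columns / min(len_a, len_b)


def vsearch_command(specimen, query_fasta, db_fasta, identity=0.97):
    """Build (but do not run) a hypothetical per-specimen vsearch call."""
    return [
        "vsearch",
        "--usearch_global", query_fasta,  # reverse SVs for this specimen
        "--db", db_fasta,                 # forward SVs for the same specimen
        "--id", str(identity),            # assumed threshold
        "--iddef", "0",                   # identity over the shorter sequence
        "--userout", f"{specimen}.hits.tsv",
        "--userfields", "query+target+id",
    ]


# A 6 nt reverse SV fully contained in an 8 nt forward SV scores 1.0
# under --iddef 0, because the denominator is the shorter length.
score = identity_iddef0(6, 8, 6)
cmd = vsearch_command("s1", "s1_reverse.fasta", "s1_forward.fasta")
```

Using the shorter sequence as the denominator means a shorter SV that matches end to end is not penalized for the length difference.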

https://github.com/nhoffman/dada2-nf/commit/c85f5d464b6a53cdbfc031398c6f5a2d2ce3e6fc#diff-6401496ba455b9488ffa902a6e4d7732b2c60ff2d77c5c3ef96b28a7ac7d3b28R580

crosenth commented 1 year ago

Also note the new counts.csv file. I also made barcodecop the first step in the pipeline so that the index file(s) are sorted out up front, which avoids compounding the if/else logic further down in the pipeline for the different index file variations.

crosenth commented 1 year ago

https://github.com/nhoffman/dada2-nf/blob/master/CHANGES.md#119