torognes / vsearch

Versatile open-source tool for microbiome analysis

Consequences of using vsearch on NovaSeq data #549

Closed slambrechts closed 1 month ago

slambrechts commented 6 months ago

Hi,

I know there are consequences of using dada2 on NovaSeq data (e.g. https://github.com/benjjneb/dada2/issues/791), but do you know if there are similar problems with using vsearch on NovaSeq data?

Best, Sam

frederic-mahe commented 6 months ago

@slambrechts you are probably referring to NovaSeq's simplified quality encoding.

The short answer is: no known adverse effect yet.

Only marginal effects are known. For instance, vsearch may report average or median fastq quality values that do not belong to the reduced set of quality values. vsearch commands that recompute quality values, such as --fastq_mergepairs, may be more affected. Nothing has shown up in our tests so far.
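To illustrate the point about averages and medians, here is a minimal Python sketch. It assumes a binned set of four quality values (2, 12, 23, 37), which is an illustrative assumption and not a list taken from this thread, and it uses a plain arithmetic mean and median, which may differ from how vsearch computes its own statistics. The helper name `decode_phred33` is hypothetical.

```python
from statistics import mean, median

# Assumed binned quality set, for illustration only (not taken from this thread).
BINNED = {2, 12, 23, 37}


def decode_phred33(quality_string):
    """Convert a FASTQ quality string (ASCII offset 33) into integer Q values."""
    return [ord(symbol) - 33 for symbol in quality_string]


# 'F' encodes Q37, '8' encodes Q23 and '-' encodes Q12 with offset 33.
quals = decode_phred33("FFF88-")

# Every per-base value belongs to the assumed binned set...
print(all(q in BINNED for q in quals))

# ...but summary statistics computed from them need not.
print(mean(quals), mean(quals) in BINNED)
print(median(quals), median(quals) in BINNED)
```

The per-base values all come from the binned set, while the mean and the median fall between bins, which is the kind of marginal effect described above.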

slambrechts commented 6 months ago

@frederic-mahe ok great, thank you for the info. If I understand correctly, there is also no need to adjust maxee for fastq filtering?

frederic-mahe commented 6 months ago

Earlier this year, I listed the following reduced sets of quality values (see issue #474):

These are subsets of usual quality sets, so I do not expect any particular difficulties for vsearch.

"Also no need to adjust maxee for fastq filtering?"

When using --fastq_filter, --fastq_mergepairs or --fastx_filter, option --fastq_maxee discards sequences with an expected error greater than the specified value. There is no default value for --fastq_maxee, so there is no adjustment to be made at the code level. Also, the expected error used by --fastq_maxee is computed as the sum of 10^-(Q/10) over all positions of the sequence, which should not be affected if a reduced set of quality values is used.
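As a rough illustration of that formula, the following Python sketch computes the expected error of a quality string and applies a maxee-style threshold. It is not vsearch's code: the function names `expected_error` and `passes_maxee` are hypothetical, an ASCII offset of 33 is assumed, and the example quality string uses an assumed binned set (Q37, Q23, Q12).

```python
def expected_error(quality_string, offset=33):
    """Sum of per-base error probabilities, 10^-(Q/10), over the whole read."""
    return sum(10 ** (-(ord(symbol) - offset) / 10) for symbol in quality_string)


def passes_maxee(quality_string, maxee, offset=33):
    """Keep a read unless its expected error is greater than maxee."""
    return expected_error(quality_string, offset) <= maxee


# Illustrative read: three bases at Q37 ('F'), two at Q23 ('8'), one at Q12 ('-').
read_quality = "FFF88-"

print(expected_error(read_quality))      # roughly 0.074
print(passes_maxee(read_quality, 1.0))   # kept with a threshold of 1.0
print(passes_maxee(read_quality, 0.05))  # discarded with a threshold of 0.05
```

The formula only looks at each Q value individually, so it behaves the same whether the values come from a full or a reduced quality set. Only the granularity of the resulting sums changes.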

I could be wrong though, please feel free to suggest tests or configurations.

frederic-mahe commented 1 month ago

Basic tests added to our test suite (https://github.com/frederic-mahe/vsearch-tests/commit/bd064e7bf942b7f9a83e3801041e58050342e4eb)