torognes / vsearch

Versatile open-source tool for microbiome analysis

forward read trimming and filtering (Minardi et al. 2021) #534

Status: closed by frederic-mahe 9 months ago

frederic-mahe commented 9 months ago

Lasse Krøger Eliassen asked about the correct way to implement forward read trimming and filtering, as described in Minardi et al. 2021.

"Forward reads were trimmed to 200 bp in length approximately corresponding to the point at which the lower quartile fell below 20. Low quality reads were removed when estimated errors were greater than two and truncated if quality scores fell below two."

The proposed implementation is correct:

vsearch \
    --fastq_filter "some input" \
    --fastq_trunclen 200 \
    --fastq_maxee 2 \
    --fastq_maxns 0 \
    --fastq_truncqual 2 \
    --fastqout "some output"

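It could be improved as follows (--fastx_filter accepts both fasta and fastq input, and --fastq_trunclen_keep truncates reads to 200 bp but, unlike --fastq_trunclen, does not discard shorter reads):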

vsearch \
    --fastx_filter "some input" \
    --fastq_trunclen_keep 200 \
    --fastq_maxee 2.0 \
    --fastq_maxns 0 \
    --fastq_truncqual 2 \
    --fastqout "some output"
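
For reference, the --fastq_maxee threshold applies to the expected number of errors in a read, which is the sum of the per-base error probabilities 10^(-Q/10). The sketch below illustrates that arithmetic for a single quality string. It assumes Phred+33 encoding, and the function name expected_errors is hypothetical (it is not part of vsearch):

```shell
# Expected number of errors for one FASTQ quality string.
# Assumption: Phred+33 encoding (quality = ASCII code - 33).
expected_errors() {
    printf '%s\n' "$1" | awk '
        # Build a character-to-ASCII-code lookup table (awk has no ord()).
        BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
        {
            ee = 0
            n = length($0)
            for (i = 1; i <= n; i++) {
                q = ord[substr($0, i, 1)] - 33
                ee += 10 ^ (-q / 10)   # error probability of this base
            }
            printf "%.4f\n", ee
        }'
}

# Usage: four bases at Q20 (character "5") sum to 4 x 0.01
# expected_errors "5555"
```

A read whose sum exceeds the --fastq_maxee value (2.0 here) is discarded.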

Finally, the --fastq_truncqual value is dataset-dependent and could be deduced from --fastq_stats:

vsearch \
    --fastq_stats "some input" \
    --log "log file"

At the end of the log file, the "Truncate at first Q" table shows that, for my particular dataset, --fastq_truncqual 5 would leave more than 95% of the reads with a length of at least 148 nucleotides:

Truncate at first Q
  Len     Q=5    Q=10    Q=15    Q=20
-----  ------  ------  ------  ------
  151   68.3%   68.3%   12.2%   12.2%
  150   80.2%   80.2%   16.8%   16.8%
  149   94.3%   94.3%   18.9%   18.9%
  148   95.3%   95.3%   19.3%   19.3%
  147   95.6%   95.6%   19.7%   19.7%
  146   95.7%   95.7%   20.2%   20.2%
  145   95.8%   95.8%   20.6%   20.6%
  144   95.8%   95.8%   20.9%   20.9%
  143   95.9%   95.9%   21.4%   21.4%
  142   95.9%   95.9%   21.8%   21.8%
  141   96.0%   96.0%   22.1%   22.1%
  140   96.1%   96.1%   22.6%   22.6%
  139   96.1%   96.1%   23.0%   23.0%
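
The cutoff can also be read from the log programmatically. This is a minimal sketch, assuming the log contains the table layout shown above (a "Truncate at first Q" line followed by rows sorted by decreasing length, with the Q=5 percentage in the second column). The function name longest_len_at and the 95% threshold are illustrative and not part of vsearch:

```shell
# Print the longest length whose Q=5 column reaches a minimum percentage.
# Assumption: rows are sorted by decreasing length, as in the table above.
longest_len_at() {
    # $1: log file written by --fastq_stats, $2: minimum percentage (e.g. 95)
    awk -v min="$2" '
        /^Truncate at first Q/ { in_table = 1 ; next }
        # Data rows start with an integer length; header rows are skipped.
        in_table && $1 ~ /^[0-9]+$/ {
            pct = $2
            sub(/%/, "", pct)          # strip the percent sign
            if (pct + 0 >= min + 0) { print $1 ; exit }
        }
    ' "$1"
}

# Usage:
# longest_len_at "log file" 95
```

With the table above, a threshold of 95 selects the row for length 148.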

frederic-mahe commented 9 months ago

I've added tests covering this specific usage (see https://github.com/frederic-mahe/vsearch-tests/commit/030951fa24feb895fe741d498ea9060f714f24bd).