torognes / vsearch

Versatile open-source tool for microbiome analysis
Other
671 stars 125 forks source link

from fasta files to an OTU table #536

Closed frederic-mahe closed 1 year ago

frederic-mahe commented 1 year ago

A question asked on vsearch's Google Groups (Inconsistent reads between derep.fa and otutab?) raised the concern that the number of reads might change between dereplication and OTU mapping.

If there is no filtering (removal of singletons, quality filtering, length-based filtering, etc.), then the number of reads in the initial fasta files and in the final OTU table should be exactly the same.

Demonstrating this behavior is not trivial though. So for future reference, here is a toy-example showing the process of producing an OTU table for a project with more than one sample.

In short:

This last step allows to reconstruct an OTU table (or occurrence table) indicating for each cluster its number of occurrence in each sample. Here is the expected OTU table:

#OTU ID sample1 sample2
s1  1   1
s2  0   1
s3  1   0
cd /tmp/

TMP_DIR="$(mktemp --directory)"

(cd "${TMP_DIR}"

 ## create two fasta files
 printf ">s1\nAA\n>s3\nCC\n" > sample1.fasta
 printf ">s1\nAA\n>s2\nGG\n" > sample2.fasta

 ## dereplicate each file, add file name to headers
 for FASTA in sample1.fasta sample2.fasta ; do
     vsearch \
         --derep_fulllength ${FASTA} \
         --minseqlength 2 \
         --sizeout \
         --sample ${FASTA/\.fasta/} \
         --quiet \
         --output ${FASTA/\.fasta/}_derep.fasta
 done

 ## pool samples
 cat sample*_derep.fasta > all.fasta

 ## dereplicate (global)
 vsearch \
     --derep_fulllength all.fasta \
     --minseqlength 2 \
     --sizein \
     --sizeout \
     --relabel_keep \
     --quiet \
     --output all_derep.fasta

 ## clusterize
 vsearch \
     --cluster_size all_derep.fasta \
     --minseqlength 2 \
     --id 0.97 \
     --strand plus \
     --sizein \
     --sizeout \
     --quiet \
     --centroids centroids.fasta

 ## map sequences from pooled samples to clusters
 vsearch \
     --usearch_global all.fasta \
     --db centroids.fasta \
     --minseqlength 2 \
     --id 0.97 \
     --strand plus \
     --sizein \
     --sizeout \
     --qmask none \
     --dbmask none \
     --quiet \
     --otutabout otutab.tsv

 ## check sum of reads
 awk 'NR > 1 {for (i=2 ; i<=NF ; i++) {s += $i}} END {print s}' otutab.tsv
 )

rm -rf "${TMP_DIR}"

The exact way the --otutabout output option parses fasta headers is not completely documented yet. This will be discussed in another issue.

frederic-mahe commented 1 year ago

24 tests covering some of the --otutabout features have been added to the test-suite (https://github.com/frederic-mahe/vsearch-tests/commit/f6577d7ae3962d09e5946568b645f3237166570e)