from fasta files to an OTU table

A question asked on vsearch's Google Groups ("Inconsistent reads between derep.fa and otutab?") raised the concern that the number of reads might change between dereplication and OTU mapping.

If there is no filtering (removal of singletons, quality filtering, length-based filtering, etc.), then the number of reads in the initial fasta files and in the final OTU table should be exactly the same.

Demonstrating this behavior is not trivial, though. For future reference, here is a toy example that shows how to produce an OTU table for a project with more than one sample.

In short:

starting from samples in fasta format,
each sample is dereplicated and the sample name is added to each fasta header,
dereplicated fasta files are pooled (all.fasta),
all.fasta is dereplicated (recommended for clustering),
dereplicated sequences are clustered to identify cluster centroids,
all.fasta entries are mapped to cluster centroids.

This last step makes it possible to build an OTU table (or occurrence table) that indicates, for each cluster, its number of occurrences in each sample. Here is the expected OTU table:

#OTU ID sample1 sample2
s1  1   1
s2  0   1
s3  1   0
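
The claim that no read is lost can be checked on this expected table before running vsearch at all. The sketch below is only an illustration: it rewrites the two toy samples and a hand-written copy of the expected table (tab-separated, an assumption about the layout) into a temporary directory, then compares the number of fasta entries with the sum of the table. The variable names WORK_DIR, INPUT_READS and TABLE_READS are made up for this example. Counting headers is only valid here because each input entry represents exactly one read.

```shell
#!/usr/bin/env bash
# Sketch: compare the number of input reads with the sum of an OTU table.
# The table is written by hand here; it is not produced by vsearch.
WORK_DIR="$(mktemp -d)"

# two toy samples (same content as in the example)
printf ">s1\nAA\n>s3\nCC\n" > "${WORK_DIR}/sample1.fasta"
printf ">s1\nAA\n>s2\nGG\n" > "${WORK_DIR}/sample2.fasta"

# expected OTU table (assumed to be tab-separated)
printf "#OTU ID\tsample1\tsample2\ns1\t1\t1\ns2\t0\t1\ns3\t1\t0\n" \
    > "${WORK_DIR}/otutab.tsv"

# number of reads in the input: one fasta header per read
INPUT_READS=$(cat "${WORK_DIR}"/sample*.fasta | grep -c "^>")

# number of reads in the table: sum of all cells, header line skipped
TABLE_READS=$(awk -F "\t" \
    'NR > 1 {for (i = 2 ; i <= NF ; i++) {s += $i}} END {print s + 0}' \
    "${WORK_DIR}/otutab.tsv")

if [[ "${INPUT_READS}" -eq "${TABLE_READS}" ]] ; then
    echo "OK: ${INPUT_READS} reads in, ${TABLE_READS} reads in the OTU table"
else
    echo "MISMATCH: ${INPUT_READS} reads in, ${TABLE_READS} reads in the OTU table"
fi

rm -rf "${WORK_DIR}"
```

With the toy data, both counts are 4, so the script reports OK.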

cd /tmp/

TMP_DIR="$(mktemp --directory)"

(cd "${TMP_DIR}"

 ## create two fasta files
 printf ">s1\nAA\n>s3\nCC\n" > sample1.fasta
 printf ">s1\nAA\n>s2\nGG\n" > sample2.fasta

 ## dereplicate each file, add file name to headers
 for FASTA in sample1.fasta sample2.fasta ; do
     vsearch \
         --derep_fulllength "${FASTA}" \
         --minseqlength 2 \
         --sizeout \
         --sample "${FASTA%.fasta}" \
         --quiet \
         --output "${FASTA%.fasta}_derep.fasta"
 done

 ## pool samples
 cat sample*_derep.fasta > all.fasta

 ## dereplicate (global)
 vsearch \
     --derep_fulllength all.fasta \
     --minseqlength 2 \
     --sizein \
     --sizeout \
     --relabel_keep \
     --quiet \
     --output all_derep.fasta

 ## cluster
 vsearch \
     --cluster_size all_derep.fasta \
     --minseqlength 2 \
     --id 0.97 \
     --strand plus \
     --sizein \
     --sizeout \
     --quiet \
     --centroids centroids.fasta

 ## map sequences from pooled samples to clusters
 vsearch \
     --usearch_global all.fasta \
     --db centroids.fasta \
     --minseqlength 2 \
     --id 0.97 \
     --strand plus \
     --sizein \
     --sizeout \
     --qmask none \
     --dbmask none \
     --quiet \
     --otutabout otutab.tsv

 ## check sum of reads (should be 4, the number of reads in the two samples)
 awk 'NR > 1 {for (i=2 ; i<=NF ; i++) {s += $i}} END {print s}' otutab.tsv
 )

rm -rf "${TMP_DIR}"
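
To compare read counts at an intermediate step, counting fasta headers is not enough once sequences are dereplicated: each entry then stands for as many reads as its abundance annotation says. The sketch below sums the size=N annotations written by --sizeout. The file content is a hand-written stand-in that is assumed to resemble what all_derep.fasta would contain, and the names DEREP_FASTA and DEREP_READS are made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: count reads in a dereplicated fasta file by summing size=N values.
# The file below is written by hand; it is not actual vsearch output.
DEREP_FASTA="$(mktemp)"
printf ">s1;sample=sample1;size=2\nAA\n" > "${DEREP_FASTA}"
printf ">s3;sample=sample1;size=1\nCC\n" >> "${DEREP_FASTA}"
printf ">s2;sample=sample2;size=1\nGG\n" >> "${DEREP_FASTA}"

# split each header on "size=", keep the number before the next ";"
DEREP_READS=$(awk -F "size=" \
    '/^>/ {split($2, a, ";") ; s += a[1]} END {print s + 0}' \
    "${DEREP_FASTA}")

echo "${DEREP_READS} reads in ${DEREP_FASTA}"

rm -f "${DEREP_FASTA}"
```

For this stand-in file the sum is 4, the same as the number of reads in the two input samples.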

The exact way the --otutabout output option parses fasta headers is not completely documented yet. This will be discussed in another issue.
