A question asked on vsearch's Google Groups (Inconsistent reads between derep.fa and otutab?) raised the concern that the number of reads might change between dereplication and OTU mapping.
If there is no filtering (removal of singletons, quality filtering, length-based filtering, etc.), then the number of reads in the initial fasta files and in the final OTU table should be exactly the same.
Demonstrating this behavior is not trivial though. So for future reference, here is a toy-example showing the process of producing an OTU table for a project with more than one sample.
In short:
starting from samples in fasta format,
each sample is dereplicated and the sample name is added to each fasta header,
dereplicated fasta files are pooled (all.fasta),
all.fasta is dereplicated (recommended for clustering),
clustering to identify cluster centroids,
map all.fasta entries to cluster centroids
This last step allows to reconstruct an OTU table (or occurrence table) indicating for each cluster its number of occurrence in each sample. Here is the expected OTU table:
#OTU ID sample1 sample2
s1 1 1
s2 0 1
s3 1 0
cd /tmp/
TMP_DIR="$(mktemp --directory)"
(cd "${TMP_DIR}"
## create two fasta files
printf ">s1\nAA\n>s3\nCC\n" > sample1.fasta
printf ">s1\nAA\n>s2\nGG\n" > sample2.fasta
## dereplicate each file, add file name to headers
for FASTA in sample1.fasta sample2.fasta ; do
vsearch \
--derep_fulllength ${FASTA} \
--minseqlength 2 \
--sizeout \
--sample ${FASTA/\.fasta/} \
--quiet \
--output ${FASTA/\.fasta/}_derep.fasta
done
## pool samples
cat sample*_derep.fasta > all.fasta
## dereplicate (global)
vsearch \
--derep_fulllength all.fasta \
--minseqlength 2 \
--sizein \
--sizeout \
--relabel_keep \
--quiet \
--output all_derep.fasta
## clusterize
vsearch \
--cluster_size all_derep.fasta \
--minseqlength 2 \
--id 0.97 \
--strand plus \
--sizein \
--sizeout \
--quiet \
--centroids centroids.fasta
## map sequences from pooled samples to clusters
vsearch \
--usearch_global all.fasta \
--db centroids.fasta \
--minseqlength 2 \
--id 0.97 \
--strand plus \
--sizein \
--sizeout \
--qmask none \
--dbmask none \
--quiet \
--otutabout otutab.tsv
## check sum of reads
awk 'NR > 1 {for (i=2 ; i<=NF ; i++) {s += $i}} END {print s}' otutab.tsv
)
rm -rf "${TMP_DIR}"
The exact way the --otutabout output option parses fasta headers is not completely documented yet. This will be discussed in another issue.
A question asked on vsearch's Google Groups (Inconsistent reads between derep.fa and otutab?) raised the concern that the number of reads might change between dereplication and OTU mapping.
If there is no filtering (removal of singletons, quality filtering, length-based filtering, etc.), then the number of reads in the initial fasta files and in the final OTU table should be exactly the same.
Demonstrating this behavior is not trivial though. So for future reference, here is a toy-example showing the process of producing an OTU table for a project with more than one sample.
In short:
all.fasta
),all.fasta
is dereplicated (recommended for clustering),all.fasta
entries to cluster centroidsThis last step allows to reconstruct an OTU table (or occurrence table) indicating for each cluster its number of occurrence in each sample. Here is the expected OTU table:
The exact way the
--otutabout
output option parses fasta headers is not completely documented yet. This will be discussed in another issue.