Closed magicprotoss closed 2 months ago
Currently, the usearch_global
command does not support FASTQ files as input. It is trivial to implement, and we can do that for the next release.
Thanks for the question / suggestion.
If you are going to use the usearch_global
command to map sequences to OTUs and generate an OTU table, you should remember to include --maxhits 1
or --strand plus
as an option to avoid multiple hits (one on each strand) in rare cases.
Thanks for the quick reply and the tip. My main question is if the command supports mapping reads before primer removal and quality filtering to OTU table. Supporting fastq input is not quite as important because performing raw convertion from fastq to fasta is an easy step.
Ok, now I think I understand your question.
I would not recommend running usearch_global
on raw, unprocessed reads. At least not with adapters/primers. It may work if you set the --id
threshold low, like 0.8, but you may risk mapping to the wrong OTUs.
Let's say there is a primer or adapter of length 20bp in one end of the read, with a total read length of 150bp. This part of the read will probably not align to the OTU sequence, and the percentage of aligned nucleotides will fall to 100%*(150-20)/150 = 87%. If you perform usearch_global
with an id
threshold of 0.8, corresponding to 80%, you risk matching an OTU with low similarity, as the heuristic algorithm in vsearch will report a match as soon as it finds an OTU with at least 80% identity.
There is a similar problem with sequencing errors, but they are usually much more rare and does not pose the same problem.
I think I will anyway allow the usearch_global
and search_exact
commands to use FASTQ files as input, as FASTQ sequences are often trimmed to remove adapters or primers.
Maybe others with more experience will have other opinions.
Thanks for the info :)
Assuming you have two reads in fastq format (already merged and trimmed), from two samples S1
and S2
:
@q1;sample=S1
A
+
I
@q2;sample=S2
A
+
I
and a fasta file containing cluster representatives:
>t
A
Starting with vsearch v2.27, you can map the cluster seed t
to your two samples S1
and S2
:
vsearch \
--usearch_global <(printf "@q1;sample=S1\nA\n+\nI\n@q2;sample=S2\nA\n+\nI\n") \
--db <(printf ">t\nA\n") \
--minseqlength 1 \
--id 1.00 \
--quiet \
--otutabout -
#OTU ID S1 S2
t 1 1
tests added to our test-suite (see https://github.com/frederic-mahe/vsearch-tests/commit/e962aec137d878d6f80a49dc8a83f3f083fb288c)
In current version of usearch, there's an -otutab command to directly map reads to otutable. Accroding to the usearch manual, the recommanded input is raw reads after paired end merging and before primer trimming: https://drive5.com/usearch/manual/cmd_otutab.html
In current version of vsearch, the --usearch_global command only support fasta inputs. And from the example pipeline: https://github.com/torognes/vsearch/wiki/Alternative-VSEARCH-pipeline I noticed the primer removal and maxEE filtering have been applied to the input reads.
I wonder if the --usearch_global command in vsearch support searching against the raw reads (mentioned in the first block) directly converted to fasta from fastq, that way more reads could be mapped to the otu table.