torognes / vsearch

Versatile open-source tool for microbiome analysis
Other
643 stars 123 forks source link

Question about the query file of -usearch_global command when creating OTU tables #552

Closed magicprotoss closed 2 months ago

magicprotoss commented 5 months ago

In current version of usearch, there's an -otutab command to directly map reads to otutable. Accroding to the usearch manual, the recommanded input is raw reads after paired end merging and before primer trimming: https://drive5.com/usearch/manual/cmd_otutab.html

In current version of vsearch, the --usearch_global command only support fasta inputs. And from the example pipeline: https://github.com/torognes/vsearch/wiki/Alternative-VSEARCH-pipeline I noticed the primer removal and maxEE filtering have been applied to the input reads.

I wonder if the --usearch_global command in vsearch support searching against the raw reads (mentioned in the first block) directly converted to fasta from fastq, that way more reads could be mapped to the otu table.

torognes commented 5 months ago

Currently, the usearch_global command does not support FASTQ files as input. It is trivial to implement, and we can do that for the next release.

Thanks for the question / suggestion.

If you are going to use the usearch_global command to map sequences to OTUs and generate an OTU table, you should remember to include --maxhits 1 or --strand plus as an option to avoid multiple hits (one on each strand) in rare cases.

magicprotoss commented 5 months ago

Thanks for the quick reply and the tip. My main question is if the command supports mapping reads before primer removal and quality filtering to OTU table. Supporting fastq input is not quite as important because performing raw convertion from fastq to fasta is an easy step.

torognes commented 5 months ago

Ok, now I think I understand your question.

I would not recommend running usearch_global on raw, unprocessed reads. At least not with adapters/primers. It may work if you set the --id threshold low, like 0.8, but you may risk mapping to the wrong OTUs.

Let's say there is a primer or adapter of length 20bp in one end of the read, with a total read length of 150bp. This part of the read will probably not align to the OTU sequence, and the percentage of aligned nucleotides will fall to 100%*(150-20)/150 = 87%. If you perform usearch_global with an id threshold of 0.8, corresponding to 80%, you risk matching an OTU with low similarity, as the heuristic algorithm in vsearch will report a match as soon as it finds an OTU with at least 80% identity.

There is a similar problem with sequencing errors, but they are usually much more rare and does not pose the same problem.

I think I will anyway allow the usearch_global and search_exact commands to use FASTQ files as input, as FASTQ sequences are often trimmed to remove adapters or primers.

Maybe others with more experience will have other opinions.

magicprotoss commented 5 months ago

Thanks for the info :)

frederic-mahe commented 2 months ago

Assuming you have two reads in fastq format (already merged and trimmed), from two samples S1 and S2:

@q1;sample=S1
A
+
I
@q2;sample=S2
A
+
I

and a fasta file containing cluster representatives:

>t
A

Starting with vsearch v2.27, you can map the cluster seed t to your two samples S1 and S2:

vsearch \
    --usearch_global <(printf "@q1;sample=S1\nA\n+\nI\n@q2;sample=S2\nA\n+\nI\n") \
    --db <(printf ">t\nA\n") \
    --minseqlength 1 \
    --id 1.00 \
    --quiet \
    --otutabout -
#OTU ID S1  S2
t   1   1
frederic-mahe commented 2 months ago

tests added to our test-suite (see https://github.com/frederic-mahe/vsearch-tests/commit/e962aec137d878d6f80a49dc8a83f3f083fb288c)