qiime2 / q2-feature-classifier

QIIME 2 plugin supporting taxonomic classification
BSD 3-Clause "New" or "Revised" License

Docs: blastn does search the full database every time #207

Open colinbrislawn opened 1 day ago

colinbrislawn commented 1 day ago

Maximum number of hits to keep for each query. BLAST will choose the first N hits in the reference database that exceed perc-identity similarity to query. NOTE: the database is not sorted by similarity to query, so these are the first N hits that pass the threshold, not necessarily the top N hits. [default: 10]

Note that maxaccepts selects the first N hits with > perc_identity similarity to query, not the top N matches.

I think that's true for usearch / vsearch, but not for BLAST.

colinbrislawn commented 1 day ago

It's hard to share a benchmark in a comment, but:

Using a small max_target_seqs, sweep perc_identity across low values:

blastn -query rep_seq/dna-sequences.fasta -db ../dbs/ncbi_16S_db/16S_ribosomal_RNA \
  -outfmt '6' -max_target_seqs 10 -perc_identity ?? > blast_output_??.txt
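The `??` placeholders stand for each perc_identity value in the sweep. A minimal sketch of that loop, reusing the flags from the command above; the function name `sweep_perc_identity` is hypothetical, and the threshold list in the example call just mirrors the output files listed further down.

```shell
# Hypothetical helper: run the blastn command once per perc_identity value.
# Usage: sweep_perc_identity QUERY DB PID [PID...]
sweep_perc_identity() {
  query=$1
  db=$2
  shift 2
  for pid in "$@"; do
    # Same flags as the command above; only -perc_identity changes.
    blastn -query "$query" -db "$db" \
      -outfmt '6' -max_target_seqs 10 -perc_identity "$pid" \
      > "blast_output_${pid}.txt"
  done
}

# Example call (paths taken from the command above):
# sweep_perc_identity rep_seq/dna-sequences.fasta \
#   ../dbs/ncbi_16S_db/16S_ribosomal_RNA 50 60 70 80 81 82 83
```

Each run writes to its own file, so the outputs can be compared afterwards with md5sum or diff.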

If BLAST kept only the first 10 hits that pass the threshold, we would expect poor results for -perc_identity 50, but the top 10 hits stay the same until the threshold reaches 83%:

md5sum blast_output_*

afd6e2fbe9a1cceed9e55dfaefecce8c  blast_output_50.txt
afd6e2fbe9a1cceed9e55dfaefecce8c  blast_output_60.txt
afd6e2fbe9a1cceed9e55dfaefecce8c  blast_output_70.txt
afd6e2fbe9a1cceed9e55dfaefecce8c  blast_output_80.txt
afd6e2fbe9a1cceed9e55dfaefecce8c  blast_output_81.txt
afd6e2fbe9a1cceed9e55dfaefecce8c  blast_output_82.txt
f143ffcc108ffdb6a8843d52d9f401bc  blast_output_83.txt

And what changes once we request 83% similarity?

git diff --no-index blast_output_50.txt blast_output_83.txt

[image: diff output]

The 83% output is missing just 4 hits, all of which are <83% similar.
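One way to check this observation without reading the diff by eye is to list the rows that appear only in the 50% output and test their identity column. A rough sketch, assuming the default tabular `-outfmt 6` layout, in which the third column is percent identity; the function names `missing_hits` and `missing_above` are hypothetical.

```shell
# Hypothetical helpers for comparing two -outfmt 6 result files.

# Print rows of the low-threshold file that are absent from the high-threshold file.
# Usage: missing_hits LOW_FILE HIGH_FILE
missing_hits() {
  # -F fixed strings, -x whole-line match, -v invert, -f read patterns from file
  grep -Fvxf "$2" "$1"
}

# Count missing rows whose percent identity (column 3) is at or above a cutoff.
# Usage: missing_above LOW_FILE HIGH_FILE CUTOFF
missing_above() {
  missing_hits "$1" "$2" |
    awk -F'\t' -v cutoff="$3" '$3 + 0 >= cutoff + 0 { n++ } END { print n + 0 }'
}

# Example calls:
# missing_hits  blast_output_50.txt blast_output_83.txt
# missing_above blast_output_50.txt blast_output_83.txt 83
```

If the second call prints 0, every hit that disappeared was below the cutoff, which is what a threshold filter applied to an otherwise unchanged hit list would produce.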

colinbrislawn commented 7 hours ago

https://www.ncbi.nlm.nih.gov/books/NBK279684/#_appendices_Outline_of_the_BLAST_process_

C. Loop over every sequence in the database, performing the following actions:

This exhaustive search explains why BLAST has a reputation for being slow compared to usearch's fail-fast heuristics.
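The difference between the two strategies can be illustrated with a toy model. This is not how BLAST or usearch are implemented: it ranks by percent identity alone (BLAST orders hits by score and e-value), and the function names and the four-row example table are made up for illustration. It only shows why a "first N" rule depends on database order, while a "scan everything, then keep the best N" rule returns the same hits until the threshold passes the weakest of them.

```shell
# Toy input: one "subject_id<TAB>percent_identity" row per database entry,
# in database order. Both functions are hypothetical illustrations.

# Fail-fast model: keep the first N rows that pass the threshold, then stop.
# Usage: first_n_hits FILE THRESHOLD N
first_n_hits() {
  awk -F'\t' -v t="$2" -v n="$3" '
    BEGIN { if (n < 1) exit }
    $2 + 0 >= t + 0 { print; if (++k >= n) exit }
  ' "$1"
}

# Exhaustive model: scan every row, then keep the N best by identity.
# Usage: top_n_hits FILE THRESHOLD N
top_n_hits() {
  awk -F'\t' -v t="$2" '$2 + 0 >= t + 0' "$1" |
    sort -t "$(printf '\t')" -k2,2nr |
    head -n "$3"
}

# Example:
# printf 's1\t85\ns2\t99\ns3\t90\ns4\t97\n' > toy_db.tsv
# first_n_hits toy_db.tsv 50 2   # s1 and s2: the first two in database order
# top_n_hits   toy_db.tsv 50 2   # s2 and s4: the two best
# top_n_hits   toy_db.tsv 80 2   # s2 and s4 again: raising the threshold changes nothing
```

In the exhaustive model the output only changes once the threshold exceeds the identity of a hit already in the top N, which matches the stable checksums seen in the sweep.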