Open colinbrislawn opened 1 day ago
It's hard to share a benchmark in a comment, but:
Using a small max_target_seqs, sweep over low values of perc_identity:
blastn -query rep_seq/dna-sequences.fasta -db ../dbs/ncbi_16S_db/16S_ribosomal_RNA \
-outfmt '6' -max_target_seqs 10 -perc_identity ?? > blast_output_??.txt
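To reproduce the sweep, something like the loop below should work. This is only a sketch: the function name `sweep_perc_identity` and the `BLASTN` variable are placeholders invented here, not part of BLAST, and the `blastn` flags are just the ones from the command above.

```shell
# Hypothetical helper: run one blastn search per identity threshold and
# write each result to blast_output_<threshold>.txt in the current directory.
# BLASTN is an assumed override hook so the loop can be dry-run with a stub.
BLASTN="${BLASTN:-blastn}"

sweep_perc_identity() {
    query="$1"
    db="$2"
    shift 2
    # Remaining arguments are the -perc_identity values to try.
    for pid in "$@"; do
        "$BLASTN" -query "$query" -db "$db" \
            -outfmt 6 -max_target_seqs 10 -perc_identity "$pid" \
            > "blast_output_${pid}.txt"
    done
}
```

With the paths from this comment, the call would be `sweep_perc_identity rep_seq/dna-sequences.fasta ../dbs/ncbi_16S_db/16S_ribosomal_RNA 50 60 70 80 81 82 83`, followed by `md5sum blast_output_*` to compare the outputs.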
We would expect poor results for -perc_identity 50, but the top 10 hits are identical for every threshold from 50 through 82 and only change at 83:
md5sum blast_output_*
afd6e2fbe9a1cceed9e55dfaefecce8c blast_output_50.txt
afd6e2fbe9a1cceed9e55dfaefecce8c blast_output_60.txt
afd6e2fbe9a1cceed9e55dfaefecce8c blast_output_70.txt
afd6e2fbe9a1cceed9e55dfaefecce8c blast_output_80.txt
afd6e2fbe9a1cceed9e55dfaefecce8c blast_output_81.txt
afd6e2fbe9a1cceed9e55dfaefecce8c blast_output_82.txt
f143ffcc108ffdb6a8843d52d9f401bc blast_output_83.txt
And what's different once we request >=83% similar?
git diff --no-index blast_output_50.txt blast_output_83.txt
We are missing just 4 hits... all of which are <83% similar.
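A quick way to cross-check this without the diff is to filter the permissive output by percent identity, which is the third column of the default `-outfmt 6` table. The helper name `hits_below_identity` is made up for this sketch.

```shell
# Hypothetical helper: print hits whose percent identity (column 3 of the
# default tabular -outfmt 6 output) is below the given threshold.
# Usage: hits_below_identity <blast_output_file> <threshold>
hits_below_identity() {
    # "+ 0" forces a numeric comparison rather than a string one.
    awk -F '\t' -v min="$2" '$3 + 0 < min + 0' "$1"
}
```

If the observation above holds, `hits_below_identity blast_output_50.txt 83` should print the same lines that the diff reports as missing from blast_output_83.txt.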
https://www.ncbi.nlm.nih.gov/books/NBK279684/#_appendices_Outline_of_the_BLAST_process_
C. Loop over every sequence in the database, performing the following actions:
This exhaustive search explains why blast has a reputation for being slow compared to usearch's fail-fast heuristics.
I think that's the case for usearch / vsearch, but not for blast