Status: Open. jolespin opened this issue 1 year ago.
We usually use BLAST for very strong matches where identity >= 90%. For remote matches we use HMMER rather than BLAST. (In general, if the goal is to find the protein family, BLAST is not the best tool.) If you know of protein families that AMRFinderPlus identifies incorrectly, please let us know.
I think it's quite the opposite. It might be very interesting and advantageous to use Diamond instead of BLAST to significantly speed up the searches for >=90% identity hits using the `--fast` mode. We use Diamond in Bakta for these use cases with great results in terms of runtime.
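To make the suggestion concrete, here is a minimal Python sketch of how such a Diamond call could be assembled. It is not how AMRFinderPlus or Bakta actually invoke the tool. The file names `proteins.faa`, `amr_db.dmnd` and `hits.tsv` and the helper `diamond_blastp_cmd` are hypothetical, and the sketch assumes the database was built beforehand with `diamond makedb`.

```python
import shlex


def diamond_blastp_cmd(query_faa, db_path, out_tsv, min_identity=90, threads=1):
    """Build an argument list for a high-identity `diamond blastp` search.

    Hypothetical helper for illustration only; it just assembles the command
    and does not run it.
    """
    return [
        "diamond", "blastp",
        "--query", query_faa,        # protein FASTA to search with
        "--db", db_path,             # database built with `diamond makedb`
        "--out", out_tsv,            # output file
        "--outfmt", "6",             # BLAST-style tabular output
        "--id", str(min_identity),   # minimum percent identity to report
        "--fast",                    # fast sensitivity mode discussed above
        "--threads", str(threads),
    ]


# Placeholder file names, assumed for this example.
cmd = diamond_blastp_cmd("proteins.faa", "amr_db.dmnd", "hits.tsv", threads=4)
print(shlex.join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` once Diamond is installed. Whether `--fast` is sensitive enough for a given database is something to verify on real data.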
Could you post an example where BLASTP with identity >=90% and Diamond produce different results (alignments)? Can Diamond replace BLASTP, BLASTN, BLASTX and TBLASTN? How much faster is Diamond than BLAST?
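One way to look for such differences is to diff the query/subject pairs that each tool reports at >=90% identity. The sketch below assumes both tools wrote tabular output (format 6) with the default leading columns: query ID, subject ID, percent identity. The helper names and the toy rows are made up for illustration and are not real results.

```python
import csv
import io


def load_hits(tabular_text, min_identity=90.0):
    """Return {(query, subject): best percent identity} for rows >= min_identity."""
    hits = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        pair, pident = (row[0], row[1]), float(row[2])
        if pident >= min_identity:
            hits[pair] = max(pident, hits.get(pair, 0.0))
    return hits


def compare_hits(blast_text, diamond_text, min_identity=90.0):
    """Split high-identity hit pairs into shared, BLAST-only and Diamond-only."""
    blast = load_hits(blast_text, min_identity)
    diamond = load_hits(diamond_text, min_identity)
    return {
        "shared": sorted(blast.keys() & diamond.keys()),
        "blast_only": sorted(blast.keys() - diamond.keys()),
        "diamond_only": sorted(diamond.keys() - blast.keys()),
    }


# Invented toy rows (query, subject, percent identity), not measured output.
blast_example = "q1\tref_A\t99.3\nq2\tref_B\t91.0\nq3\tref_C\t45.0\n"
diamond_example = "q1\tref_A\t99.3\nq3\tref_C\t44.8\n"

result = compare_hits(blast_example, diamond_example)
print(result)
```

This only compares which pairs are reported. Comparing alignment coordinates would need the additional columns of the tabular output.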
I think (based on our results and the Diamond publication) they should produce the same results for these highly similar hits, i.e. >=90% identity. According to the publication (figure above), Diamond blastp is about 2 orders of magnitude faster than blastp in default mode and more than 3 orders of magnitude faster in fast mode, which is suitable for hits with >90% sequence identity. It also provides a blastx mode. As far as I know, blastn/tblastn searches are not possible. May I kindly refer you to https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#sensitivity-modes
In AMRFinderPlus, `tblastn` on a 7 Mbp genome takes 90 sec (using 1 core, 2500 MHz). If that can be made faster, that will be a big improvement.
Doesn't AMRFinderPlus also use `blastp`? I think this is where `diamond` could make a difference.
AMRFinderPlus does use blastp, but the way we use blastp parallelizes better and is faster than the tblastn step, which is currently the slowest step; that's why @vbrover brought it up. The blastp step does take some time, though, so we'll check out your suggestion.
Thanks!
Well, we haven't tried diamond yet, but this suggestion prompted us to spend some time optimizing the blast parameters. There is likely room for further improvement, but while being very conservative and careful not to miss any alignments, we improved the time of combined runs by an average of over 50% in version 3.11.8. Unfortunately, one of the optimizations caused issues with bioconda. Note that performance and optimization are highly dependent on the input sequences.
Would it be possible to include the option to use diamond as an alternative to blast?