ncbi / amr

AMRFinderPlus - Identify AMR genes and point mutations, and virulence and stress resistance genes in assembled bacterial nucleotide and protein sequence.
https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/AMRFinder/

[Feature request] Option to use Diamond instead of Blast #111

Open jolespin opened 1 year ago

jolespin commented 1 year ago

Would it be possible to include the option to use diamond as an alternative to blast?

vbrover commented 1 year ago

We usually use BLAST for very strong matches, where identity >= 90%. For remote matches we do not use BLAST; we use HMMER instead. (Generally, if the goal is to find the protein family, BLAST is not the best tool.) If you know of protein families that are incorrectly identified by AMRFinderPlus, please let us know.

oschwengers commented 1 year ago

I think it's quite the opposite. It might be very interesting and advantageous to use Diamond instead of Blast to significantly speed up the searches for >=90% hits using the --fast mode. We use Diamond in Bakta for these use cases with great results in terms of runtime.

vbrover commented 1 year ago

Could you post an example where BLASTP with identity >=90% and Diamond produce different results (alignments)? Can Diamond replace BLASTP, BLASTN, BLASTX and TBLASTN? How much faster is Diamond than BLAST?

oschwengers commented 1 year ago

I think (based on our results and the Diamond publication) they should produce the same results for these highly similar hits, i.e. >=90% id. According to the publication (figure above), Diamond blastp is ~2 orders of magnitude faster than blastp in default mode and >3 orders of magnitude faster in fast mode, which is suitable for >90% seq id hits. It also provides a blastx mode. As far as I know, blastn/tblastn is not possible. May I kindly refer you to https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#sensitivity-modes
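
A minimal sketch of how the "same results at >=90% id" claim could be checked on one's own data, assuming both tools were run with tabular output (`-outfmt 6` for BLAST+, `--outfmt 6` for Diamond), which by default share the same 12 columns. The function names `best_hits` and `compare` are hypothetical helpers written for this illustration; they are not part of AMRFinderPlus, BLAST+, or Diamond.

```python
# Default 12 columns of tabular format 6, shared by BLAST+ and Diamond.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle, min_identity=90.0):
    """Return {query: (subject, pident)} for the top-bitscore hit per query.

    Hits below min_identity are ignored. Assumes default format-6 columns.
    """
    best = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        rec = dict(zip(COLUMNS, line.split("\t")))
        pident = float(rec["pident"])
        bitscore = float(rec["bitscore"])
        if pident < min_identity:
            continue
        current = best.get(rec["qseqid"])
        if current is None or bitscore > current[2]:
            best[rec["qseqid"]] = (rec["sseqid"], pident, bitscore)
    return {q: (s, p) for q, (s, p, _) in best.items()}


def compare(blast_hits, diamond_hits):
    """Split queries into agreeing, disagreeing, and tool-specific sets."""
    shared = blast_hits.keys() & diamond_hits.keys()
    same = {q for q in shared if blast_hits[q][0] == diamond_hits[q][0]}
    return {
        "same_subject": same,
        "different_subject": shared - same,
        "blast_only": blast_hits.keys() - diamond_hits.keys(),
        "diamond_only": diamond_hits.keys() - blast_hits.keys(),
    }
```

Usage would be along the lines of `compare(best_hits(open("blastp.tsv")), best_hits(open("diamond.tsv")))`, where both file names are placeholders. This only compares the best subject per query, not the alignments themselves, so identical subjects with different alignment coordinates would still count as agreement.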

vbrover commented 1 year ago

In AMRFinderPlus, tblastn on a 7 Mbp genome takes 90 sec (using 1 core, 2500 MHz). If that can be made faster, that will be a big improvement.

oschwengers commented 1 year ago

Doesn't AMRFinderPlus also use blastp? I think this is where diamond could make a difference.

evolarjun commented 1 year ago

AMRFinderPlus does use blastp, but the way we use blastp parallelizes better and is faster than the tblastn step, which is currently the slowest step. That's why @vbrover brought it up. The blastp does take some time though, so we'll check out your suggestion.

Thanks!

evolarjun commented 1 year ago

Well, we haven't yet tried diamond, but this suggestion prompted us to spend some time optimizing the blast parameters. There is likely room for further improvement, but while being very conservative and careful to make sure we won't miss any alignments, we improved the time of combined runs by an average of over 50% in version 3.11.8. Unfortunately, one of the optimizations caused issues with bioconda. Note that performance and optimization are highly dependent on the input sequences.