I just came across these parts of the BLAST documentation and was surprised that we don't use these features yet; they improve BLAST speed and could (potentially) speed up the slowest part of the MG7 pipeline. I suppose @eparejatobes and @rtobes knew about them and there is a reason not to use them. Otherwise, let's discuss it.
"Concatenation of queries"

> BLAST works more efficiently if it scans the database once for multiple queries. This feature is known as concatenation. [...]
>
> BLAST+ applies concatenation on all types of searches (e.g., also BLASTP, etc.), and it can be very beneficial if the input is a large number of queries in FASTA format.
>
> The BLASTN application (starting with the 2.2.28 release) takes advantage of this insight to provide an “adaptive chunk size”. [...]
As I understand it, this doesn't require anything from the user, just a multi-sequence FASTA file as the query. Since we already split the reads files into chunks, we could pass whole chunks (if they are not too big) to BLAST instead of the read-by-read runs we do now.
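To make the difference concrete, here is a rough sketch; the file layout and blastn options are made up and not the actual MG7 invocation:

```bash
# Schematic of what we do now (hypothetical layout): one blastn call per read,
# so the reference database is scanned once per read
for read in chunk-0/read-*.fasta; do
  blastn -db db.rna16s -query "$read" -outfmt 6 >> chunk-0.blast.tsv
done

# With query concatenation: one blastn call per chunk (multi-FASTA query),
# so the database is scanned only once for all reads in the chunk
blastn -db db.rna16s -query chunk-0.fasta -outfmt 6 > chunk-0.blast.tsv
```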
"Megablast indexed searches"

> Indexing provides an alternative way to search for initial matches in nucleotide-nucleotide searches (blastn and megablast) by pre-indexing the N-mer locations in a special data structure, called a database index.
>
> Using an index can improve search times significantly under certain conditions. It is most beneficial when the queries are much shorter than the database and works best for queries under 1 Mbases long. The advantage comes from the fact that the whole database does not have to be scanned during the search.
This index can be generated with the `makembindex` command:
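Something along these lines, I think (a sketch: I haven't run it against our DB, and the only options I'm fairly sure of are `-input` and `-iformat`):

```bash
# Build the megablast database index over the existing BLAST database
# (-iformat blastdb tells makembindex to read a formatted DB rather than FASTA)
makembindex -input db.rna16s -iformat blastdb
```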
We could pre-generate this index for `db.rna16s`. I don't think the increased size of the DB + index files is a concern (the currently used `c3.large` instances have 32GB of storage). Alternatively, if the index has to be specific to each project, we could generate it on each worker instance at initialization (once, before processing any reads).
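On the search side, as far as I understand it is then just a matter of turning the index on when calling blastn (again a sketch; the query path is made up and I haven't checked which BLAST+ version we bundle):

```bash
# Ask blastn to use the pre-built database index for the initial word matches
blastn -db db.rna16s -use_index true -query chunk-0.fasta -outfmt 6 > chunk-0.blast.tsv
```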