I just came across these parts of the BLAST documentation and was surprised that we don't use these features yet; they improve BLAST speed and could (potentially) speed up the slowest part of the MG7 pipeline. I suppose @eparejatobes and @rtobes knew about them and there is a reason not to use them. Otherwise, let's discuss it.
"Concatenation of queries"

> BLAST works more efficiently if it scans the database once for multiple queries. This feature is known as concatenation. [...]
>
> BLAST+ applies concatenation on all types of searches (e.g., also BLASTP, etc.), and it can be very beneficial if the input is a large number of queries in FASTA format.
>
> The BLASTN application (starting with the 2.2.28 release) takes advantage of this insight to provide an “adaptive chunk size”. [...]
As I understand it, this doesn't require anything from the user, just a multi-sequence FASTA file as the query. Since we already split the reads files into chunks, we could pass whole chunks (if they are not too big) to BLAST instead of the read-by-read runs we do now.
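To make the difference concrete, here is a rough sketch; the file layout and blastn options are made up and not the actual MG7 invocation:

```bash
# Schematic of what we do now (hypothetical layout): one blastn call per read,
# so the reference database is scanned once per read
for read in chunk-0/read-*.fasta; do
  blastn -db db.rna16s -query "$read" -outfmt 6 >> chunk-0.blast.tsv
done

# With query concatenation: one blastn call per chunk (multi-FASTA query),
# so the database is scanned only once for all reads in the chunk
blastn -db db.rna16s -query chunk-0.fasta -outfmt 6 > chunk-0.blast.tsv
```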
"Megablast indexed searches"

> Indexing provides an alternative way to search for initial matches in nucleotide-nucleotide searches (blastn and megablast) by pre-indexing the N-mer locations in a special data structure, called a database index.
>
> Using an index can improve search times significantly under certain conditions. It is most beneficial when the queries are much shorter than the database and works best for queries under 1 Mbases long. The advantage comes from the fact that the whole database does not have to be scanned during the search.
This index can be generated with the `makembindex` command:
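Something along these lines, I think (a sketch: I haven't run it against our DB, and the only options I'm fairly sure of are `-input` and `-iformat`):

```bash
# Build the megablast database index over the existing BLAST database
# (-iformat blastdb tells makembindex to read a formatted DB rather than FASTA)
makembindex -input db.rna16s -iformat blastdb
```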
We could pre-generate this index for `db.rna16s`. I don't think the increased size of the DB + index files is a concern (the currently used `c3.large` instances have 32GB of storage). Alternatively, if the index has to be specific to each project, we could generate it on each worker instance at initialization (once, before processing any reads).
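On the search side, as far as I understand it is then just a matter of turning the index on when calling blastn (again a sketch; the query path is made up and I haven't checked which BLAST+ version we bundle):

```bash
# Ask blastn to use the pre-built database index for the initial word matches
blastn -db db.rna16s -use_index true -query chunk-0.fasta -outfmt 6 > chunk-0.blast.tsv
```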