ohnosequences / mg7

Configurable and scalable 16S metagenomics data analysis
https://goo.gl/y3rZFD
GNU Affero General Public License v3.0

Hidden BLAST features #119

Closed laughedelic closed 7 years ago

laughedelic commented 7 years ago

I just came across these parts of the BLAST documentation and was surprised that we don't use these features yet: they improve BLAST speed and could (potentially) speed up the slowest part of the MG7 pipeline. I suppose @eparejatobes and @rtobes knew about them, so there must be a reason not to use them. Otherwise, let's discuss it.

"Concatenation of queries"

As I understand it, this doesn't require anything from the user beyond a set of query sequences. Since we already split the reads files into chunks, we could pass whole chunks (if not too big) to BLAST, instead of running it read by read as we do now.
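To make the idea concrete, here is a minimal sketch of what "one BLAST run per chunk" could look like. It only builds the command line; it does not reflect how MG7 actually invokes BLAST. The function names, the file names, and the choice of `-task megablast` with tabular output (`-outfmt 6`) are assumptions for illustration.

```python
from pathlib import Path
from typing import List


def count_reads(fasta_text: str) -> int:
    """Count FASTA records by counting header lines (those starting with '>')."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


def blastn_chunk_command(chunk_fasta: Path, db: str, out_path: Path) -> List[str]:
    """Build ONE blastn invocation for a whole multi-FASTA chunk.

    Hypothetical helper: the task and output format are assumptions,
    not MG7's actual settings.
    """
    return [
        "blastn",
        "-task", "megablast",        # assumed task
        "-query", str(chunk_fasta),  # multi-FASTA file: all reads of the chunk
        "-db", db,
        "-outfmt", "6",              # tabular output, one line per hit
        "-out", str(out_path),
    ]


if __name__ == "__main__":
    chunk = ">read1\nACGT\n>read2\nTTGA\n"
    print(count_reads(chunk), "reads in one run")
    print(" ".join(blastn_chunk_command(Path("chunk.fasta"), "db.rna16s", Path("chunk.tsv"))))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where BLAST+ is installed.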

"Megablast indexed searches"

This index can be generated using makembindex command:

We could pre-generate this index for db.rna16s. I don't think the increased size of the DB + index files is a concern (the c3.large instances we currently use have 32GB of storage). Alternatively, if this has to be specific to each project, we could generate the index when each worker instance is initialized (once, before processing any reads).
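A rough sketch of the two steps (build the index once, then search with it) might look like the following. The flags are written from memory of the BLAST+ user manual (`-iformat blastdb`, `-old_style_index false`, `-use_index`, `-index_name`) and should be checked against `makembindex -help` and `blastn -help` for the installed version. The helper names and paths are hypothetical.

```python
from typing import List


def makembindex_command(db_path: str) -> List[str]:
    """Command to build a megablast index for an existing BLAST database.

    Assumption: flags as recalled from the BLAST+ manual; verify locally.
    """
    return [
        "makembindex",
        "-input", db_path,
        "-iformat", "blastdb",
        "-old_style_index", "false",
    ]


def indexed_megablast_command(query: str, db_path: str, out: str) -> List[str]:
    """Command for a megablast search that uses the pre-built index.

    Assumption: the index is named after the database it was built from.
    """
    return [
        "blastn",
        "-task", "megablast",
        "-use_index", "true",
        "-index_name", db_path,
        "-db", db_path,
        "-query", query,
        "-outfmt", "6",
        "-out", out,
    ]


if __name__ == "__main__":
    # Run once per worker (or once when preparing the database), then reuse.
    print(" ".join(makembindex_command("db.rna16s")))
    print(" ".join(indexed_megablast_command("chunk.fasta", "db.rna16s", "chunk.tsv")))
```

Whether the index pays off would still have to be measured on real reads.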

eparejatobes commented 7 years ago

@laughedelic if I recall correctly, I already discussed this with you some weeks ago. In summary:

  1. concatenating queries has (or could have; with BLAST it's hard to tell) side effects on the assignments reported, statistics, etc.
  2. megaBLAST indexes actually make most queries run slower in practice :1st_place_medal:
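
One way to check the first point on a sample would be to run the same reads both ways and compare the reported hits. The sketch below assumes both runs used the default 12-column tabular output (`-outfmt 6`: query id, subject id, ..., e-value, bit score); the function names are hypothetical and this is not part of MG7.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# (subject id, e-value, bit score), kept as text to avoid float rounding issues
Hit = Tuple[str, str, str]


def parse_tabular(text: str) -> Dict[str, List[Hit]]:
    """Group hits by query id from default 12-column tabular BLAST output."""
    hits: Dict[str, List[Hit]] = defaultdict(list)
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        hits[cols[0]].append((cols[1], cols[10], cols[11]))
    return dict(hits)


def differing_queries(per_read: Dict[str, List[Hit]],
                      batched: Dict[str, List[Hit]]) -> Set[str]:
    """Return the query ids whose hits differ between the two runs."""
    queries = set(per_read) | set(batched)
    return {
        q for q in queries
        if sorted(per_read.get(q, [])) != sorted(batched.get(q, []))
    }
```

An empty result on a representative sample would suggest that batching is safe for that data; any non-empty result points at the reads to inspect.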
eparejatobes commented 7 years ago

Closing.