taking longer and using more memory than expected

jdmontenegro commented 5 years ago

Dear all,

I am trying to use this tool for mapping RNAseq reads to a plant reference genome. I have 50 libraries with around 2.5Gb of data in each (2 x 100 bp). Most libraries were mapped quite easily using around 120GB of RAM and finished in 1-2 hours with 28 threads. However, a few libraries are problematic. They are either running out of time (>48h) or complaining of "Out of memory" even when they have over 256GB of RAM available.

Using other tools like STAR or TopHat, I didn't see any difference in running time or memory requirements for these libraries. Have you seen this behaviour before in any dataset? Is there any recommendation you could offer? Please see below for details of the command lines used. The reference genome was used to produce a BLAST database with "makeblastdb" and the reads were aligned to that database.

makeblastdb -in ${assembly} -out ${ref} -dbtype nucl

magicblast -query ${r1} -query_mate ${r2} -db ${ref} -infmt fastq -num_threads 28 -out ${out}

Kind regards,

Juan D. Montenegro

boratyng commented 5 years ago

Hi Juan,

Thank you for using Magic-BLAST. This may happen when your reads contain a lot of repeats or you are aligning to a genome with alternative sequences.

Magic-BLAST's threads do more work and take more memory than those of other programs, so usually the best way to reduce memory use is to run fewer threads. You can try 20 or 15.

If you are using version 1.4.0, it may be useful to upgrade to 1.5.0. The new version has better parallelism and filters repeats more aggressively, so it should be faster.
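To confirm which version is installed before deciding whether to upgrade, a quick check along these lines may help. This is a minimal sketch that assumes magicblast is on your PATH and accepts the -version flag common to NCBI command-line tools; it only reports that the tool is missing otherwise.

```shell
#!/bin/sh
# Assumption: magicblast is on PATH and supports the usual NCBI -version flag.
if command -v magicblast >/dev/null 2>&1; then
    version_info=$(magicblast -version 2>&1)
else
    version_info="magicblast not found on PATH"
fi

# Print whatever was found so it can be compared against 1.4.0 / 1.5.0.
echo "$version_info"
```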

You can also use even more aggressive repeat filtering with the -max_db_word_count 10 option. This means that words that appear in the genome more than 10 times will not be used as alignment seeds. (The default is 60 in version 1.4.0 and 30 in 1.5.0.) This will reduce memory use and run time at the cost of a small loss in sensitivity.
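As a rough sketch, the two suggestions (fewer threads and a stricter word-count limit) could be combined in one rerun of the original command. The file names below are placeholders rather than real paths, and 16 threads is just one value in the suggested range. The script always prints the command and runs it only if magicblast and both read files are actually present.

```shell
#!/bin/sh
# Placeholder inputs (hypothetical names); override via the environment.
r1="${r1:-sample_R1.fastq}"
r2="${r2:-sample_R2.fastq}"
ref="${ref:-reference_db}"
out="${out:-sample.sam}"

threads=16          # fewer than the original 28, to lower peak memory
max_word_count=10   # words seen more than 10 times in the genome are not used as seeds

# Note: the command is kept in a plain string, so paths must not contain spaces.
cmd="magicblast -query $r1 -query_mate $r2 -db $ref -infmt fastq -num_threads $threads -max_db_word_count $max_word_count -out $out"

# Always show the command; execute it only when the tool and reads exist.
echo "$cmd"
if command -v magicblast >/dev/null 2>&1 && [ -s "$r1" ] && [ -s "$r2" ]; then
    $cmd
fi
```

If memory is still a problem, lowering `threads` further is the first thing to try, since each thread adds to the memory footprint.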

How large is your genome? Are these Illumina reads?

Please let me know if any of this helps.

Thanks, Greg

jdmontenegro commented 5 years ago

Dear Greg,

Thank you for your email. I am currently using magicblast 1.5.0. This is an allotetraploid plant with a 1.3 Gbp genome. It is not as repetitive as other plant genomes such as wheat, but there is a fair amount of paralogous and orthologous sequence. I will drop the thread count to 16 as recommended and reduce -max_db_word_count to 10 as suggested.

The libraries are Illumina 2x150 RNA-seq directional reads. I will modify the runs and get back to you with some additional results.

Kind regards,

Juan D. Montenegro

On Tue, 24 Sep 2019 at 14:20, Greg Boratyn (notifications@github.com) wrote:


ncbi / magicblast

taking longer and using more memory than expected #12