mtisza1 / Cenote-Taker2

Cenote-Taker2: Discover and Annotate Divergent Viral Contigs (Please use Cenote-Taker 3 instead)
MIT License

Feature request: blastn could use multiple threads #11

Closed DarrenObbard closed 2 years ago

DarrenObbard commented 3 years ago

Hi!

I'm assuming it's better to create new issues rather than mush multiple issues into one? Tell me if you'd prefer I don't open new ones!

blastn (--known_strains blast_knowns) takes an age. Is there a reason to use only one thread, and/or to do one job at a time? From memory, blastn is most efficient at around 4-6 threads.

I guess it may depend on how many sequences there are to search, but if there are just a few then -num_threads 4 might be faster?

Thanks!

Darren

mtisza1 commented 3 years ago

Hi Darren,

Thanks for making the request. In the end, running BLASTN against GenBank's nt database takes an age.

To be clear, Cenote-Taker2 uses xargs to parallelize BLASTN using 1 CPU per sequence, allowing as many simultaneous queries as there are CPUs available. Based on my tests, parallel (instead of xargs) takes the same amount of time.

Nevertheless, I've updated the BLASTN command to run faster, hopefully without missing borderline intraspecific alignments, by adding the arguments -task megablast -evalue 1e-20 -word_size 26.
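For readers who want to see the pattern, here is a rough sketch of one single-threaded BLASTN per file, run through xargs with the arguments listed above. It is not the actual Cenote-Taker2 code: the function name run_blastn_per_file, the BLASTN_BIN and DB variables, and the .blastn.tsv output suffix are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the Cenote-Taker2 implementation.
# Runs one single-threaded megablast per FASTA file, with up to <jobs>
# processes at once. BLASTN_BIN and DB are assumed variable names;
# set BLASTN_BIN=echo to print the commands instead of running them.
run_blastn_per_file() {
    local jobs="$1"; shift
    # One file name per line; xargs -I {} starts one process per line.
    printf '%s\n' "$@" |
        xargs -P "$jobs" -I {} "${BLASTN_BIN:-blastn}" -task megablast \
            -evalue 1e-20 -word_size 26 -num_threads 1 \
            -query {} -db "${DB:-nt}" -outfmt 6 -out {}.blastn.tsv
}

# Example (assumes blastn and an nt database are installed):
#   run_blastn_per_file 30 *.fasta
```

Note that every process started this way opens the database on its own.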

To update:

cd Cenote-Taker2
git pull

DarrenObbard commented 3 years ago

Thanks!

I appreciate this! My concern was that running a separate blast job for each file is the slow point, because the blast database needs to be loaded. That is to say, I was guessing that for 1000 files and 30 CPUs, doing

ls *.fas | xargs -P 30 -I {} blastn -num_threads 1 -task megablast -query {} -db nt -outfmt 6 -out out{}

is slower than

cat *.fas | parallel -k -j 10 --recstart '>' --pipe blastn -num_threads 3 -task megablast -db nt -outfmt 6 | awk '{print>$1}'

because the former needs to load blast and its database 1000 times, but the latter only needs to load it 10 times.

But it is just a guess! I've not tried it.

(The syntax is just a sketch! I've not tried that either and it would need a lot of fixing)
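
One way to test the guess without GNU parallel is to pre-split the concatenated FASTA records into a fixed number of batch files, so that each blastn process (and each database load) serves many queries. The sketch below is untested against a real database; the helper name split_fasta_batches and the batch_N.fasta naming are assumptions, not part of Cenote-Taker2.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read FASTA records on stdin and distribute them
# round-robin into <n_batches> files named <prefix>_0.fasta, <prefix>_1.fasta, ...
split_fasta_batches() {
    local n_batches="$1" prefix="$2"
    awk -v n="$n_batches" -v prefix="$prefix" '
        BEGIN { batch = 0; count = 0 }
        /^>/  { batch = count % n; count++ }      # header line: choose the next batch
        { print > (prefix "_" batch ".fasta") }   # sequence lines stay with their header
    '
}

# Example (assumes blastn and an nt database are installed):
#   cat *.fas | split_fasta_batches 10 batch
#   ls batch_*.fasta | xargs -P 10 -I {} blastn -num_threads 3 -task megablast \
#       -query {} -db nt -outfmt 6 -out {}.tsv
```

With 1000 input files and 10 batches, blastn starts 10 times instead of 1000. Whether that is actually faster would need timing on real data.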