jiarong / VirSorter2

customizable pipeline to identify viral sequences from (meta)genomic data
GNU General Public License v2.0

-j 30 does not match the actual number of working threads. Why? #166

Closed qkqk-hub closed 1 year ago

qkqk-hub commented 1 year ago

Here is my code:

for i in SRR*; do cd $i; virsorter run -w /data_alluser/QK/NewMAGs/PRJDB4176/09_virsorter2/"$i"/ -i ./final.contigs.fa -j 30 --include-groups "dsDNAphage,NCLDV,RNA,ssDNA,lavidaviridae" all; cd ..; done

I set 30 threads with -j 30, so why isn't my CPU usage close to 3000%? It is far lower. Here is a screenshot of my CPU usage: [image]

Thanks!

jiarong commented 1 year ago

Hi, there are two reasons: 1) in my experience, the hmmsearch step can use at most 4 threads; 2) CPU usage per job is limited by the total number of CPUs on your machine and the total number of jobs you are running.
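Point 2 can be checked directly on the machine. The snippet below is a rough sketch, not part of VirSorter2: the variable names are made up for illustration, and `requested=30` simply mirrors the `-j 30` from the question. It caps the requested thread count at the number of logical CPUs the system reports, which gives an upper bound on the CPU percentage a single job could ever show.

```shell
# Logical CPUs visible to this shell; fall back to getconf if nproc is missing.
total_cpus=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)

# Assumption: this mirrors the -j 30 used in the question above.
requested=30

# A job cannot use more CPUs than the machine has, whatever -j says.
usable=$(( requested < total_cpus ? requested : total_cpus ))

echo "logical CPUs: $total_cpus"
echo "threads requested: $requested"
echo "upper bound on CPU usage: $(( usable * 100 ))%"
```

The real figure will usually be lower than this bound, since steps such as hmmsearch do not use every thread they are given and other running jobs share the same CPUs.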

qkqk-hub commented 1 year ago

Ok, thank you for your answer.

jianshu93 commented 9 months ago

Hello All,

Even when I run only one job, hmmsearch uses only 1 thread. I have 24 threads, and hmmsearch is the limiting step. It should be much faster with more threads, since the search is essentially embarrassingly parallel: hmmsearch could start a thread for each protein sequence. Could you please add an option to tell hmmsearch how many threads to use? When I have only one file/genome, using just one thread is too slow.

Thanks, Jianshu

jiarong commented 9 months ago

Hi, you can run virsorter config --set HMMSEARCH_THREADS=4. Note that hmmsearch's multi-threading does NOT work well, though; usually I/O is the bottleneck, not the CPU.
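As a copy-paste form of the setting above, here is a small guarded sketch. Only the `virsorter config --set HMMSEARCH_THREADS=4` call comes from this thread; the `HMM_THREADS` variable and the `command -v` guard are additions for illustration, so that the snippet does nothing harmful when VirSorter2 is not on the PATH.

```shell
# Thread count for the hmmsearch step; 4 is the value suggested in this thread.
HMM_THREADS=4

if command -v virsorter >/dev/null 2>&1; then
    # Set the hmmsearch thread count in the VirSorter2 configuration.
    virsorter config --set HMMSEARCH_THREADS="$HMM_THREADS"
    status="set"
else
    # VirSorter2 is not installed or its environment is not activated.
    echo "virsorter not found on PATH; activate its environment first" >&2
    status="skipped"
fi

echo "HMMSEARCH_THREADS=$HMM_THREADS ($status)"
```

Given the I/O bottleneck mentioned above, raising the value much beyond 4 may not shorten the run; timing a single sample before and after the change is a cheap way to find out.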

jianshu93 commented 9 months ago

Thanks. I assume I need to set the thread count each time I run it? Anyway, problem solved. Many thanks!

Jianshu

jiarong commented 9 months ago

No, the setting should still work next time. It's recorded in the template-config.yaml file.