linnabrown / run_dbcan

Run_dbcan V4: search genomes/metagenomes/proteomes of any assembled organism (prokaryotes, fungi, plants, animals, viruses) for CAZymes.
http://bcb.unl.edu/dbCAN2
GNU General Public License v3.0

dbcan_sub will create tons of subprocesses #117

Open chtsai0105 opened 1 year ago

chtsai0105 commented 1 year ago

Hi, I was running dbcan with --tools hmmer dbcan to run only hmmer and dbcan_sub on our cluster, but found that the job created a large number of subprocesses. I checked the code and found a suspect section:

https://github.com/linnabrown/run_dbcan/blob/f3dd111a9eaf552cdca14f4c8db06baabaa1f2f8/dbcan_cli/run_dbcan.py#L47-L121

In line 62, it calculates the file size in MB, multiplies it by an offset (which is 3), and assigns the result to the variable fsize. So if my uniInput is 43 MB, fsize will be 43 * 3 = 129.
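
Paraphrased (not the exact code; the function name and default value here are illustrative), the chunk-count logic is roughly:

```python
import os

# Rough paraphrase of the fsize logic around line 62 of run_dbcan.py;
# names and defaults are illustrative, not copied verbatim.
def chunk_count(path, offset=3):
    size_mb = int(os.path.getsize(path) / 1024 / 1024)  # file size in MB
    return size_mb * offset  # a 43 MB uniInput gives 43 * 3 = 129 chunks
```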

Then in lines 73-76, it creates 129 temp files (0.txt, 1.txt, ..., 128.txt) and stores the filenames in the variable split_files. However, in lines 89-90 it runs hmmscan on all 129 temp files at once, with 5 CPUs per job. That means it will try to use 129 * 5 = 645 CPUs.

Although the split_uniInput function also takes the parameter dbcan_thread, it is not used to determine how many jobs run in parallel; it is only used to decide whether this multiprocessing code runs at all. https://github.com/linnabrown/run_dbcan/blob/f3dd111a9eaf552cdca14f4c8db06baabaa1f2f8/dbcan_cli/run_dbcan.py#L72
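
What I expected was something like a bounded pool, where dbcan_thread caps the number of concurrent hmmscan jobs. A sketch (the database and file names are illustrative, not the actual paths used by run_dbcan):

```python
import subprocess
from multiprocessing import Pool

def run_hmmscan(chunk):
    # One hmmscan job per chunk; standard HMMER flags, illustrative paths.
    subprocess.run(
        ["hmmscan", "--cpu", "5", "--domtblout", chunk + ".out",
         "dbCAN_sub.hmm", chunk],
        check=True,
    )

def run_all(split_files, dbcan_thread):
    # At most dbcan_thread jobs run at once, instead of one process per chunk.
    with Pool(processes=dbcan_thread) as pool:
        pool.map(run_hmmscan, split_files)
```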

I don't think this is the behavior we expected... or maybe I made a mistake in interpreting the code?

linnabrown commented 1 year ago

I have been busy these two days due to a long trip. I will respond next week.

QiweiGe commented 1 year ago

Hi @chtsai0105, the reason we split the file into parts is that the dbcan_sub database is big; with a 43 MB input file, it would otherwise take days to get the result. In this case, you can change the offset as you need. Thanks.

chtsai0105 commented 1 year ago

Hi - I reviewed the code and made some changes that allow users to run hmmsearch instead of hmmscan. I've sent a pull request; you can see the details there.
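
The core idea, sketched under my assumptions (file names are illustrative): hmmsearch takes the profile database as the query side, so the whole protein file can be searched in a single process without pre-splitting it into chunks:

```python
import subprocess

# hmmsearch takes <hmmfile> <seqdb>, the reverse of hmmscan's
# <hmmdb> <seqfile>; --cpu and --domtblout are standard HMMER options.
subprocess.run(
    ["hmmsearch", "--cpu", "8", "--domtblout", "dbcan_sub.out",
     "dbCAN_sub.hmm", "uniInput"],
    check=True,
)
```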

cmkobel commented 10 months ago

I understand the point of splitting the files, but the problem is that any computer runs inefficiently when more threads are spawned than the hardware can support. I just tried calling CAZymes with dbcan (newest version) on a .faa with 4 million sequences, and I had a very hard time recovering my machine from the ~20 million threads that dbcan had spawned. If dbcan spawns multiple processes, it should never go beyond a set upper thread limit.
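
A minimal sketch of what I mean by an upper limit (names are illustrative, not from the dbcan codebase):

```python
import os
from concurrent.futures import ProcessPoolExecutor

def bounded_workers(requested):
    # Never exceed the number of cores the hardware actually has.
    return max(1, min(requested, os.cpu_count() or 1))

def run_bounded(chunks, worker_fn):
    # e.g. 129 chunks on a 16-core machine -> at most 16 concurrent workers.
    with ProcessPoolExecutor(max_workers=bounded_workers(len(chunks))) as ex:
        list(ex.map(worker_fn, chunks))
```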

Panda-smile commented 6 months ago

I ran into the same problem. How can it be solved?

Panda-smile commented 6 months ago

The program errors out after running for a while. How can this be solved?

linnabrown commented 6 months ago

Did you update the dbcan package? We just updated it yesterday @zhangbenbenchina

Panda-smile commented 6 months ago

Thanks, professor. I will update it later.
