linnabrown / run_dbcan

Run_dbcan V4, using genomes/metagenomes/proteomes of any assembled organisms (prokaryotes, fungi, plants, animals, viruses) to search for CAZymes.
http://bcb.unl.edu/dbCAN2
GNU General Public License v3.0

High Load Issue: dbcan_sub Creating Excessive Threads #151

trx296554555 closed this issue 8 months ago

trx296554555 commented 8 months ago

Thank you for your hard work and recent updates. I would like to draw your attention to an ongoing issue in the latest version, run_dbcan-4.1.1. When processing large input sequence files, dbcan_sub creates an excessive number of threads, resulting in high system load (related: #117).

This issue persists even when --dbcan_thread and --hmm_cpu are specified; neither option effectively limits the number of threads created.

After reviewing run_dbcan.py, I found that the problem lies in the function split_uniInput. That code launches one subprocess per small file produced by splitting the large input file, with no upper limit: https://github.com/linnabrown/run_dbcan/blob/707aed21a0ef455828126f1afb5820963e8274ca/dbcan/cli/run_dbcan.py#L139C1-L157C22

I modified this section for my own use to prevent the excessive load. I implemented a simple thread pool, but I am unsure whether it might affect other parts of the program, so I offer it as a reference only.

from concurrent.futures import ThreadPoolExecutor, as_completed
from subprocess import Popen

def run_command(cmd):
    # Launch one hmmsearch process and block until it exits.
    hmmer = Popen(cmd)
    hmmer.wait()
    return cmd

# Cap the number of hmmsearch processes running at the same time.
max_workers = dbcan_thread
cmds = []
for j in split_files:
    cmds.append(["hmmsearch", "--domtblout", f"{outPath}d{j}", "--cpu", "2", "-o", "/dev/null",
                 f"{dbDir}dbCAN_sub.hmm", f"{outPath}{j}"])

# Each worker thread only waits on its subprocess, so the pool size bounds the process count.
with ThreadPoolExecutor(max_workers=max_workers) as executor:
    futures = [executor.submit(run_command, cmd) for cmd in cmds]
    for future in as_completed(futures):
        try:
            command = future.result()
            print(f"Command: {' '.join(command)} completed.")
        except Exception as e:
            print(f"An error occurred: {e}")
Best, Robin

linnabrown commented 8 months ago

Hi Robin,

Thank you so much for bringing this up. We previously took this approach because hmmscan does not support multithreading, whereas hmmsearch does. We will therefore remove the multiprocessing part and just use hmmsearch's built-in multithreading.
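As a rough sketch of what that could look like (not the actual 4.1.2 code), a single hmmsearch call over the whole input with the thread count passed through `--cpu` might be built like this. The helper names `build_hmmsearch_cmd` and `run_hmmsearch` and all of their parameters are hypothetical; the hmmsearch flags are the same ones used in the snippet above.

```python
import subprocess


def build_hmmsearch_cmd(hmm_db, seq_file, domtbl_out, cpus):
    """Build one hmmsearch command for the whole, unsplit input file.

    All argument names are illustrative assumptions, not run_dbcan's API.
    """
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    return [
        "hmmsearch",
        "--domtblout", domtbl_out,  # per-domain tabular output
        "--cpu", str(cpus),         # let hmmsearch manage its own worker threads
        "-o", "/dev/null",          # discard the human-readable report
        hmm_db,
        seq_file,
    ]


def run_hmmsearch(hmm_db, seq_file, domtbl_out, cpus):
    """Run the command and raise CalledProcessError if hmmsearch fails."""
    cmd = build_hmmsearch_cmd(hmm_db, seq_file, domtbl_out, cpus)
    subprocess.run(cmd, check=True)
    return cmd
```

With this shape there is only one child process, so the load is governed by the single `--cpu` value rather than by the number of split files.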

Let me remove that code and test the change, and then I will release version 4.1.2. Thank you so much!

Best, Le

HaidYi commented 8 months ago

Version 4.1.2 has been released. Problem solved.