steineggerlab / Metabuli

Metabuli: specific and sensitive metagenomic classification via joint analysis of DNA and amino acid.
GNU General Public License v3.0

Performance issue #71

Status: Closed (Maiya19724 closed this issue 3 months ago)

Maiya19724 commented 4 months ago

Thank you for providing such an excellent tool. I have some questions regarding its performance.

My dataset contains 1,005,878 sequences with a total length of 11,798,865,237 bases. I am using the GTDB-r214 database. Despite using all available threads and limiting the RAM to 200 GB, processing is quite slow. CPU usage stays relatively low for most of the runtime; the CPU seems to be fully used only during the "Analyzing matches..." phase.

Are there any methods to improve the processing speed?

jaebeom-kim commented 3 months ago

Sorry for the delayed answer. It seems like your data is a long-read sample, and Metabuli takes more time in the "Analyzing matches..." phase for long-read data. How many threads are you using? I observed that increasing the number of threads decreased CPU usage during the phases that access the hard disk. The "Analyzing matches..." phase doesn't access the hard disk. Let me investigate whether CPU usage can be kept high when using a large number of threads.
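A rough check of the long-read inference, assuming the reported total length is counted in bases: dividing it by the number of sequences gives the mean sequence length, which comes out to roughly 11.7 kb, far longer than typical short reads. The variable names below are only illustrative.

```python
# Figures taken from the original post; the unit (bases) is an assumption.
num_sequences = 1_005_878
total_length = 11_798_865_237

# Mean length per sequence, in the same unit as total_length.
mean_length = total_length / num_sequences

print(f"Mean sequence length: {mean_length:,.0f}")
```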

It would be very helpful if you could provide the printed log (the messages printed along with "Analyzing matches..."). I would also like to know whether the task eventually completed. Thanks a lot!

Maiya19724 commented 3 months ago

The large analysis did not complete successfully in the end. (PS: I am currently unable to provide the log files. If you need them, I can try rerunning the pipeline to generate them.) I then split the data into 24 smaller datasets, each around 5 GB. These ran successfully using 112 threads with the RAM limited to 200 GB. Additionally, I ran Metabuli on the HMP1 data, and the species abundance results were consistent with literature reports. This tool has been very helpful, thank you. However, I encountered some other issues during use:

  1. The program would suddenly crash while running. I was running it in the terminal (not in the background), and I noticed that when it crashed, memory and thread resources were not fully utilized, sometimes not even reaching 60%.

  2. When analyzing certain datasets, it would run successfully with a low RAM setting, but fail with a high RAM setting (which is related to the crash I mentioned above).

Thank you very much for your response. Wishing you all the best! @jaebeom-kim
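The thread does not say how the input was split into smaller datasets. One possible approach, sketched here under the assumption that the input is a plain FASTA file, is to write whole records into numbered part files until a base budget is reached. The function names `read_fasta` and `split_fasta` and the parameter `max_bases` are hypothetical and not part of Metabuli.

```python
from pathlib import Path
from typing import Iterator, List, Tuple


def read_fasta(path: Path) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a plain-text FASTA file."""
    header = None
    chunks: List[str] = []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line, []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_fasta(input_path: Path, out_dir: Path, max_bases: int) -> List[Path]:
    """Split a FASTA file into parts holding at most max_bases bases each.

    Records are never cut; a single record longer than max_bases
    gets a part of its own.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs: List[Path] = []
    out = None
    bases_in_part = 0
    for header, seq in read_fasta(input_path):
        part_is_full = bases_in_part > 0 and bases_in_part + len(seq) > max_bases
        if out is None or part_is_full:
            if out is not None:
                out.close()
            part_path = out_dir / f"part_{len(outputs) + 1:03d}.fasta"
            outputs.append(part_path)
            out = open(part_path, "w")
            bases_in_part = 0
        out.write(f"{header}\n{seq}\n")
        bases_in_part += len(seq)
    if out is not None:
        out.close()
    return outputs
```

Each part file can then be classified as a separate run. The value of `max_bases` would need to be tuned to the available memory and the desired number of parts.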

jaebeom-kim commented 3 months ago

Thank you for the detailed explanation. The latest release fixed an issue related to the --max-ram setting, and I hope this fix also solves your problem. Please try it when you have time :)

Best regards!

Maiya19724 commented 3 months ago

I will give it a try. Thank you for your hard work!