Closed — Maiya19724 closed this issue 3 months ago
Sorry for the delayed answer. It seems your data is a long-read sample, and Metabuli spends more time in the "Analyzing matches..." phase for long-read data. How many threads are you using? I have observed that increasing the number of threads decreases CPU usage during phases that touch the hard disk; however, the "Analyzing matches..." phase doesn't touch the hard disk. Let me explore whether CPU usage can be kept high when a large number of threads is used.
It would be very helpful if you could provide the printed log (the messages printed during "Analyzing matches..."). I would also like to know whether the task completed in the end. Thanks a lot!
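In case it is useful, here is one way to capture that log and watch Metabuli's CPU usage at the same time (a minimal sketch; `metabuli.log` is a placeholder name, and `pidstat` comes from the sysstat package):

```bash
# Re-run with stdout/stderr captured to a file while still shown on screen;
# "..." stands for your usual classify arguments.
metabuli classify ... 2>&1 | tee metabuli.log

# In a second terminal, sample CPU usage of the newest metabuli process
# every 10 seconds to see which phases leave the CPU idle.
pidstat -u -p "$(pgrep -n metabuli)" 10
```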
The large analysis did not run successfully in the end. (PS: I am currently unable to provide the log files; if you need them, I can try rerunning the pipeline to generate them.) I then split the input into 24 smaller datasets of about 5 GB each, and these ran successfully with 112 threads and RAM limited to 200 GB (a sketch of this split-and-run workflow follows the list below). Additionally, I ran Metabuli on the HMP1 data, and the species abundance results were consistent with literature reports. This tool has been very helpful, thank you. However, I encountered some other issues during use:
1. The program would sometimes crash suddenly while running. I was running it in the terminal (not in the background), and I noticed that at the time of the crash, memory and thread resources were not fully utilized, sometimes not even reaching 60%.
2. For certain datasets, the run would succeed with a low RAM setting but fail with a high RAM setting (this is related to the crash mentioned above).
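For context, the split-and-run workflow above looked roughly like this (a sketch rather than my exact commands; it assumes seqkit for record-aware splitting, and the file, database, and output names are placeholders):

```bash
# Split the reads into 24 record-aware parts (about 5 GB each in my case).
seqkit split2 -p 24 -O parts/ reads.fq.gz

# Classify each part; --seq-mode 3 selects Metabuli's long-read mode.
for f in parts/*.fq.gz; do
    id=$(basename "$f" .fq.gz)
    metabuli classify "$f" gtdb_r214/ results/ "$id" \
        --seq-mode 3 --threads 112 --max-ram 200
done
```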
Thank you very much for your response, and I wish you all the best! @jaebeom-kim
Thank you for the detailed explanation.
The latest release fixed an issue related to the --max-ram setting; I hope this fix also resolves your problem. Please give it a try when you can :)
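If you installed through bioconda, updating should be roughly the following (a sketch assuming a conda-based install; for a source build, pull and rebuild the latest tag instead):

```bash
# Pull the latest Metabuli release from the bioconda channel.
conda update -c bioconda metabuli
```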
Best regards!
I will give it a try. Thank you for your hard work!
Thank you for providing such an excellent tool. I have some questions regarding its performance.
My dataset contains 1,005,878 sequences with a total length of 11,798,865,237 bp, and I am using the GTDB-r214 database. Despite utilizing all available threads and limiting the RAM to 200 GB, the processing speed seems quite slow, and CPU usage remains relatively low for most of the runtime. (The CPU appears to be fully used only during the "Analyzing matches..." phase.)
Are there any methods to improve the processing speed?
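For reference, the run described here corresponds roughly to the following invocation (a sketch; the input, database, and output names are placeholders, and the thread count and sequencing mode are illustrative):

```bash
# Long-read classification against GTDB r214 with RAM capped at 200 GB.
metabuli classify reads.fq.gz gtdb_r214/ results/ job1 \
    --seq-mode 3 --threads 112 --max-ram 200
```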