systemsomicslab / MsdialWorkbench

Universal workbench incorporating msdial, msfinder, and mrmprobs
https://systemsomicslab.github.io/compms/msdial/main.html

MsdialConsoleApp - added support for multiple CPU threads #300

Closed ondrej-kuda closed 5 months ago

ondrej-kuda commented 5 months ago

Added support for multiple CPU threads in lcmsdda mode for MsdialConsoleApp, version 4.9.27012024. For example, put number of threads:64 in the parameter file to run processing on 64 threads. For efficiency, and so as not to violate job scheduler rules, the actual number of threads is capped at the number of cores available. I successfully processed both mzML and abf files in positive and negative modes with the Windows and linux-x64 versions.
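
The capping rule described above can be sketched as follows. This is a Python illustration only, not the C# code in this PR: the function names `parse_thread_count` and `effective_threads` are hypothetical, and the simple `key: value` parsing is an assumption about the parameter file rather than MS-DIAL's actual parser.

```python
import os


def parse_thread_count(param_text, default=1):
    """Read the requested thread count from 'key: value' lines.

    Hypothetical helper: assumes one 'number of threads:<n>' entry,
    matched case-insensitively. Returns `default` if the key is absent.
    """
    for line in param_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "number of threads":
            return int(value.strip())
    return default


def effective_threads(requested, available=None):
    """Cap the requested thread count at the number of available cores.

    The result is never below 1, so a zero or negative request
    still yields a usable worker count.
    """
    if available is None:
        # os.cpu_count() may return None on some platforms.
        available = os.cpu_count() or 1
    return max(1, min(requested, available))


requested = parse_thread_count("number of threads:64")
print(effective_threads(requested, available=16))
```

Note that `os.cpu_count()` reports the cores of the whole machine. On a Linux cluster where the scheduler grants only part of a node, `len(os.sched_getaffinity(0))` may be a closer match to the cores the job is allowed to use.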

YukiMatsuzawa commented 5 months ago

Thanks for the great update! I'll merge this PR right away!

kozo2 commented 5 months ago

@ondrej-kuda Thank you very much, Ondrej. @YukiMatsuzawa has already merged your contribution, but I would like to propose two further collaborations to make your PR even better. One is 'adding test code / GitHub Actions', and the other is 'making the LcmsDdaProcess, which you improved, executable from [C#(Polyglot) Notebooks]'. If you have any interest or opinions on these, I would be grateful for a reply. I am hoping we can discuss what kind of data would be best to use for those tasks.

ondrej-kuda commented 5 months ago

I can think about the 'test code' but I have not yet explored the v5 way of processing files with multiple libraries and identities. I am not familiar with [C#(Polyglot) Notebooks]. I tested alternatives to the semaphore: simple async and Parallel.ForEach, and the semaphore was the fastest. However, the GUI implementation is still slightly faster than the console version (considering comparable SPECfp2017). I tried to rewrite the entire pipeline from the GUI v4 to the console version, without the mainWindow stuff, but it was not faster than the simple semaphore. Anyway, the main benefit is the ability to run it on a Linux cluster with dozens of cores, which compensates for this limit well.
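
For readers unfamiliar with the pattern, semaphore-gated processing can be sketched roughly as below. This is a Python illustration of the general technique, not the C# code merged here: `run_bounded`, `worker`, and `max_parallel` are hypothetical names. One task is started per input file, and a semaphore lets at most `max_parallel` of them do work at any moment.

```python
import threading


def run_bounded(items, worker, max_parallel):
    """Run worker(item) for every item, at most max_parallel at a time.

    Results are returned in the same order as the inputs.
    """
    gate = threading.Semaphore(max_parallel)
    results = [None] * len(items)

    def guarded(index, item):
        # Block until one of the max_parallel slots is free,
        # then release the slot automatically when done.
        with gate:
            results[index] = worker(item)

    threads = [
        threading.Thread(target=guarded, args=(i, item))
        for i, item in enumerate(items)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # Stand-in for per-file processing; real work would go here.
    names = run_bounded(["a.mzML", "b.mzML", "c.mzML"], str.lower, 2)
    print(names)
```

The sketch shows only the gating. CPython threads do not run pure-Python CPU-bound work in parallel, so it says nothing about the relative speed of the semaphore, async, and Parallel.ForEach variants compared above.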

kozo2 commented 5 months ago

@ondrej-kuda

> I can think about the 'test code' but I have not yet explored the v5 way of processing files with multiple libraries and identities.

Absolutely. I haven't looked as far as v5 yet either. For now, I would like to consider 'test code' for the v4 MsdialConsoleApp. If you know of a moderately sized dataset suitable for testing multiple CPU threads, please let me know. I would like to use it as input and try running your contribution on GitHub Actions.

> I am not familiar with [C#(Polyglot) Notebooks].

Sorry, I forgot to attach the links for that technology. https://code.visualstudio.com/docs/languages/polyglot is the documentation, and https://github.com/dotnet/csharp-notebooks is the notebook collection. As introduced in https://github.com/dotnet/interactive/blob/main/docs/FAQ.md#can-i-directly-load-a-net-assembly , by running #r "/path/to/assembly.dll" we can import the MS-DIAL DLL.

kozo2 commented 5 months ago

@ondrej-kuda

> I tested alternatives to the semaphore: simple async and Parallel.ForEach, and the semaphore was the fastest. However, the GUI implementation is still slightly faster than the console version (considering comparable SPECfp2017). I tried to rewrite the entire pipeline from the GUI v4 to the console version, without the mainWindow stuff, but it was not faster than the simple semaphore. Anyway, the main benefit is the ability to run it on a Linux cluster with dozens of cores, which compensates for this limit well.

Thank you for the information. I will prioritize checking Linux performance over that of Windows.