flatstik opened this issue 2 years ago
I don't think SLURM itself is relevant to this question. This seems to be more a question of how MOPAC can be used effectively in multi-core and multi-node environments, which are often accessed through workload management systems such as SLURM.
The historical development of MOPAC has not focused much on adapting to computing hardware beyond personal computers. As a result, MOPAC's single-core performance is very good, and the simplest way to maximize computational throughput is to run jobs on 1 core and fill your computing resources with such 1-core jobs. This strategy can run into problems if memory limitations prevent the desired number of simultaneous MOPAC jobs from running. In that case, the subset of dense linear algebra operations that run through BLAS/LAPACK are threaded, so a job with multiple local cores available can relieve the dense linear algebra bottleneck (controlled by OMP_NUM_THREADS and/or MKL_NUM_THREADS, as with other OpenMP-based and/or MKL-based multithreading).
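For illustration, a minimal sketch of driving such a job from Python while controlling the BLAS/LAPACK thread count (the executable name `mopac` and the input file `job.mop` are placeholders, not anything MOPAC itself provides):

```python
import os
import subprocess

def run_mopac(input_file, n_threads=1, mopac_exe="mopac"):
    """Run one MOPAC job with an explicit BLAS/LAPACK thread count.

    n_threads=1 reproduces the plain single-core strategy; larger values
    only speed up the dense linear algebra portion of conventional runs.
    """
    env = os.environ.copy()
    env["OMP_NUM_THREADS"] = str(n_threads)  # OpenMP-based threading
    env["MKL_NUM_THREADS"] = str(n_threads)  # MKL-based threading
    subprocess.run([mopac_exe, input_file], env=env, check=True)

# Example: one conventional MOPAC job using 8 threads for BLAS/LAPACK.
# run_mopac("job.mop", n_threads=8)
```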
MPI-based distributed memory is not presently supported by MOPAC. We are tentatively planning to support partial memory distribution in the future if the eventual dense linear algebra refactoring of MOPAC is successful. Specifically, we plan to isolate and wrap all dense matrix operations and memory locations, which would make it easier to switch between different hardware implementations and memory footprints of the dense linear algebra. Such an abstraction would make it possible to distribute MOPAC's dense matrices, which are usually the leading-order memory bottleneck, using MPI & BLACS, and also to support other memory models such as pinned GPU memory, which is now a standard feature of modern GPGPU programming. There are presently no plans for distributing memory in MOPAC beyond its dense linear algebra.
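As a rough, hypothetical sketch of the kind of abstraction meant here (illustrative Python, not the planned refactoring of MOPAC's Fortran code): solver routines would call a backend interface rather than operating on dense matrices directly, so that shared-memory, distributed, or GPU implementations could be swapped behind it.

```python
import numpy as np  # stand-in for a threaded BLAS/LAPACK library

class DenseBackend:
    """Hypothetical interface wrapping dense matrix storage and operations."""
    def eigh(self, matrix):
        """Return eigenvalues and eigenvectors of a symmetric matrix."""
        raise NotImplementedError

class LocalBackend(DenseBackend):
    """Shared-memory implementation; a BLACS-distributed or GPU-resident
    backend could implement the same interface with a different memory model."""
    def eigh(self, matrix):
        return np.linalg.eigh(matrix)

# Solver code would only ever see the interface, e.g.:
# eigenvalues, eigenvectors = backend.eigh(fock_matrix)
```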
Support for multi-core and multi-node environments would be great. The other programs are way too expensive, and I'd like to see openmopac as an alternative to those pricey closed-source programs. GPU support is also kind of mandatory to have...
xtb is open source and is OpenMP-parallelized.
Sure, but it's not nearly as good as openmopac for various reasons
I appreciate this ongoing discussion, and I think it might lead to other interesting follow-on discussions (e.g. the relative merits/capabilities of MOPAC and xTB), but a GitHub Issue might not be the best place for such open-ended discussions. I recently activated the GitHub Discussions board for MOPAC, and I would encourage conversations to branch there as appropriate.
To follow up on the topic, what are specific examples of multi-core or multi-node functionality in commercial, closed-source software that MOPAC should seek to replicate and/or consider as a point of reference? Do you have EMPIRE in mind, or are there other distributed semiempirical codes, or are you referring more generally to some other non-semiempirical quantum chemistry code?
Not specific examples as such, but DP5, for example, uses Tinker and Gaussian for optimizing ligand geometries. Most of my MOZYME runs take over a month to reach SCF convergence, and a week for large PM7 optimizations -- and I do not want to use PM6 or MNDO or whatever oldies are present in most of the available programs. If a cluster or "supercomputer" is available, wouldn't you be interested in doing your computations in days rather than months?
As a side note: I asked Jimmy (in vain) to implement a function for NMR prediction in a specific solvent (which can be done by optimizing the compound first with EPS). He didn't respond to that at all.
Is this linear algebra parallelization actually useful, or should I simply discard it and run single-threaded? I have compiled mopac with openmp and imkl support. I ran a hydrogen position optimization using mozyme on a large protein with different parameters to see what would happen, separately setting OMP_NUM_THREADS, MKL_NUM_THREADS, and the THREADS keyword to 1, 8, and 16.
In my testing, running single-threaded was faster than running multithreaded every time. I'll point out that the number of processes started was not equal to the number of threads (I guess maybe 16 threads were just too many) and that, out of all of them, only one was actually using the CPU during most of the computation. There was plenty of memory available, so that should not be the bottleneck.
The multi-threading only affects the performance of conventional MOPAC calculations, not MOZYME. You should see performance differences in conventional calculations when you change the number of threads on a multi-core processor, and please report it if you do not. The implementation of MOZYME is purely serial and based on nonstandard sparse linear algebra operations - it does not use any dense linear algebra operations that would benefit from multi-threaded math libraries. The underlying algorithm of MOZYME is conceptually amenable to parallelization because of its spatial localization, but introducing threaded programming into MOPAC itself would be a major undertaking and there is presently no development support that would justify it.
Thank you, I will conduct further testing to see whether working with such large systems is feasible using parallelized conventional MOPAC instead of MOZYME. If that is not possible, I'm afraid I will have to fall back to Gaussian for my calculations.
What range of problem sizes are you working with? Despite being single-core calculations, MOZYME calculations are usually faster than conventional, multi-threaded MOPAC calculations beyond a few hundred atoms. You can still make productive use of multiple cores with MOZYME by running multiple single-core jobs simultaneously.
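To make that last suggestion concrete, here is a minimal sketch (assuming a mopac executable on the PATH and placeholder input file names) of launching several single-core MOZYME jobs at once from Python:

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_single_core(input_file, mopac_exe="mopac"):
    """Launch one MOZYME job pinned to a single BLAS/LAPACK thread."""
    env = os.environ.copy()
    env["OMP_NUM_THREADS"] = "1"
    env["MKL_NUM_THREADS"] = "1"
    subprocess.run([mopac_exe, input_file], env=env, check=True)

# Run as many simultaneous single-core jobs as there are cores
# (the input file names here are placeholders).
inputs = ["region1.mop", "region2.mop", "region3.mop", "region4.mop"]
with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
    list(pool.map(run_single_core, inputs))
```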
In reverting to using Gaussian, are you switching from full-protein simulations to extracting small QM regions? You may want to consider a middle ground where you use MOZYME for large QM regions extracted from the full protein, with the artificially terminated residues clamped by geometric constraints. This would be faster than calculating the full protein, and give you better control over finite-size effects than a much smaller QM region, although sufficiently small QM regions enable the use of higher levels of theory than semiempirical models.
Is anybody working on a SLURM implementation of openmopac for use on clusters (Boost + OpenMPI support)?