Open · wroa opened this issue 1 year ago
Hi Werner,
Thanks for reporting this issue! HPC support is not so easy to test, and you may be the first to actually give srunmpi a try. I do think I remember Piotr saying it was problematic on some machines, but this actually looks like it may be an issue in MUSCLE3 rather than with the machine.
It seems that the QCG-PJ agent is trying to execute taskset -c 0 ..../build/X. This command is what I'd expect to see when starting a single-threaded model, so it may be that it's actually MUSCLE3 that's not passing on the execution model to QCG-PJ correctly.
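For comparison, what I'd expect the agent to run for an MPI model under srunmpi is a launch through srun rather than a single pinned process. This is a rough sketch for illustration only, not the exact command QCG-PJ generates; the task count and the path are placeholders:

    # what the QCG-PJ log shows: a single process pinned to one core
    taskset -c 0 ..../build/X

    # roughly what an MPI launch under the srunmpi execution model should
    # look like: started through Slurm's srun with the requested task count
    srun -n 2 ..../build/X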
A few questions that may help to figure out what's going on:
1) Did you replace the actual path to the executable in the QCG-PJ log with X? (That's fine, just checking.)
2) Was there any useful information in instances/micro/stderr.txt or stdout.txt?
3) Finally, could you try running muscle_manager --log-level=DEBUG --start-all rd_implementations.ymmsl rd_cpp_mpi.ymmsl rd_settings.ymmsl? That should give a lot more output in the manager log describing what MUSCLE3 and QCG-PJ are doing exactly. If you don't want to paste it here, feel free to email it to me directly (l (dot) veen (at) esciencecenter.nl).
4) I guess the openmpi and intelmpi options don't work on this machine, and that's why you're trying srunmpi?
Thanks in advance!
Hi Lourens,
Thank you for looking into the issue. To answer your questions:
1) I was in the middle of debugging an application I have been developing (a version of SCEMa using MUSCLE), so I did not want to complicate the report with unnecessary information and instead kept it general. I had copied the example code into my project structure to get the skeleton working before slowly adding in the additional functionality.
2) The files stderr.txt and stdout.txt located in the macro model directory are both empty.
3) I have sent an email with the compressed output of running with the --log-level=DEBUG flag.
4) The cluster ARCHER2 only has srun; it does not have mpirun or mpiexec. The openmpi mode (I think) uses mpirun under the hood. Similarly, ARCHER2 runs on AMD cores and does not have the Intel compiler suite.
Once again thank you for the help.
I've discovered one issue, which is that --overlap must be passed to srun for it to be willing to start multiple processes on the same cores. Since QCG-PJ starts its agents using srun, those cores are already occupied as far as srun is concerned, and so no instances can be started on them without --overlap, which QCG-PJ doesn't add.
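Concretely, the job step that launches the model needs to be allowed to share the cores already held by the agent's step. A rough illustration only, not the exact command line that gets generated; the task count and the path are placeholders:

    # without --overlap, srun refuses to place these tasks on cores that the
    # QCG-PJ agent's job step already occupies within the allocation
    srun --overlap -n 2 ..../build/X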
I've added a patch to do this from the MUSCLE3 side; it may not solve the problem, but it should at least help. To be released tomorrow unless disaster strikes.
The above fix was released with 0.6.0. Could you perhaps try it out and see if it helps? It's quite possible that there are other issues as well, but you never know :-).
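Trying it should just be a matter of installing the new release into the Python environment and rebuilding the native side against it. A sketch, assuming a pip-based install; the version pin is only there to be explicit:

    # upgrade the Python side (including muscle_manager)
    pip install muscle3==0.6.0

    # the C++ library (libmuscle) is built separately from the source release,
    # so a C++ MPI model also needs to be rebuilt against the matching version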
I just tested out the new version (0.6.0). Unfortunately, I still get an error similar to the one from the previous version. When the micro model is spawned, it crashes, taking the whole simulation with it. Using the C++ example, the output to standard out is as follows:
Due to MODULEPATH changes, the following have been reloaded:
1) cray-mpich/8.1.4
Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/queues.py", line 234, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/usr/lib64/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
TypeError: can't pickle traceback objects
An error occurred during execution, and the simulation was
shut down. The manager log should tell you what happened.
You can find it at
/mnt/lustre/a2fs-work2/work/e723/e723/werner22/muscle3/Example/ymmsl/run_reaction_diffusion_cpp_mpi_20230118_101754/muscle3_manager.log
The output of the .../muscle3_manager.log mentioned above is attached here.
MUSCLE3 version
Release 0.5.0
Expected Behavior
Running a simulation using MPI would complete successfully.
Current Behavior
When running a simulation in the HPC cluster ARCHER2 using the execution model srunmpi, the simulation crashes due to an issue with the QCG-PJM.
Steps to Reproduce
muscle_manager --start-all rd_implementations.ymmsl rd_cpp_mpi.ymmsl rd_settings.ymmsl
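The command is run from inside a Slurm allocation on the cluster; as an illustration only, a minimal batch script along these lines could be used (every directive value below is a placeholder for whatever the machine and project require):

    #!/bin/bash
    #SBATCH --job-name=rd_cpp_mpi
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=2
    #SBATCH --time=00:10:00
    #SBATCH --account=<budget-code>
    #SBATCH --partition=<partition>
    #SBATCH --qos=<qos>

    # run the coupled simulation; MUSCLE3 hands the MPI instance to QCG-PJ,
    # which is expected to start it via srun (the srunmpi execution model)
    muscle_manager --start-all rd_implementations.ymmsl rd_cpp_mpi.ymmsl rd_settings.ymmsl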
Log
Description
I am using the UK national supercomputer ARCHER2. Because of this issue I cannot run any simulation that uses the execution model srunmpi. The issue is generic and affects all users on ARCHER2 who use MPI. It is most likely caused by the integration with the QCG-PJM.