multiscale / muscle3

The third major version of the MUltiScale Coupling Library and Environment

Issue using the MPI execution model #143

Open · wroa opened this issue 1 year ago

wroa commented 1 year ago

MUSCLE3 version

Release 0.5.0

Expected Behavior

Running a simulation using MPI would complete successfully.

Current Behavior

When running a simulation on the HPC cluster ARCHER2 using the execution model srunmpi, the simulation crashes due to an issue with QCG-PJM.

Steps to Reproduce

  1. Compile MUSCLE3
  2. Navigate to the examples directory
  3. Change the execution model to srunmpi in rd_implementations.ymmsl.in (see the snippet after this list)
  4. Try to run the C++ MPI example with the following command: muscle_manager --start-all rd_implementations.ymmsl rd_cpp_mpi.ymmsl rd_settings.ymmsl
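
For reference, the change in step 3 amounts to setting the execution model on the MPI implementation in the implementations section, roughly as below. The implementation name and executable path are placeholders rather than the exact contents of rd_implementations.ymmsl.in:

  implementations:
    reaction_mpi:                      # placeholder name for the MPI component's implementation
      executable: /path/to/reaction_mpi
      execution_model: srunmpi         # changed from the example's default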

Log

% cat .../muscle3_manager.log
muscle_manager 2022-12-09 12:22:48,759 INFO    libmuscle.manager.instance_manager: Planned macro on Resources(nid001779: 0)
muscle_manager 2022-12-09 12:22:48,760 INFO    libmuscle.manager.instance_manager: Planned micro on Resources(nid001779: 0,1)
muscle_manager 2022-12-09 12:22:48,761 INFO    libmuscle.manager.instance_manager: Instantiating macro on Resources(nid001779: 0)
muscle_manager 2022-12-09 12:22:48,762 INFO    libmuscle.manager.instance_manager: Instantiating micro on Resources(nid001779: 0,1)
muscle_manager 2022-12-09 12:22:48,966 ERROR   libmuscle.manager.instance_manager: Instance micro quit with error -1
muscle_manager 2022-12-09 12:22:48,966 ERROR   libmuscle.manager.instance_manager: Output may be found in .../instances/micro
muscle_manager 2022-12-09 12:22:49,267 INFO    libmuscle.manager.instance_manager: Instance macro was shut down by MUSCLE3 because an error occurred elsewhere

% cat .../qcgpj/nl-agent-nid001779.log
2022-12-09 12:22:48,700: agent options: {'binding': True, 'aux_dir': '.../qcgpj', 'log_level': 'info', 'proc_stats': False, 'rt_stats': False, 'rt_wrapper': 'qcg_pj_launch_wrapper'}
2022-12-09 12:22:48,702: gathering process and run-time statistics disabled
2022-12-09 12:22:48,707: agent with id (nid001779) listen at address (tcp://0.0.0.0:12043), export address (tcp://10.253.21.222:12043)
2022-12-09 12:22:48,861: creating process for job 2b2671bbaa784dd79162ad3e3d3cae43 with executable (taskset) and args (['-c', '0', '.../build/X'])
2022-12-09 12:22:49,067: canceling application 2b2671bbaa784dd79162ad3e3d3cae43 ...
2022-12-09 12:22:49,084: process for job 2b2671bbaa784dd79162ad3e3d3cae43 finished with exit code -15
2022-12-09 12:22:49,470: node agent nid001779 exiting
2022-12-09 12:22:49,473: Task was destroyed but it is pending!
task: <Task pending name='Task-4' coro=<Agent._cancel_app() done, defined at .../muscle/lib/python3.8/site-packages/qcg/pilotjob/launcher/agent.py:204> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x2b91ca8614f0>()]>>

Description

I am using the UK National Supercomputer ARCHER2. With this issue I cannot run any simulation that uses the execution model srunmpi. The issue is generic and affects all ARCHER2 users who use MPI. It is most likely caused by the integration with QCG-PJM.

LourensVeen commented 1 year ago

Hi Werner,

Thanks for reporting this issue! HPC support is not so easy to test, and you may be the first to actually give srunmpi a try. I do think I remember Piotr saying it was problematic on some machines, but this actually looks like it may be an issue in MUSCLE3 rather than with the machine.

It seems that the QCG-PJ agent is trying to execute taskset -c 0 ..../build/X. This command would be what I'd expect to see when starting a single-threaded model, so it may be that it's actually MUSCLE3 that's not passing on the execution model to QCG-PJ correctly.
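
For comparison, with srunmpi I'd expect the launched command to be srun-based rather than a plain taskset call, roughly (path elided as in your log, exact options may differ):

  observed:  taskset -c 0 .../build/X    (single process, pinned to core 0)
  expected:  srun ... .../build/X        (MPI launch via srun)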

A few questions that may help to figure out what's going on:

  1. Did you replace the actual path to the executable in the QCG-PJ log with X? (that's fine, just checking)

  2. Was there any useful information in instances/micro/stderr.txt or stdout.txt?

  3. Finally, could you try running muscle_manager --log-level=DEBUG --start-all rd_implementations.ymmsl rd_cpp_mpi.ymmsl rd_settings.ymmsl? That should give a lot more output in the manager log describing what MUSCLE3 and QCG-PJ are doing exactly. If you don't want to paste it here, feel free to email it to me directly (l (dot) veen (at) esciencecenter.nl)

  4. I guess the openmpi and intelmpi options don't work on this machine, and that's why you're trying srunmpi?

Thanks in advance!

wroa commented 1 year ago

Hi Lourens,

Thank you for looking into the issue. To answer your questions:

1) I was in the middle of debugging an application I have been developing (a version of SCEMa using MUSCLE), so I did not want to complicate the report with unnecessary information and instead kept it general. I ended up copying the example code into my project structure to get the skeleton working before slowly adding in the additional functionality.

2) The files stderr.txt and stdout.txt in the macro model directory are both empty.

3) I have sent an email with the compressed output of running with the --log-level=DEBUG flag.

4) The cluster ARCHER2 only has srun; it does not have mpirun or mpiexec. The openmpi mode (I think) uses mpirun under the hood. Similarly, ARCHER2 runs on AMD cores and does not have the Intel compiler suite.

Once again thank you for the help.

LourensVeen commented 1 year ago

I've discovered one issue, which is that --overlap must be passed to srun for it to be willing to start multiple processes on the same cores. Since QCG-PJ starts its agents using srun, those cores are occupied already as far as srun is concerned, and so no instances can be started on them without --overlap, which QCG-PJ doesn't add.

I've added a patch to do this from the MUSCLE3 side; it may not solve the whole problem, but it should help at least. To be released tomorrow unless disaster strikes.
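
To illustrate: when the QCG-PJ agent already occupies the cores of its own srun step, starting an instance on those same cores needs something along these lines (executable path is a placeholder):

  srun --overlap -n 2 /path/to/micro_model

Without --overlap, srun treats those cores as busy and will not start the new step on them.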

LourensVeen commented 1 year ago

The above fix was released with 0.6.0. Could you perhaps try it out and see if it helps? It's quite possible that there are other issues as well, but you never know :-).

wroa commented 1 year ago

I just tested the new version (0.6.0). Unfortunately, I still get an error similar to the one from the previous version. When the micro model is spawned, it crashes, taking the whole simulation down with it. Using the C++ example, the output to standard out is as follows:


Due to MODULEPATH changes, the following have been reloaded:
  1) cray-mpich/8.1.4

Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/queues.py", line 234, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/usr/lib64/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
TypeError: can't pickle traceback objects
An error occurred during execution, and the simulation was
shut down. The manager log should tell you what happened.
You can find it at
/mnt/lustre/a2fs-work2/work/e723/e723/werner22/muscle3/Example/ymmsl/run_reaction_diffusion_cpp_mpi_20230118_101754/muscle3_manager.log

and the contents of .../muscle3_manager.log are attached here.