open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Mystery error on exit #12607

Closed PhilipDeegan closed 3 months ago

PhilipDeegan commented 3 months ago

Hi there,

I'm trying to execute a number of `mpirun -n 2 --bind-to none ...` processes at the same time on a machine with a fairly high core count. I do not believe I'm running more processes than there are cores available, yet in some instances, which I can't reproduce locally, there seems to be an issue on process shutdown.

To me it looks like the process finished and that there is no issue, but an issue is still reported, with a non-zero exit code, which is the annoying part here.

Is there the potential for some issue from running a large number of concurrent, but unrelated mpirun processes?

I'm open to the fact that there might be an issue in the shutdown code we have, but we also do have some MPI barriers in place to encourage process lock-stepping.
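Roughly, the launch pattern looks like the sketch below (simplified; `run_test.sh` is a placeholder for the real test driver, and `NUM_JOBS` stands in for however many jobs the CI run actually starts):

```bash
#!/usr/bin/env bash
# Launch several independent 2-rank jobs concurrently; each job is unrelated
# to the others and should exit 0 on its own.
for i in $(seq 1 "${NUM_JOBS:-8}"); do
    mpirun -n 2 --bind-to none ./run_test.sh "$i" &
done

# Wait for every job; fail the whole run if any mpirun exits non-zero.
status=0
for pid in $(jobs -p); do
    wait "$pid" || status=1
done
exit "$status"
```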

Error log:

```
15:50:32   ----------------------------------------------------------------------
15:50:32   Ran 1 test in 431.337s
15:50:32   
15:50:32   OK
15:50:32   .
15:50:32   ----------------------------------------------------------------------
15:50:32   Ran 1 test in 431.563s
15:50:32   
15:50:32   OK
15:50:32   --------------------------------------------------------------------------
15:50:32   mpirun has exited due to process rank 1 with PID 0 on
15:50:32   node 1010f01d647c exiting improperly. There are three reasons this could occur:
15:50:32   
15:50:32   1. this process did not call "init" before exiting, but others in
15:50:32   the job did. This can cause a job to hang indefinitely while it waits
15:50:32   for all processes to call "init". By rule, if one process calls "init",
15:50:32   then ALL processes must call "init" prior to termination.
15:50:32   
15:50:32   2. this process called "init", but exited without calling "finalize".
15:50:32   By rule, all processes that call "init" MUST call "finalize" prior to
15:50:32   exiting or it will be considered an "abnormal termination"
15:50:32   
15:50:32   3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
15:50:32   orte_create_session_dirs is set to false. In this case, the run-time cannot
15:50:32   detect that the abort call was an abnormal termination. Hence, the only
15:50:32   error message you will receive is this one.
15:50:32   
15:50:32   This may have caused other processes in the application to be
15:50:32   terminated by signals sent by mpirun (as reported here).
15:50:32   
15:50:32   You can avoid this message by specifying -quiet on the mpirun command line.
```
rhc54 commented 3 months ago

I'm unaware of any limitation on the number of concurrent mpiruns, but I don't really understand what you are trying to do. A far cleaner way of doing this would be to start the PRRTE DVM (just prte) and then use prun to launch the individual jobs. That avoids all the overhead of starting the RTE over and over again, and of loading the file system by creating and removing session directories for each of those mpirun instances.
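Something along these lines (a rough sketch, not tested here; exact option names can vary between PRRTE releases, so check `prte --help` on your install):

```bash
# Start the persistent DVM once; --daemonize pushes it into the background.
prte --daemonize

# Submit each job to the already-running DVM instead of spawning a fresh
# runtime (and session directory tree) per job.
prun -n 2 ./your_app      # ./your_app is a placeholder for your executable
prun -n 2 ./another_app   # jobs can run concurrently against the same DVM

# Shut the DVM down when everything is done.
pterm
```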

Setting that aside, all the output is telling you is that one of your processes didn't exit properly - likely failed to call MPI_Finalize before terminating. You'd get a different error message if it had segfault'd, so I suspect that isn't what happened. Probably just something that triggered an error escape in your job.

PhilipDeegan commented 3 months ago

> to start the PRRTE DVM (just prte) and then use prun to launch

I am not experienced in any of this, so it's not something that I know much about, but I can look into it.

> all the output is telling you is that one of your processes didn't exit properly

Yes, I can see that. The issue here is that, from my point of view, it shouldn't be happening, and it only happens sometimes, with no clear indication of what's happening or why.

Anyway, I haven't seen it since upgrading from Fedora 39 to 40, so hopefully it's transient.

hominhquan commented 3 months ago

This may be related to https://github.com/open-mpi/ompi/issues/10117 ?

rhc54 commented 3 months ago

No - totally unrelated unless you see your procs crashing, which isn't what you report. If upgrading Fedora solves the problem, it sounds to me like the issue is something in your integration with the OS. I very much doubt it is something in OMPI causing you to exit improperly - that would almost always show as a segfault.