Closed PhilipDeegan closed 3 months ago
I'm unaware of any limitation on the number of concurrent mpiruns, but I don't really understand what you are trying to do. A far cleaner way of doing this would be to start the PRRTE DVM (just prte) and then use prun to launch the individual jobs. That avoids all the overhead of starting the RTE over and over again, and of loading the file system by creating and removing the session directories for each of those mpirun instances.
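As a sketch of that suggestion (the prte, prun, and pterm commands are from PRRTE; the exact flags and the application names here are assumptions to verify against your installation):

```
# Start the DVM once in the background (daemonize flag assumed available)
prte --daemonize

# Launch each job against the running DVM instead of a fresh mpirun
prun -n 2 --bind-to none ./my_app &
prun -n 2 --bind-to none ./my_other_app &
wait

# Tear the DVM down when all jobs are done
pterm
```

Each prun call reuses the already-running runtime, so there is no per-job RTE startup or session-directory churn.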
Setting that aside, all the output is telling you is that one of your processes didn't exit properly - most likely it failed to call MPI_Finalize before terminating. You'd get a different error message if it had segfaulted, so I suspect that isn't what happened. Probably something just triggered an error escape in your job.
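To illustrate the kind of "error escape" being described, here is a minimal sketch of an MPI program (it requires an MPI installation to build, e.g. with mpicc, and the failure condition is a made-up placeholder):

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int something_went_wrong = 0;  /* placeholder for a real failure check */
    if (something_went_wrong) {
        /* Exiting here without calling MPI_Finalize is the pattern that
           makes mpirun report a process "didn't exit properly" and
           propagate a non-zero exit code, without any segfault. */
        fprintf(stderr, "bailing out early\n");
        exit(1);
    }

    MPI_Finalize();  /* clean shutdown path */
    return 0;
}
```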
> to start the PRRTE DVM (just prte) and then use prun to launch
I am not experienced in any of this, so it's not something I know much about, but I can look into it.
> all the output is telling you is that one of your processes didn't exit properly
Yes, I can see that. The issue is that, from my point of view, it shouldn't be happening, and it only happens sometimes, with no clear indication of what is happening or why.
Anyway, I haven't seen it since upgrading from Fedora 39 to 40, so hopefully it was transient.
This may be related to https://github.com/open-mpi/ompi/issues/10117 ?
No - totally unrelated unless you see your procs are crashing, which isn't what you report. It sounds to me like the issue is something in your integration with the OS if upgrading fedora solves the problem. I very much doubt it is something in OMPI causing you to exit improperly - that would almost always show as a segfault.
Hi there,
I'm trying to execute a number of

mpirun -n 2 --bind-to none ...

processes at the same time on a somewhat high-CPU-count machine. I do not believe I'm running more processes than there are cores available, yet in some instances, which I can't reproduce locally, there seems to be an issue at process shutdown. To me it looks like the process has finished and that there is no problem, yet an issue is still reported, with a non-zero exit code, which is the annoying part here.
Is there potential for issues when running a large number of concurrent, but unrelated, mpirun processes?
I'm open to the possibility that there is an issue in the shutdown code we have, but we also have some MPI barriers in place to encourage the processes to run in lock step.
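The barrier-before-shutdown pattern described above can be sketched as follows (a minimal sketch requiring an MPI installation, not our actual shutdown code):

```c
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* ... application work ... */

    /* Make every rank reach this point before anyone begins shutdown. */
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```

Note that the barrier only synchronizes the ranks; it cannot guarantee a clean exit if a rank fails between the barrier and MPI_Finalize.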
Error log.