Open rajicon opened 3 years ago
@rajicon It would be helpful a) to know your setup (system configuration, network, configure command for OMPI); and b) to have a small reproducer. Can you reproduce the issue by calling MPI_Dist_graph_create_adjacent in a loop, for example? What does the application do leading up to the error?
Thanks for the reply! Our code is quite complex, so I will try to figure out a small reproducer and get back to you. In the meantime, do you have any suggestions of what the internal error could mean? Is there anything to double check first?
Unfortunately, I am neither familiar with the Java interface, nor with the Dist-graph implementation. Maybe someone else can chime in. But since this is a fairly generic error, I'm afraid it's hard to make sense of it without more details.
While I try to find a simple version of the problem, I will explain the issue in more detail. We are working on an agent simulation project, where different CPUs handle different portions of the environment. This is managed by a QuadTree, where the partitions are rebalanced to keep the amount of work per partition similar. After rebalancing, MPI_Dist_graph_create_adjacent is called. Here is the code:
Specifically, the createMPITopo() call on line 581, which calls createDistGraphAdjacent() on line 87.
To replicate this issue, run the dflockers module; the error appears after it has been running for a while.
This is quite involved, so I will try to isolate the issue more, but does this suggest any potential problem?
@rajicon one thing you can do is monitor the memory usage, and check if there is a correlation between the MPI error, and the nodes (or a given Java virtual machine) running out of memory.
You can also try blacklisting the topo/treematch module and see if it helps:
mpirun --mca topo ^treematch ...
Unfortunately, blacklisting topo/treematch did not solve the issue. I have been looking into the problem some more, and it does seem to be a memory issue. I've noticed that the error always occurs after a specific number of rebalancing calls (MPI_Dist_graph_create_adjacent calls). I now suspect that perhaps the old MPI graphs are not getting deleted, but I'm still not sure why. Have you seen anything like this before?
Running out of communicator ids (CIDs) could explain this behavior.
Hi, can you clarify this some more? How can I check CIDs and whether we are running out?
In C, you can use MPI_Comm_c2f(MPI_Comm) in order to get the CID (e.g. an int) of a given communicator. If there is no communicator leak (e.g. MPI_Comm_free() matches communicator creation), the CID of any newly created communicator should remain low during the application lifetime.
Just a quick comment: what @ggouaillardet says is correct, but note that it is a feature of how Open MPI works.
The MPI standard itself does not define a "communicator ID" entity, nor how to portably obtain one. What @ggouaillardet is stating is that Open MPI has a finite number of CIDs (i.e., effectively the number of concurrent communicators that can exist in an Open MPI process). Using MPI_Comm_c2f() will effectively get you the CID because Open MPI's implementation of a Fortran communicator handle is the same thing as Open MPI's CID value.
I just didn't want you to think that this method is guaranteed to work in other MPI implementations.
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v5.0.0a1
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
git clone
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
4a43c39c89037f52b4e25927e58caf08f3707c33 opal/mca/hwloc/hwloc2/hwloc (hwloc-2.1.0rc2-53-g4a43c39c)
8283e81d1c0fd078e0f7fa85a383b633c328254b opal/mca/pmix/pmix4x/openpmix (v1.1.3-2431-g8283e81d)
66c73f74cc4afd4ead5454771d98b5f199b7fe0e prrte (dev-30650-g66c73f74cc)
Please describe the system on which you are running
Details of the problem
I get the following error:
I'm using Java, and this appears after running my program for a while (so it doesn't break on the first call). Do you have any suggestions on where to start looking if an internal error like the above occurs? Any insight or tips would be appreciated!