Closed wenduwan closed 11 months ago
@amirshehataornl Do you have any insight into this?
Hmmm....well, with bind-to none, you cannot get device distances as you are not bound to anything. This is why mpirun didn't provide them. That said, the code in OMPI has an error in it as you cannot free the topology returned by PMIx_Load_topology - you are just being given a pointer to the data stored in PMIx.
I can try to provide an OMPI patch for that problem.
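For readers following along, the ownership rule described above can be sketched in plain C. This is only an illustration under stated assumptions: every `demo_*` name and the `library_owned_tree` variable are hypothetical stand-ins and are not part of the PMIx or OMPI APIs. The point is just that a loader which hands back a pointer into library-owned storage gives the caller a borrowed reference, so the caller reads through it and then drops it without freeing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a topology handle; NOT the PMIx type. */
typedef struct {
    const char *source; /* label describing where the topology came from */
    void *topology;     /* borrowed pointer into library-owned storage */
} demo_topology_t;

/* Storage owned by the "library" for its whole lifetime (assumption). */
static int library_owned_tree = 42;

/* Mimics a loader that returns a pointer to data the library keeps. */
static int demo_load_topology(demo_topology_t *topo)
{
    if (NULL == topo) {
        return -1;
    }
    topo->source = "demo";
    topo->topology = &library_owned_tree;
    return 0;
}

/* Caller side: load, read, then drop the reference without freeing it. */
static int demo_use_topology(void)
{
    demo_topology_t topo = { NULL, NULL };
    int ret = demo_load_topology(&topo);
    if (0 != ret) {
        goto out;
    }

    /* Read through the borrowed pointer. */
    ret = (42 == *(int *) topo.topology) ? 0 : -1;

    /* Forget the pointer. Freeing it would destroy data the library
     * still owns and would break every later caller. */
    topo.topology = NULL;
out:
    return ret;
}

/* Two consecutive uses succeed because nothing was freed in between. */
int demo_borrowed_topology(void)
{
    if (0 != demo_use_topology()) {
        return -1;
    }
    assert(42 == library_owned_tree);
    return demo_use_topology();
}
```

Calling `demo_borrowed_topology()` returns 0 in this sketch because the second load still sees intact library data. Had the caller freed the pointer after the first use, the second use would be reading released memory.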
@rhc54 Thank you!
Meanwhile, the compile error with external hwloc should be addressed separately. Is hwloc 1.11.8 too old to be useful? Or I guess the question is: what is the oldest "supported" hwloc version?
what is the oldest "supported" hwloc version?
I wouldn't know - I believe you folks support back that far, but someone over there would have to answer that question.
Sorry for the late response. I'm currently on the move.
@rhc54
Is this what you were thinking:
diff --git a/opal/mca/common/ofi/common_ofi.c b/opal/mca/common/ofi/common_ofi.c
index e882c3c833..6e03ac1be5 100644
--- a/opal/mca/common/ofi/common_ofi.c
+++ b/opal/mca/common/ofi/common_ofi.c
@@ -484,7 +484,6 @@ static int compute_dev_distances(pmix_device_distance_t **distances,
     }
     /* load the PMIX topology */
-    PMIx_Topology_free(pmix_topo, 1);
     ret = PMIx_Load_topology(pmix_topo);
     if (PMIX_SUCCESS != ret) {
         goto out;
@@ -497,7 +496,6 @@ static int compute_dev_distances(pmix_device_distance_t **distances,
                                  ndist);
     PMIx_Info_free(info, ninfo);
-    PMIx_Topology_free(pmix_topo, 1);
 out:
     return ret;
 }
See https://github.com/open-mpi/ompi/pull/11641 for the full fix.
Thank you for the fix, Ralph!
@amirshehataornl Based on VERSION we still support hwloc>=1.11.0, but I've confirmed I get the compile error @wenduwan showed above (about io_first_child) when I try to compile using hwloc 1.11.0.
Do you know of an alternate way to do that loop?
Issue fixed. Closing.
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
https://github.com/open-mpi/ompi/commit/42e577f1d7e39207359146594b37264a5a7a5709
We confirmed that the issue is related to this change.
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
On main branch
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
Please describe the system on which you are running
Internal and external hwloc
Details of the problem
Problem 1: Compilation error with external hwloc
Problem 2: Segfault with internal libevent & hwloc with OSU microbenchmark
In this case we did
./configure ... --with-libevent=internal --with-hwloc=internal ...
Then we ran omb.
Note: omb was built against ompi, and the hostfile has 2 p4d.24xlarge instances.
The segfault happens here (paths redacted for conciseness):