rhc54 opened this issue 1 year ago
Actually, I now see the same errors reported from IBM - same application - so I'm editing the title to reflect it.
I was able to find the source of the message - it is in OMPI itself, actually here, line 175. I then created a PMIx-based code for testing it and found that it worked just fine:
```
$ prterun --map-by ppr:2:node ./nodeid | sort
[prterun-rhc-node01-55007@1:0] Peer 0 is running on node 0
[prterun-rhc-node01-55007@1:0] Peer 1 is running on node 1
[prterun-rhc-node01-55007@1:0] Peer 2 is running on node 2
[prterun-rhc-node01-55007@1:0] Peer 3 is running on node 3
[prterun-rhc-node01-55007@1:0] Peer 4 is running on node 4
[prterun-rhc-node01-55007@1:0] Peer 5 is running on node 0
[prterun-rhc-node01-55007@1:0] Peer 6 is running on node 1
[prterun-rhc-node01-55007@1:0] Peer 7 is running on node 2
[prterun-rhc-node01-55007@1:0] Peer 8 is running on node 3
[prterun-rhc-node01-55007@1:0] Peer 9 is running on node 4
[prterun-rhc-node01-55007@1:0]: Successfully retrieved all nodeids
[prterun-rhc-node01-55007@1:5]: Successfully retrieved all nodeids
[prterun-rhc-node01-55007@1:7]: Successfully retrieved all nodeids
[prterun-rhc-node01-55007@1:6]: Successfully retrieved all nodeids
[prterun-rhc-node01-55007@1:1]: Successfully retrieved all nodeids
[prterun-rhc-node01-55007@1:2]: Successfully retrieved all nodeids
[prterun-rhc-node01-55007@1:8]: Successfully retrieved all nodeids
[prterun-rhc-node01-55007@1:9]: Successfully retrieved all nodeids
[prterun-rhc-node01-55007@1:4]: Successfully retrieved all nodeids
[prterun-rhc-node01-55007@1:3]: Successfully retrieved all nodeids
```
The code is very simple:
```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#include <pmix.h>

static pmix_proc_t myproc;

int main(int argc, char **argv)
{
    pmix_status_t rc;
    pid_t pid;
    char hostname[1024];
    pmix_value_t *val;
    uint32_t jobsize, nodeid, n;
    pmix_proc_t proc, wildcard;

    pid = getpid();
    gethostname(hostname, sizeof(hostname));

    /* init us - note that the call to "init" includes the return of
     * any job-related info provided by the RM */
    if (PMIX_SUCCESS != (rc = PMIx_Init(&myproc, NULL, 0))) {
        fprintf(stderr, "[%s:%lu] PMIx_Init failed: %s\n",
                hostname, (unsigned long)pid, PMIx_Error_string(rc));
        exit(1);
    }
    PMIX_LOAD_PROCID(&wildcard, myproc.nspace, PMIX_RANK_WILDCARD);

    /* get our job size */
    if (PMIX_SUCCESS != (rc = PMIx_Get(&wildcard, PMIX_JOB_SIZE, NULL, 0, &val))) {
        fprintf(stderr, "[%s:%u] PMIx_Get job size failed: %s\n",
                myproc.nspace, myproc.rank, PMIx_Error_string(rc));
        goto done;
    }
    PMIX_VALUE_GET_NUMBER(rc, val, jobsize, uint32_t);
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "[%s:%u] Got bad job size: %s\n",
                myproc.nspace, myproc.rank, PMIx_Error_string(rc));
        goto done;
    }
    PMIX_VALUE_RELEASE(val);

    /* get the nodeid of all our peers */
    PMIX_LOAD_NSPACE(proc.nspace, myproc.nspace);
    for (n = 0; n < jobsize; n++) {
        proc.rank = n;
        rc = PMIx_Get(&proc, PMIX_NODEID, NULL, 0, &val);
        if (PMIX_SUCCESS != rc) {
            fprintf(stderr, "[%s:%u] PMIx_Get failed for nodeid on rank %u: %s\n",
                    myproc.nspace, myproc.rank, n, PMIx_Error_string(rc));
            goto done;  /* don't fall through to the success message */
        }
        PMIX_VALUE_GET_NUMBER(rc, val, nodeid, uint32_t);
        if (PMIX_SUCCESS != rc) {
            fprintf(stderr, "[%s:%u] Got bad nodeid for rank %u: %s\n",
                    myproc.nspace, myproc.rank, n, PMIx_Error_string(rc));
            goto done;
        }
        if (0 == myproc.rank) {
            fprintf(stderr, "[%s:%u] Peer %u is running on node %u\n",
                    myproc.nspace, myproc.rank, n, nodeid);
        }
        PMIX_VALUE_RELEASE(val);
    }
    fprintf(stderr, "[%s:%u]: Successfully retrieved all nodeids\n",
            myproc.nspace, myproc.rank);

done:
    /* finalize us */
    if (PMIX_SUCCESS != (rc = PMIx_Finalize(NULL, 0))) {
        fprintf(stderr, "Client ns %s rank %u: PMIx_Finalize failed: %s\n",
                myproc.nspace, myproc.rank, PMIx_Error_string(rc));
    }
    fflush(stderr);
    return 0;
}
```
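For anyone wanting to reproduce this locally, the test can be built and launched roughly as follows (the source filename and the assumption that PMIx is installed in the default search paths are mine, not from the report):

```shell
# Build against an installed PMIx (adjust -I/-L flags for your install)
cc -o nodeid nodeid.c -lpmix

# Launch 2 procs per node, as in the run shown above
prterun --map-by ppr:2:node ./nodeid | sort
```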
What I cannot tell is whether OMPI is passing the correct info in the proc name - based on this test, my suspicion is "no". Can someone please look into the OMPI code?
Checked MTT history. The test has been passing.
The following failure is being reported from both OMPI v5.0 and main branches:
PRRTE is providing the nodeid for every proc in the job as part of the initial job info - it is therefore not included in the modex. However, I cannot find the location where this error message is emitted, and so I don't know the precise function call that generated it.
Could someone please provide me with further info as to how this error is generated?
The command executed is:

```
mpirun -n 144 topology/distgraph1
```

if that helps (remember, I do not have access to the ompi-tests repository).