open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

MTT failure on IBM #10969

rhc54 opened this issue 1 year ago

rhc54 commented 1 year ago

The following failure is being reported from both OMPI v5.0 and main branches:

[ip-172-31-10-245:11280] Unable to extract peer [[26806,1],132] nodeid from the modex.
[ip-172-31-10-245:11280] Unable to extract peer [[26806,1],133] nodeid from the modex.
[ip-172-31-10-245:11280] Unable to extract peer [[26806,1],134] nodeid from the modex.
[ip-172-31-10-245:11280] Unable to extract peer [[26806,1],135] nodeid from the modex.
[ip-172-31-10-245:11280] Unable to extract peer [[26806,1],136] nodeid from the modex.
[ip-172-31-10-245:11280] Unable to extract peer [[26806,1],137] nodeid from the modex.
....

PRRTE is providing the nodeid for every proc in the job as part of the initial job info - it is therefore not included in the modex. However, I cannot find the location where this error message is emitted, and so I don't know the precise function call that generated it.

Could someone please provide me with further info as to how this error is generated?

If it helps, the command executed is mpirun -n 144 topology/distgraph1 (remember, I do not have access to the ompi-tests repository).
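
For context, the lookup that fails here is presumably (in PMIx terms) just a PMIx_Get of the PMIX_NODEID key for the named peer; since PRRTE registers that key with the initial job info, the Get should be answered locally without any modex traffic. A minimal sketch of that pattern follows - this is not the actual OMPI call site, and it assumes a client that has already called PMIx_Init to fill in myproc:

/* Sketch only - assumes PMIx_Init() has already populated "myproc".
 * If the key was delivered with the job info, this Get is answered
 * locally; a non-success return here is what would surface as an
 * "unable to extract peer ... nodeid" style error. */
pmix_proc_t peer;
pmix_value_t *val = NULL;
uint32_t nodeid;
pmix_status_t rc;

PMIX_LOAD_PROCID(&peer, myproc.nspace, 132);  /* e.g., rank 132 from the log above */
rc = PMIx_Get(&peer, PMIX_NODEID, NULL, 0, &val);
if (PMIX_SUCCESS != rc) {
    fprintf(stderr, "PMIx_Get(PMIX_NODEID) for rank %u failed: %s\n",
            peer.rank, PMIx_Error_string(rc));
} else {
    PMIX_VALUE_GET_NUMBER(rc, val, nodeid, uint32_t);
    PMIX_VALUE_RELEASE(val);
}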

rhc54 commented 1 year ago

Actually, I now see the same errors reported from IBM (same application), so I'm editing the title to reflect that.

rhc54 commented 1 year ago

I was able to find the source of the message: it is in OMPI itself, here, at line 175. I then created a PMIx-based test program and found that it worked just fine:

$ prterun --map-by ppr:2:node ./nodeid | sort
[prterun-rhc-node01-55007@1:0] Peer 0 is running on node 0
[prterun-rhc-node01-55007@1:0] Peer 1 is running on node 1
[prterun-rhc-node01-55007@1:0] Peer 2 is running on node 2
[prterun-rhc-node01-55007@1:0] Peer 3 is running on node 3
[prterun-rhc-node01-55007@1:0] Peer 4 is running on node 4
[prterun-rhc-node01-55007@1:0] Peer 5 is running on node 0
[prterun-rhc-node01-55007@1:0] Peer 6 is running on node 1
[prterun-rhc-node01-55007@1:0] Peer 7 is running on node 2
[prterun-rhc-node01-55007@1:0] Peer 8 is running on node 3
[prterun-rhc-node01-55007@1:0] Peer 9 is running on node 4
[prterun-rhc-node01-55007@1:0]: Successfully retrieved all nodeids
[prterun-rhc-node01-55007@1:5]: Successfully retrieved all nodeids
[prterun-rhc-node01-55007@1:7]: Successfully retrieved all nodeids
[prterun-rhc-node01-55007@1:6]: Successfully retrieved all nodeids
[prterun-rhc-node01-55007@1:1]: Successfully retrieved all nodeids
[prterun-rhc-node01-55007@1:2]: Successfully retrieved all nodeids
[prterun-rhc-node01-55007@1:8]: Successfully retrieved all nodeids
[prterun-rhc-node01-55007@1:9]: Successfully retrieved all nodeids
[prterun-rhc-node01-55007@1:4]: Successfully retrieved all nodeids
[prterun-rhc-node01-55007@1:3]: Successfully retrieved all nodeids

The code is very simple:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#include <pmix.h>

static pmix_proc_t myproc;

int main(int argc, char **argv)
{
    pmix_status_t rc;
    pid_t pid;
    char hostname[1024];
    pmix_value_t *val;
    uint32_t jobsize, nodeid;
    size_t n;
    pmix_proc_t proc, wildcard;

    pid = getpid();
    gethostname(hostname, 1024);

    /* init us - note that the call to "init" includes the return of
     * any job-related info provided by the RM */
    if (PMIX_SUCCESS != (rc = PMIx_Init(&myproc, NULL, 0))) {
        fprintf(stderr, "[%s:%lu] PMIx_Init failed: %s\n",
                hostname, (unsigned long)pid, PMIx_Error_string(rc));
        exit(0);
    }
    PMIX_LOAD_PROCID(&wildcard, myproc.nspace, PMIX_RANK_WILDCARD);

    /* get our job size */
    if (PMIX_SUCCESS != (rc = PMIx_Get(&wildcard, PMIX_JOB_SIZE, NULL, 0, &val))) {
        fprintf(stderr, "[%s:%u] PMIx_Get job size failed: %s\n", myproc.nspace,
                myproc.rank, PMIx_Error_string(rc));
        goto done;
    }
    PMIX_VALUE_GET_NUMBER(rc, val, jobsize, uint32_t);
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "[%s:%u] Got bad job size: %s\n",
                myproc.nspace, myproc.rank, PMIx_Error_string(rc));
        goto done;
    }
    PMIX_VALUE_RELEASE(val);

    /* get the nodeid of all our peers */
    PMIX_LOAD_NSPACE(proc.nspace, myproc.nspace);
    for (n=0; n < jobsize; n++) {
        proc.rank = n;
        rc = PMIx_Get(&proc, PMIX_NODEID, NULL, 0, &val);
        if (PMIX_SUCCESS != rc) {
            fprintf(stderr, "[%s:%u] PMIx_Get failed for nodeid on rank %u: %s\n",
                    myproc.nspace, myproc.rank, n, PMIx_Error_string(rc));
            goto done;  /* don't report success if any lookup failed */
        }
        PMIX_VALUE_GET_NUMBER(rc, val, nodeid, uint32_t);
        if (PMIX_SUCCESS != rc) {
            fprintf(stderr, "[%s:%u] Got bad nodeid for rank %u: %s\n",
                    myproc.nspace, myproc.rank, n, PMIx_Error_string(rc));
            goto done;
        }
        if (0 == myproc.rank) {
            fprintf(stderr, "[%s:%u] Peer %u is running on node %u\n",
                    myproc.nspace, myproc.rank, n, nodeid);
        }
        PMIX_VALUE_RELEASE(val);
    }

    fprintf(stderr, "[%s:%u]: Successfully retrieved all nodeids\n",
            myproc.nspace, myproc.rank);

done:
    /* finalize us */
    if (PMIX_SUCCESS != (rc = PMIx_Finalize(NULL, 0))) {
        fprintf(stderr, "Client ns %s rank %d:PMIx_Finalize failed: %s\n", myproc.nspace,
                myproc.rank, PMIx_Error_string(rc));
    }
    fflush(stderr);
    return (0);
}
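
For anyone wanting to reproduce this outside of MTT: the client above only needs a PMIx installation to build. Something along these lines should work (the install prefix below is illustrative):

$ cc -o nodeid nodeid.c -I<pmix-prefix>/include -L<pmix-prefix>/lib -lpmix
$ prterun --map-by ppr:2:node ./nodeid | sort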

What I cannot tell is whether OMPI is passing the correct info in the proc name; based on this test, my suspicion is that it is not. Can someone please look into the OMPI code?
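
To make that suspicion concrete: the pmix_proc_t handed to the lookup must carry the nspace string and rank exactly as PRRTE registered them, or the Get cannot find the data even though the nodeid was delivered with the job info. A hedged sketch of the kind of sanity check one could drop into the OMPI call site - the variable names are illustrative, not actual OMPI symbols:

/* Illustrative only - "peer" stands for whatever pmix_proc_t OMPI builds
 * from its internal process name before asking for PMIX_NODEID. In this
 * test all peers are in our own job, so the nspace should match ours. */
if (!PMIX_CHECK_NSPACE(peer.nspace, myproc.nspace) ||
    PMIX_RANK_INVALID == peer.rank) {
    fprintf(stderr, "Suspicious peer name %s:%u (my nspace is %s)\n",
            peer.nspace, peer.rank, myproc.nspace);
}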

wenduwan commented 4 months ago

Checked the MTT history; the test has been passing.