pmodels / mpich

Official MPICH Repository
http://www.mpich.org

cache_put_flush (../../../../src/pm/hydra/proxy/pmip_pmi.c:183): assert (s) failed #7195

Open jeffhammond opened 3 weeks ago

jeffhammond commented 3 weeks ago

I am using commit 0b1a4ba6995c28e1c9f797b585df1f83bc56b2b6 and see this error with every MPI test, but not when launching hostname:

[proxy:0@host] cache_put_flush (../../../../src/pm/hydra/proxy/pmip_pmi.c:183): assert (s) failed

There is no impact on MPI test behavior that I can see.

I built MPICH with NVHPC 24.9 compilers and UCX.

mpichversion
MPICH Version:      4.3.0a1
MPICH Release date: unreleased development copy
MPICH ABI:          0:0:0
MPICH Device:       ch4:ucx
MPICH configure:    CC=nvc CXX=nvc++ FC=nvfortran --with-device=ch4:ucx --prefix=~/MPI/mpich-nvhpc-ch4-ucx-install
MPICH CC:           nvc     --diag_suppress=branch_past_initialization -O2
MPICH CXX:          nvc++
MPICH F77:          nvfortran
MPICH FC:           nvfortran
MPICH features:     threadcomm

I ran rm -rf /tmp/* just in case there were some remnants there, but saw no change.

jeffhammond commented 3 weeks ago

For example, cpi shows the assertion failures but still succeeds.

~/mpich-nvhpc-ch4-ucx-install/bin/mpirun -n 4 ./cpi
[proxy:0@oppenheimer] cache_put_flush (../../../../src/pm/hydra/proxy/pmip_pmi.c:183): assert (s) failed
Process 0 of 4 is on oppenheimer
Process 1 of 4 is on oppenheimer
Process 3 of 4 is on oppenheimer
Process 2 of 4 is on oppenheimer
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.000027
[proxy:0@oppenheimer] cache_put_flush (../../../../src/pm/hydra/proxy/pmip_pmi.c:183): assert (s) failed

jeffhammond commented 2 weeks ago

This is the result with printf of the two arguments to the assert that fails...

~/MPI/mpich/build$ ~/MPI/mpich-nvhpc-ch4-ucx-install/bin/mpirun -n 4 ./examples/cpi
[proxy:0@oppenheimer] cache_put_flush (../../../../src/pm/hydra/proxy/pmip_pmi.c:185): assert (s) failed
Process 0 of 4 is on oppenheimer
Process 1 of 4 is on oppenheimer
Process 2 of 4 is on oppenheimer
Process 3 of 4 is on oppenheimer
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.000027
[proxy:0@oppenheimer] cache_put_flush (../../../../src/pm/hydra/proxy/pmip_pmi.c:185): assert (s) failed
s=�_�
status=0
s=(null)
status=0
s=(null)
status=0

jeffhammond commented 2 weeks ago

This is the result with printf of the two arguments to the assert that fails...

~/MPI/mpich/build$ ~/MPI/mpich-nvhpc-ch4-ucx-install/bin/mpirun -n 4 ./examples/cpi
[proxy:0@oppenheimer] cache_put_flush (../../../../src/pm/hydra/proxy/pmip_pmi.c:185): assert (s) failed
Process 0 of 4 is on oppenheimer
Process 1 of 4 is on oppenheimer
Process 2 of 4 is on oppenheimer
Process 3 of 4 is on oppenheimer
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.000027
[proxy:0@oppenheimer] cache_put_flush (../../../../src/pm/hydra/proxy/pmip_pmi.c:185): assert (s) failed
s=@��
status=(null)
s=(null)
status=(null)
s=(null)
status=(null)

jeffhammond commented 2 weeks ago

I'm not really sure what I am looking at, but the only value I can reason about is *p=-allgather-shm-1-0, which comes from this:

    while (p) {
        struct pmip_kvs *s = NULL;
        HASH_FIND_STR(pg->kvs, *p, s);
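        if (s == NULL) {
            /* Debug output: the batched key *p has no entry in pg->kvs. */
            printf("s=%p\n", (void *) s);
            printf("pg->kvs=%p\n", (void *) pg->kvs);
            printf("p=%p\n", (void *) p);
            printf("*p=%s\n", *p);
            printf("status=%d\n", status);
        }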
        }
        HYDU_ASSERT(s, status);
        PMIU_cmd_add_str(&pmi, s->key, s->val);

        p = (const char **) utarray_next(pg->kvs_batch, p);
    }
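
The loop above asserts when a key queued in pg->kvs_batch has no matching entry in pg->kvs, so the lookup leaves s as NULL. A minimal standalone sketch of that failure mode follows. It assumes nothing about Hydra beyond the snippet: kv_entry, kv_find, and flush_batch are hypothetical names, and a linear search stands in for HASH_FIND_STR. Instead of asserting, it names each missing key, which may help narrow down where the batch and the store diverge.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a key-value entry; not Hydra's struct pmip_kvs. */
struct kv_entry {
    const char *key;
    const char *val;
};

/* Linear lookup standing in for HASH_FIND_STR: returns NULL on a miss. */
static const struct kv_entry *kv_find(const struct kv_entry *store, size_t n,
                                      const char *key)
{
    assert(key != NULL);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(store[i].key, key) == 0)
            return &store[i];
    }
    return NULL;
}

/* Walk a batch of keys and write key=val for each one found in the store.
 * A batched key that is missing from the store is reported by name on
 * stderr and skipped, rather than dereferenced through a NULL entry.
 * Returns the number of missing keys. */
static int flush_batch(const struct kv_entry *store, size_t n_store,
                       const char *const *batch, size_t n_batch, FILE *out)
{
    int missing = 0;
    for (size_t i = 0; i < n_batch; i++) {
        const struct kv_entry *s = kv_find(store, n_store, batch[i]);
        if (s == NULL) {
            fprintf(stderr, "batched key not in store: %s\n", batch[i]);
            missing++;
            continue;
        }
        fprintf(out, "%s=%s\n", s->key, s->val);
    }
    return missing;
}
```

With a store holding keys "a" and "b" and a batch of "a", "-allgather-shm-1-0", "b", flush_batch writes the two known pairs and returns 1 for the one key that was batched but never stored.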