Open jeffhammond opened 3 weeks ago
For example, cpi prints the assert failures but still succeeds.
~/mpich-nvhpc-ch4-ucx-install/bin/mpirun -n 4 ./cpi
[proxy:0@oppenheimer] cache_put_flush (../../../../src/pm/hydra/proxy/pmip_pmi.c:183): assert (s) failed
Process 0 of 4 is on oppenheimer
Process 1 of 4 is on oppenheimer
Process 3 of 4 is on oppenheimer
Process 2 of 4 is on oppenheimer
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.000027
[proxy:0@oppenheimer] cache_put_flush (../../../../src/pm/hydra/proxy/pmip_pmi.c:183): assert (s) failed
This is the result with printf of the two arguments to the assert that fails...
~/MPI/mpich/build$ ~/MPI/mpich-nvhpc-ch4-ucx-install/bin/mpirun -n 4 ./examples/cpi
[proxy:0@oppenheimer] cache_put_flush (../../../../src/pm/hydra/proxy/pmip_pmi.c:185): assert (s) failed
Process 0 of 4 is on oppenheimer
Process 1 of 4 is on oppenheimer
Process 2 of 4 is on oppenheimer
Process 3 of 4 is on oppenheimer
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.000027
[proxy:0@oppenheimer] cache_put_flush (../../../../src/pm/hydra/proxy/pmip_pmi.c:185): assert (s) failed
s=�_�
status=0
s=(null)
status=0
s=(null)
status=0
This is the result with printf of the two arguments to the assert that fails...
~/MPI/mpich/build$ ~/MPI/mpich-nvhpc-ch4-ucx-install/bin/mpirun -n 4 ./examples/cpi
[proxy:0@oppenheimer] cache_put_flush (../../../../src/pm/hydra/proxy/pmip_pmi.c:185): assert (s) failed
Process 0 of 4 is on oppenheimer
Process 1 of 4 is on oppenheimer
Process 2 of 4 is on oppenheimer
Process 3 of 4 is on oppenheimer
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.000027
[proxy:0@oppenheimer] cache_put_flush (../../../../src/pm/hydra/proxy/pmip_pmi.c:185): assert (s) failed
s=@��
status=(null)
s=(null)
status=(null)
s=(null)
status=(null)
I'm not really sure what I am looking at, but the only value I can reason about is *p=-allgather-shm-1-0, which comes from this:
while (p) {
    struct pmip_kvs *s = NULL;
    HASH_FIND_STR(pg->kvs, *p, s);
    if (s == 0) {
        printf("s=%p\n", s);
        printf("pg->kvs=%p\n", pg->kvs);
        printf("p=%p\n", p);
        printf("*p=%s\n", *p);
        printf("status=%d\n", status);
    }
    HYDU_ASSERT(s, status);
    PMIU_cmd_add_str(&pmi, s->key, s->val);
    p = (const char **) utarray_next(pg->kvs_batch, p);
}
I am using commit 0b1a4ba6995c28e1c9f797b585df1f83bc56b2b6 and see this error with every MPI test, but not with hostname. There is no impact on MPI test behavior that I can see.
I built MPICH with NVHPC 24.9 compilers and UCX.
I ran rm -rf /tmp/* just in case there were some remnants there, but saw no change.