Can you confirm your MPICH library and hello_world are linked with the Slurm PMI library? The output suggests each process thinks it is a singleton, so something is wrong in the discovery of other processes in the job.
According to the output of ldd, it seems it did link to the Slurm PMI library.
srun -n 1 ldd ~/workspace/hpx-lci_scripts/spack_env/expanse/hpx-lcw/.spack-env/view/bin/hello_world
linux-vdso.so.1 (0x0000155555551000)
liblcw.so => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/lcw-master-qdc2ohhyw7cfzumwivkojiilsto66qlh/lib64/liblcw.so (0x000015555511a000)
libstdc++.so.6 => /cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/gcc-10.2.0-npcyll4gxjhf4tejksmdzlsl3d3usqpd/lib64/libstdc++.so.6 (0x0000155554d47000)
libm.so.6 => /lib64/libm.so.6 (0x00001555549c5000)
libgcc_s.so.1 => /cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/gcc-10.2.0-npcyll4gxjhf4tejksmdzlsl3d3usqpd/lib64/libgcc_s.so.1 (0x00001555547ac000)
libc.so.6 => /lib64/libc.so.6 (0x00001555543e7000)
liblci.so => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/lci-master-ladq3cnji4lzds5pvnmz3pr5kpskvhvs/lib64/liblci.so (0x00001555541c1000)
liblct.so => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/lci-master-ladq3cnji4lzds5pvnmz3pr5kpskvhvs/lib64/liblct.so (0x0000155553f78000)
libibverbs.so.1 => /lib64/libibverbs.so.1 (0x0000155553d58000)
libmpicxx.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/mpich-master-4uxeueqze7pn3732cbji36ckezyqld4o/lib/libmpicxx.so.0 (0x0000155553b35000)
libmpi.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/mpich-master-4uxeueqze7pn3732cbji36ckezyqld4o/lib/libmpi.so.0 (0x00001555534a2000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x0000155553282000)
/lib64/ld-linux-x86-64.so.2 (0x0000155555325000)
liblci-ucx.so => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/lci-master-ladq3cnji4lzds5pvnmz3pr5kpskvhvs/lib64/liblci-ucx.so (0x0000155553011000)
libnl-route-3.so.200 => /lib64/libnl-route-3.so.200 (0x0000155552d7f000)
libnl-3.so.200 => /lib64/libnl-3.so.200 (0x0000155552b5c000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000155552958000)
libhwloc.so.15 => /home/jackyan1/opt/hwloc/2.9.1/lib/libhwloc.so.15 (0x00001555526f9000)
libpciaccess.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/libpciaccess-0.17-jqqzmoorywzwslxnvh3whvxmxgggxddg/lib/libpciaccess.so.0 (0x00001555524ef000)
libxml2.so.2 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/libxml2-2.10.3-riigwi634oahw6njkyhbrhqjx2hsbjyt/lib/libxml2.so.2 (0x0000155552184000)
libucp.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/ucx-1.14.0-znbygw2dkj6m2uvbebmufqmkggyapleb/lib/libucp.so.0 (0x0000155551eb6000)
libucs.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/ucx-1.14.0-znbygw2dkj6m2uvbebmufqmkggyapleb/lib/libucs.so.0 (0x0000155551c55000)
libyaksa.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/yaksa-0.2-3r62jn5cdiiovsmntoqdrkzircgzvxqh/lib/libyaksa.so.0 (0x000015554f989000)
libxpmem.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/xpmem-2.6.5-36-n47tincumvgfjwbnhddzsqskzs7nxohd/lib/libxpmem.so.0 (0x000015554f786000)
librt.so.1 => /lib64/librt.so.1 (0x000015554f57e000)
libpmi.so.0 => /cm/shared/apps/slurm/current/lib64/libpmi.so.0 (0x000015554f378000)
libz.so.1 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/zlib-1.2.13-xhijn7cz7apogelukw47ulnzhhardvos/lib/libz.so.1 (0x000015554f160000)
liblzma.so.5 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/xz-5.4.1-knnmdfklcssmtvciq4pupvfqsh2upbzy/lib/liblzma.so.5 (0x000015554ef33000)
libiconv.so.2 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/libiconv-1.17-fdzdmyikb3i5dtfkt26raiyq63tumvnq/lib/libiconv.so.2 (0x000015554ec26000)
libuct.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/ucx-1.14.0-znbygw2dkj6m2uvbebmufqmkggyapleb/lib/libuct.so.0 (0x000015554e9eb000)
libnuma.so.1 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/numactl-2.0.14-k3pqb32bk6b5sl2c7kvzd6errjicvsye/lib/libnuma.so.1 (0x000015554e7df000)
libucm.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/ucx-1.14.0-znbygw2dkj6m2uvbebmufqmkggyapleb/lib/libucm.so.0 (0x000015554e5c4000)
libslurm_pmi.so => /cm/shared/apps/slurm/23.02.7/lib64/slurm/libslurm_pmi.so (0x000015554e1d2000)
libresolv.so.2 => /lib64/libresolv.so.2 (0x000015554dfba000)
libatomic.so.1 => /cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/gcc-10.2.0-npcyll4gxjhf4tejksmdzlsl3d3usqpd/lib64/libatomic.so.1 (0x000015554ddb2000)
I also tried the --mpi=pmi2 option of srun and got a different error:
srun -n 2 --mpi=pmi2 hello_world
[cli_0]: write_line: message string doesn't end in newline: :cmd=put kvsname=28943582.7 key=-allgather-shm-1-0-seg-1/2 value=20D0539CE2EDC88B822000637577CC2B3200F8D74F[5]4F030088C70230AC7B6CE151210600020A150136CF1917B7513832A8384F9241AF360113230082BB9321060002C6CA67C9CF1917B7513832A8384F9241AF360013230082C8B7211201028F9A9BAD3BDF1A8A98[2]F0[4]CF1917B75138539E3E4BD7E037370113230082C119210600020A160136CF1917B751383A0B345023ADAE360113230082A92342088F9A9BAD3BDF1A0A8D5377CC2B32[2]705077CCAB33004F2300883B808067[4]43088F9A9BAD3BDF1A0AA7D377CC2B32[2]705077CCAB33004F2300882D75180006[2]C0241148[10]FFFF0A1501367E3977CC2B338142364F8797713526DF230007AB73014A0C[2]278AB00FA1338142364F5C7C613526DF22[2]7AD477CC2B338142364F5C7C613526DF2200010031D15C7CE1338142364FF28969352603270003C27301B30F77CCAB338142364FF28969352603270083C8730125030094057E3977CC2B33EF483850A2E73B3532DF2300072B9701490C[2]278AB00FA133EF48385077CC2B3532DF22[2]7AD477CC2B33EF48385077CC2B3532DF2200010031D15C7CE133EF4838500DDA33353203270003419701B30F77CCAB33EF4838500DDA3335320327008347970126088F9A9BAD3BDF1A0A478BBD3706360024:
[cli_1]: write_line: message string doesn't end in newline: :cmd=put kvsname=28943582.7 key=-allgather-shm-1-1-seg-1/2 value=2056E44C658714485C2000637577CC2B3200F8D74F[5]4F0300884C479A1F197DC693210600020A150136CF1917B7513832A8384F9241AF360113230082E5BD21060002C6CA67C9CF1917B7513832A8384F9241AF36001323008293FD211201028F9A9BAD3BDF1A8A98[2]F0[4]CF1917B75138539E3E4BD7E037370113230082854F210600020A160136CF1917B751383A0B345023ADAE36011323008295F342088F9A9BAD3BDF1A0A8D5377CC2B32[2]705077CCAB33004F2300883C808067[4]43088F9A9BAD3BDF1A0AA7D377CC2B32[2]705077CCAB33004F2300882E75180006[2]C0241148[10]FFFF0A1501367E3977CC2B338142364F8797713526DF230007AC7301490C[2]278AB00FA1338142364F5C7C613526DF22[2]7AD477CC2B338142364F5C7C613526DF2200010031D15C7CE1338142364FF28969352603270003C17301B30F77CCAB338142364FF28969352603270083C7730125030094057E3977CC2B33EF483850A2E73B3532DF2300072A97014B0C[2]278AB00FA133EF48385077CC2B3532DF22[2]7AD477CC2B33EF48385077CC2B3532DF2200010031D15C7CE133EF4838500DDA33353203270003409701B30F77CCAB33EF4838500DDA3335320327008346970126088F9A9BAD3BDF1A0A478BBD3706360024:
^Csrun: interrupt (one more within 1 sec to abort)
srun: StepId=28943582.7 tasks 0-1: running
^Csrun: sending Ctrl-C to StepId=28943582.7
srun: forcing job termination
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
The srun --mpi=pmi2 route is working. But it looks like the exchanged address string gets too long to fit the PMI message limit. Not sure where the inconsistency comes from.
For the error reported when running with srun --mpi=pmi2, manually modifying the MPICH source code and reducing pmi_max_val_size by half fixed this issue. I would appreciate it if MPICH could provide an environment variable for users to control this value (like the I_MPI_PMI_VALUE_LENGTH_MAX environment variable in impi).
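For illustration, here is a minimal sketch of what such an override could look like on the MPICH side; the variable name MPICH_PMI2_VALLEN_MAX and the helper are hypothetical, not an existing MPICH interface.

#include <stdlib.h>
#include <slurm/pmi2.h>

/* Hypothetical helper: start from the header constant PMI2_MAX_VALLEN and let
 * an environment variable shrink the value length actually used for puts.
 * The variable name is made up for this sketch. */
static int get_pmi_max_val_size(void)
{
    int max_val = PMI2_MAX_VALLEN;
    const char *s = getenv("MPICH_PMI2_VALLEN_MAX");
    if (s != NULL) {
        int requested = atoi(s);
        /* Only allow shrinking; growing past the header constant is unsafe. */
        if (requested > 0 && requested < max_val)
            max_val = requested;
    }
    return max_val;
}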
FWIW, a simple Slurm+PMI2 example putting a max size value does not hang on the Bebop cluster here at Argonne. There may still be a bug in the segmented put in the MPIR layer for PMI2. Will investigate further...
#include <slurm/pmi2.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(void)
{
int has_parent, size, rank, appnum;
int pmi_max_val = PMI2_MAX_VALLEN;
int pmi_max_key = PMI2_MAX_KEYLEN;
PMI2_Init(&has_parent, &size, &rank, &appnum);
char *pmi_kvs_name = malloc(pmi_max_val);
PMI2_Job_GetId(pmi_kvs_name, pmi_max_val);
char *valbuf = malloc(pmi_max_val);
memset(valbuf, 'a', pmi_max_val);
valbuf[pmi_max_val - 1] = '\0';
PMI2_KVS_Put("foo", valbuf);
PMI2_KVS_Fence();
int out_len;
PMI2_KVS_Get(pmi_kvs_name, PMI2_ID_NULL, "bar", NULL, 0, &out_len);
PMI2_Finalize();
return 0;
}
What Jiakun pointed out is that the PMI2_MAX_VALLEN in pmi2.h is likely too big. It has historically been 1024. When we exchange addresses and the address is too long, we segment it according to PMI2_MAX_VALLEN, and that apparently overflows libpmi2 in Slurm.
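For context, here is a rough illustration of what such a segmented put looks like; this is not the actual MPIR code, and the key format only mimics the one visible in the error output. The long address string is chopped into chunks of at most PMI2_MAX_VALLEN bytes, and each full-size chunk is what ends up hitting the library's internal limit.

#include <stdio.h>
#include <string.h>
#include <slurm/pmi2.h>

/* Rough sketch of a segmented put: split "val" into chunks no larger than
 * PMI2_MAX_VALLEN and store each one under a numbered segment key.  The real
 * MPIR code differs in details; this only illustrates the mechanism. */
static void put_segmented(int rank, const char *val)
{
    size_t len = strlen(val);
    int nseg = (int) ((len + PMI2_MAX_VALLEN - 1) / PMI2_MAX_VALLEN);
    for (int i = 0; i < nseg; i++) {
        char key[PMI2_MAX_KEYLEN];
        char seg[PMI2_MAX_VALLEN + 1];
        size_t off = (size_t) i * PMI2_MAX_VALLEN;
        size_t n = len - off < PMI2_MAX_VALLEN ? len - off : PMI2_MAX_VALLEN;
        snprintf(key, sizeof key, "-allgather-shm-1-%d-seg-%d/%d", rank, i + 1, nseg);
        memcpy(seg, val + off, n);
        seg[n] = '\0';
        /* A full-size segment plus the cmd/kvsname/key overhead is where the
         * "message string doesn't end in newline" failure shows up. */
        PMI2_KVS_Put(key, seg);
    }
}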
So I think the right solution is to fix PMI2_MAX_VALLEN in pmi2.h. The header should be consistent with the library libpmi2.so.
If we want to add an environment override, it should be named PMI2_MAX_VALLEN, IMO.
@JiakunYan Does the example in https://github.com/pmodels/mpich/issues/6924#issuecomment-2127357799 work on SDSC Expanse?
So I think the right solution is to fix PMI2_MAX_VALLEN in pmi2.h. The header should be consistent with the library libpmi2.so.
I am saying that the library accepts a value equal to the maximum value length on Bebop. We should confirm it can do the same on Expanse before we say this is a bug in the header.
FWIW, a simple Slurm+PMI2 example putting a max size value does not hang on the Bebop cluster here at Argonne. There may still be a bug in the segmented put in the MPIR layer for PMI2. Will investigate further...
Make sure to check the return codes from the PMI2 functions. There may be errors.
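For example, a minimal sketch of such checking (assuming Slurm's pmi2.h defines PMI2_SUCCESS; the macro name is made up here):

#include <slurm/pmi2.h>
#include <stdio.h>
#include <stdlib.h>

/* Abort with a message if a PMI2 call does not return PMI2_SUCCESS, so that
 * silent failures in the test program become visible. */
#define CHECK_PMI2(call)                                              \
    do {                                                              \
        int rc_ = (call);                                             \
        if (rc_ != PMI2_SUCCESS) {                                    \
            fprintf(stderr, "%s failed with rc=%d\n", #call, rc_);    \
            exit(1);                                                  \
        }                                                             \
    } while (0)

Usage would be, e.g., CHECK_PMI2(PMI2_KVS_Put("foo", valbuf)); around each call in the example above.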
Here's output from a modified example that puts and then gets the key.
#include <slurm/pmi2.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
int main(void)
{
int has_parent, size, rank, appnum;
int pmi_max_val = PMI2_MAX_VALLEN;
int pmi_max_key = PMI2_MAX_KEYLEN;
PMI2_Init(&has_parent, &size, &rank, &appnum);
char *pmi_kvs_name = malloc(pmi_max_val);
PMI2_Job_GetId(pmi_kvs_name, pmi_max_val);
char *valbuf = malloc(pmi_max_val + 1);
memset(valbuf, 'a', pmi_max_val);
valbuf[pmi_max_val] = '\0';
assert(strlen(valbuf) <= (size_t) pmi_max_val);
printf("vallen = %d, max = %d\n", (int) strlen(valbuf), pmi_max_val);
PMI2_KVS_Put("foo", valbuf);
printf("put %s into kvs\n", valbuf);
PMI2_KVS_Fence();
int out_len;
char *outbuf = malloc(pmi_max_val + 1);
PMI2_KVS_Get(pmi_kvs_name, PMI2_ID_NULL, "foo", outbuf, pmi_max_val + 1, &out_len);
printf("out_len = %d, strlen = %d\n", out_len, strlen(outbuf));
printf("got %s from kvs\n", outbuf);
PMI2_Finalize();
free(valbuf);
free(outbuf);
free(pmi_kvs_name);
return 0;
}
[raffenet@beboplogin4]~% srun --mpi=pmi2 ./a.out
vallen = 1024, max = 1024
put aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa into kvs
out_len = 1024, strlen = 1024
got aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa from kvs
[cli_0]: write_line: message string doesn't end in newline: :cmd=put kvsname=28943582.7 key=-allgather-shm-1-0-seg-1/2 value=20D0539CE2...
I suspect the PMI2_MAX_VALLEN didn't account for the size of the overhead, i.e. cmd=put kvsname=28943582.7 key=-allgather-shm-1-0-seg-1/2 value=. Again, this should be accounted for by the libpmi implementations, since the upper layer is not aware of the internal protocols. I am not against allowing users to override it via the environment, but it should be commented with clear reasons.
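To make the overhead concrete, here is a rough back-of-the-envelope check; the idea that the receiving side budgets its line buffer around the bare value length is an assumption on my part, not something confirmed from the Slurm source.

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* The wire command observed in the error, minus the value payload. */
    const char *prefix =
        "cmd=put kvsname=28943582.7 key=-allgather-shm-1-0-seg-1/2 value=";
    int vallen = 1024;   /* PMI2_MAX_VALLEN from pmi2.h */
    size_t total = strlen(prefix) + (size_t) vallen + 1;   /* + trailing newline */
    /* If the receiver only budgets for the value itself, a full-size segment
     * plus this prefix no longer fits on one line. */
    printf("prefix = %zu bytes, total line = %zu bytes\n", strlen(prefix), total);
    return 0;
}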
Here's output from a modified example that puts and then gets the key.
Yeah, this seems to support that Slurm is able to accommodate a 1024-byte value size -- its internal buffer size must be even larger.
libpmi.so.0 => /cm/shared/apps/slurm/current/lib64/libpmi.so.0 (0x000015554f378000)
Oh, just realized @JiakunYan was linking with PMI-1 rather than PMI-2. @raffenet We need to test Slurm's PMI-1.
@raffenet @hzhou https://pm.bsc.es/gitlab/rarias/bscpkgs/-/issues/126 might be helpful in explaining the situation.
I think it is related to the PMI implementation in specific Slurm versions.
@JiakunYan thanks, that is helpful. I probably need to run on multiple nodes to trigger the problem. Will try again when I have a chance. It would be good, IMO, to submit a ticket with a PMI-only reproducer to Slurm.
I am glad to find that the newest MPICH no longer has this issue. Thanks!
I am getting the following error when trying to run MPICH on SDSC Expanse (an InfiniBand machine with Slurm).
mpichversion output
Any idea why this could happen?