pmodels / mpich


PMI error when running on SDSC Expanse #6924

Closed: JiakunYan closed this issue 4 weeks ago

JiakunYan commented 8 months ago

I am getting the following error when trying to run MPICH on SDSC Expanse (an InfiniBand machine with Slurm).

srun -n 2 hello_world
PMII_singinit: execv failed: No such file or directory
[unset]: This singleton init program attempted to access some feature
[unset]: for which process manager support was required, e.g. spawn or universe_size.
[unset]: But the necessary mpiexec is not in your path.
PMII_singinit: execv failed: No such file or directory
[unset]: This singleton init program attempted to access some feature
[unset]: for which process manager support was required, e.g. spawn or universe_size.
[unset]: But the necessary mpiexec is not in your path.
[unset]: write_line error; fd=-1 buf=:cmd=get kvsname=singinit_kvs_943014_0 key=PMI_mpi_memory_alloc_kinds :
system msg for write_line failure : Bad file descriptor
[unset]: write_line error; fd=-1 buf=:cmd=get kvsname=singinit_kvs_943015_0 key=PMI_mpi_memory_alloc_kinds :
system msg for write_line failure : Bad file descriptor
exp-9-17: 0 / 1 OK
exp-9-17: 0 / 1 OK
^Csrun: interrupt (one more within 1 sec to abort)
srun: StepId=28936391.1 tasks 0-1: running
^Csrun: sending Ctrl-C to StepId=28936391.1
srun: forcing job termination
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: STEP 28936391.1 ON exp-9-17 CANCELLED AT 2024-02-26T11:17:29

mpichversion output

MPICH Version:      4.3.0a1
MPICH Release date: unreleased development copy
MPICH ABI:          0:0:0
MPICH Device:       ch4:ucx
MPICH configure:    --prefix=/home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/mpich-master-4uxeueqze7pn3732cbji36ckezyqld4o --disable-silent-rules --enable-shared --with-pm=no --enable-romio --without-ibverbs --enable-wrapper-rpath=yes --with-yaksa=/home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/yaksa-0.2-3r62jn5cdiiovsmntoqdrkzircgzvxqh --with-hwloc=/home/jackyan1/opt/hwloc/2.9.1 --with-slurm=yes --with-slurm-include=/cm/shared/apps/slurm/current/include --with-slurm-lib=/cm/shared/apps/slurm/current/lib --with-pmi=slurm --without-cuda --without-hip --with-device=ch4:ucx --with-ucx=/home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/ucx-1.14.0-znbygw2dkj6m2uvbebmufqmkggyapleb --enable-libxml2 --enable-thread-cs=per-vci --with-datatype-engine=auto
MPICH CC:           /home/jackyan1/workspace/spack/lib/spack/env/gcc/gcc -O2
MPICH CXX:          /home/jackyan1/workspace/spack/lib/spack/env/gcc/g++ -O2
MPICH F77:          /home/jackyan1/workspace/spack/lib/spack/env/gcc/gfortran -O2
MPICH FC:           /home/jackyan1/workspace/spack/lib/spack/env/gcc/gfortran -O2
MPICH features:     threadcomm

Any idea why this could happen?

raffenet commented 8 months ago

Can you confirm your MPICH library and hello_world are linked with the Slurm PMI library? The output suggests each process thinks it is a singleton, so something is wrong in the discovery of other processes in the job.

JiakunYan commented 8 months ago

According to the ldd output, it does link against the Slurm PMI library.

srun -n 1 ldd ~/workspace/hpx-lci_scripts/spack_env/expanse/hpx-lcw/.spack-env/view/bin/hello_world
  linux-vdso.so.1 (0x0000155555551000)
  liblcw.so => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/lcw-master-qdc2ohhyw7cfzumwivkojiilsto66qlh/lib64/liblcw.so (0x000015555511a000)
  libstdc++.so.6 => /cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/gcc-10.2.0-npcyll4gxjhf4tejksmdzlsl3d3usqpd/lib64/libstdc++.so.6 (0x0000155554d47000)
  libm.so.6 => /lib64/libm.so.6 (0x00001555549c5000)
  libgcc_s.so.1 => /cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/gcc-10.2.0-npcyll4gxjhf4tejksmdzlsl3d3usqpd/lib64/libgcc_s.so.1 (0x00001555547ac000)
  libc.so.6 => /lib64/libc.so.6 (0x00001555543e7000)
  liblci.so => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/lci-master-ladq3cnji4lzds5pvnmz3pr5kpskvhvs/lib64/liblci.so (0x00001555541c1000)
  liblct.so => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/lci-master-ladq3cnji4lzds5pvnmz3pr5kpskvhvs/lib64/liblct.so (0x0000155553f78000)
  libibverbs.so.1 => /lib64/libibverbs.so.1 (0x0000155553d58000)
  libmpicxx.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/mpich-master-4uxeueqze7pn3732cbji36ckezyqld4o/lib/libmpicxx.so.0 (0x0000155553b35000)
  libmpi.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/mpich-master-4uxeueqze7pn3732cbji36ckezyqld4o/lib/libmpi.so.0 (0x00001555534a2000)
  libpthread.so.0 => /lib64/libpthread.so.0 (0x0000155553282000)
  /lib64/ld-linux-x86-64.so.2 (0x0000155555325000)
  liblci-ucx.so => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/lci-master-ladq3cnji4lzds5pvnmz3pr5kpskvhvs/lib64/liblci-ucx.so (0x0000155553011000)
  libnl-route-3.so.200 => /lib64/libnl-route-3.so.200 (0x0000155552d7f000)
  libnl-3.so.200 => /lib64/libnl-3.so.200 (0x0000155552b5c000)
  libdl.so.2 => /lib64/libdl.so.2 (0x0000155552958000)
  libhwloc.so.15 => /home/jackyan1/opt/hwloc/2.9.1/lib/libhwloc.so.15 (0x00001555526f9000)
  libpciaccess.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/libpciaccess-0.17-jqqzmoorywzwslxnvh3whvxmxgggxddg/lib/libpciaccess.so.0 (0x00001555524ef000)
  libxml2.so.2 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/libxml2-2.10.3-riigwi634oahw6njkyhbrhqjx2hsbjyt/lib/libxml2.so.2 (0x0000155552184000)
  libucp.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/ucx-1.14.0-znbygw2dkj6m2uvbebmufqmkggyapleb/lib/libucp.so.0 (0x0000155551eb6000)
  libucs.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/ucx-1.14.0-znbygw2dkj6m2uvbebmufqmkggyapleb/lib/libucs.so.0 (0x0000155551c55000)
  libyaksa.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/yaksa-0.2-3r62jn5cdiiovsmntoqdrkzircgzvxqh/lib/libyaksa.so.0 (0x000015554f989000)
  libxpmem.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/xpmem-2.6.5-36-n47tincumvgfjwbnhddzsqskzs7nxohd/lib/libxpmem.so.0 (0x000015554f786000)
  librt.so.1 => /lib64/librt.so.1 (0x000015554f57e000)
  libpmi.so.0 => /cm/shared/apps/slurm/current/lib64/libpmi.so.0 (0x000015554f378000)
  libz.so.1 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/zlib-1.2.13-xhijn7cz7apogelukw47ulnzhhardvos/lib/libz.so.1 (0x000015554f160000)
  liblzma.so.5 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/xz-5.4.1-knnmdfklcssmtvciq4pupvfqsh2upbzy/lib/liblzma.so.5 (0x000015554ef33000)
  libiconv.so.2 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/libiconv-1.17-fdzdmyikb3i5dtfkt26raiyq63tumvnq/lib/libiconv.so.2 (0x000015554ec26000)
  libuct.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/ucx-1.14.0-znbygw2dkj6m2uvbebmufqmkggyapleb/lib/libuct.so.0 (0x000015554e9eb000)
  libnuma.so.1 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/numactl-2.0.14-k3pqb32bk6b5sl2c7kvzd6errjicvsye/lib/libnuma.so.1 (0x000015554e7df000)
  libucm.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/ucx-1.14.0-znbygw2dkj6m2uvbebmufqmkggyapleb/lib/libucm.so.0 (0x000015554e5c4000)
  libslurm_pmi.so => /cm/shared/apps/slurm/23.02.7/lib64/slurm/libslurm_pmi.so (0x000015554e1d2000)
  libresolv.so.2 => /lib64/libresolv.so.2 (0x000015554dfba000)
  libatomic.so.1 => /cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/gcc-10.2.0-npcyll4gxjhf4tejksmdzlsl3d3usqpd/lib64/libatomic.so.1 (0x000015554ddb2000)

I also tried the --mpi=pmi2 option of srun and got a different error:

srun -n 2 --mpi=pmi2 hello_world [cli_0]: write_line: message string doesn't end in newline: :cmd=put kvsname=28943582.7 key=-allgather-shm-1-0-seg-1/2 value=20D0539CE2EDC88B822000637577CC2B3200F8D74F[5]4F030088C70230AC7B6CE151210600020A150136CF1917B7513832A8384F9241AF360113230082BB9321060002C6CA67C9CF1917B7513832A8384F9241AF360013230082C8B7211201028F9A9BAD3BDF1A8A98[2]F0[4]CF1917B75138539E3E4BD7E037370113230082C119210600020A160136CF1917B751383A0B345023ADAE360113230082A92342088F9A9BAD3BDF1A0A8D5377CC2B32[2]705077CCAB33004F2300883B808067[4]43088F9A9BAD3BDF1A0AA7D377CC2B32[2]705077CCAB33004F2300882D75180006[2]C0241148[10]FFFF0A1501367E3977CC2B338142364F8797713526DF230007AB73014A0C[2]278AB00FA1338142364F5C7C613526DF22[2]7AD477CC2B338142364F5C7C613526DF2200010031D15C7CE1338142364FF28969352603270003C27301B30F77CCAB338142364FF28969352603270083C8730125030094057E3977CC2B33EF483850A2E73B3532DF2300072B9701490C[2]278AB00FA133EF48385077CC2B3532DF22[2]7AD477CC2B33EF48385077CC2B3532DF2200010031D15C7CE133EF4838500DDA33353203270003419701B30F77CCAB33EF4838500DDA3335320327008347970126088F9A9BAD3BDF1A0A478BBD3706360024: [cli_1]: write_line: message string doesn't end in newline: :cmd=put kvsname=28943582.7 key=-allgather-shm-1-1-seg-1/2 value=2056E44C658714485C2000637577CC2B3200F8D74F[5]4F0300884C479A1F197DC693210600020A150136CF1917B7513832A8384F9241AF360113230082E5BD21060002C6CA67C9CF1917B7513832A8384F9241AF36001323008293FD211201028F9A9BAD3BDF1A8A98[2]F0[4]CF1917B75138539E3E4BD7E037370113230082854F210600020A160136CF1917B751383A0B345023ADAE36011323008295F342088F9A9BAD3BDF1A0A8D5377CC2B32[2]705077CCAB33004F2300883C808067[4]43088F9A9BAD3BDF1A0AA7D377CC2B32[2]705077CCAB33004F2300882E75180006[2]C0241148[10]FFFF0A1501367E3977CC2B338142364F8797713526DF230007AC7301490C[2]278AB00FA1338142364F5C7C613526DF22[2]7AD477CC2B338142364F5C7C613526DF2200010031D15C7CE1338142364FF28969352603270003C17301B30F77CCAB338142364FF28969352603270083C7730125030094057E3977CC2B33EF483850A2E73B3532DF2300072A97014B0C[2]278AB00FA133EF48385077CC2B3532DF22[2]7AD477CC2B33EF48385077CC2B3532DF2200010031D15C7CE133EF4838500DDA33353203270003409701B30F77CCAB33EF4838500DDA3335320327008346970126088F9A9BAD3BDF1A0A478BBD3706360024: ^Csrun: interrupt (one more within 1 sec to abort) srun: StepId=28943582.7 tasks 0-1: running ^Csrun: sending Ctrl-C to StepId=28943582.7 srun: forcing job termination srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

hzhou commented 8 months ago

srun --mpi=pmi2 is working, but it looks like the exchanged address string gets too long to fit within the PMI message limit. Not sure where the inconsistency comes from.

JiakunYan commented 6 months ago

For the error reported with srun --mpi=pmi2, manually modifying the MPICH source code to reduce pmi_max_val_size by half fixed the issue. I would appreciate it if MPICH could provide an environment variable that lets users control this value (like the I_MPI_PMI_VALUE_LENGTH_MAX environment variable in Intel MPI).
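For illustration only, such an override could look roughly like the sketch below. The variable name MPICH_PMI_VALUE_LENGTH_MAX is hypothetical (no such variable exists today), and the real change would live inside MPICH's PMI layer rather than in a standalone program.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sketch, not actual MPICH code: cap the PMI value length
 * from an environment variable, defaulting to the header-provided maximum. */
static int get_pmi_max_val_size(int header_max)
{
  int max = header_max;                                  /* e.g. PMI2_MAX_VALLEN */
  const char *s = getenv("MPICH_PMI_VALUE_LENGTH_MAX");  /* assumed name */
  if (s != NULL) {
    int v = atoi(s);
    if (v > 0 && v < max)
      max = v;                                           /* only allow shrinking the limit */
  }
  return max;
}

int main(void)
{
  /* With the header maximum of 1024, exporting MPICH_PMI_VALUE_LENGTH_MAX=512
   * would mimic the "reduce pmi_max_val_size by half" workaround described above. */
  printf("effective pmi_max_val_size = %d\n", get_pmi_max_val_size(1024));
  return 0;
}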

raffenet commented 5 months ago

FWIW, a simple Slurm+PMI2 example that puts a max-size value does not hang on the Bebop cluster here at Argonne. There may still be a bug in the segmented put in the MPIR layer for PMI2. Will investigate further...

#include <slurm/pmi2.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
  int has_parent, size, rank, appnum;
  int pmi_max_val = PMI2_MAX_VALLEN;
  int pmi_max_key = PMI2_MAX_KEYLEN;

  PMI2_Init(&has_parent, &size, &rank, &appnum);
  char *pmi_kvs_name = malloc(pmi_max_val);
  PMI2_Job_GetId(pmi_kvs_name, pmi_max_val);

  char *valbuf = malloc(pmi_max_val);
  memset(valbuf, 'a', pmi_max_val);
  valbuf[pmi_max_val - 1] = '\0';

  PMI2_KVS_Put("foo", valbuf);
  PMI2_KVS_Fence();
  int out_len;
  PMI2_KVS_Get(pmi_kvs_name, PMI2_ID_NULL, "bar", NULL, 0, &out_len);

  PMI2_Finalize();
  return 0;
}

hzhou commented 5 months ago

What Jiakun pointed out is that PMI2_MAX_VALLEN in pmi2.h is likely too big. Historically it has been 1024. When we exchange addresses and the address string is too long, we segment it according to PMI2_MAX_VALLEN, and that apparently overflows libpmi2 in Slurm.
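Roughly speaking, the segmentation works like the sketch below. This is a simplified illustration, not MPICH's actual implementation; the put_segmented helper and its key naming are modeled on the -allgather-shm-1-0-seg-1/2 keys visible in the error output above.

#include <slurm/pmi2.h>
#include <stdio.h>
#include <string.h>

/* Simplified illustration of splitting a long value across several PMI2 puts. */
static void put_segmented(const char *base_key, const char *value)
{
  int len = (int) strlen(value);
  int seg_size = PMI2_MAX_VALLEN - 1;          /* leave room for the terminating NUL */
  int nseg = (len + seg_size - 1) / seg_size;

  for (int i = 0; i < nseg; i++) {
    char key[PMI2_MAX_KEYLEN];
    char seg[PMI2_MAX_VALLEN];
    int n = len - i * seg_size;
    if (n > seg_size)
      n = seg_size;
    /* Keys of the form "<base>-seg-<i>/<nseg>", as seen in the error output. */
    snprintf(key, sizeof(key), "%s-seg-%d/%d", base_key, i + 1, nseg);
    memcpy(seg, value + i * seg_size, n);
    seg[n] = '\0';
    PMI2_KVS_Put(key, seg);
  }
}

int main(void)
{
  int has_parent, size, rank, appnum;
  PMI2_Init(&has_parent, &size, &rank, &appnum);

  /* Build an oversized dummy "address" and put it in PMI2_MAX_VALLEN-sized pieces. */
  char big[4 * PMI2_MAX_VALLEN];
  memset(big, 'x', sizeof(big) - 1);
  big[sizeof(big) - 1] = '\0';
  put_segmented("-allgather-shm-1-0", big);

  PMI2_KVS_Fence();
  PMI2_Finalize();
  return 0;
}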

hzhou commented 5 months ago

So I think the right solution is to fix PMI2_MAX_VALLEN in pmi2.h. The header should be consistent with the library libpmi2.so.

If we want to add an environment override, it should be named PMI2_MAX_VALLEN, IMO.

hzhou commented 5 months ago

@JiakunYan Does the example in https://github.com/pmodels/mpich/issues/6924#issuecomment-2127357799 work on SDSC Expanse?

raffenet commented 5 months ago

So I think the right solution is to fix PMI2_MAX_VALLEN in pmi2.h. The header should be consistent with the library libpmi2.so.

I am saying that the library accepts a value equal to the maximum value length on Bebop. We should confirm it can do the same on Expanse before we say this is a bug in the header.

hzhou commented 5 months ago

FWIW, a simple Slurm+PMI2 example that puts a max-size value does not hang on the Bebop cluster here at Argonne. There may still be a bug in the segmented put in the MPIR layer for PMI2. Will investigate further...

Make sure to check the return codes from the PMI2 functions. They may be returning errors.
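For example, a minimal check around the calls from the earlier example could look like this (PMI2_SUCCESS is the success code defined in pmi2.h):

#include <slurm/pmi2.h>
#include <stdio.h>
#include <stdlib.h>

/* Abort with a message if a PMI2 call does not return PMI2_SUCCESS. */
#define CHECK_PMI2(call)                                           \
  do {                                                             \
    int rc_ = (call);                                              \
    if (rc_ != PMI2_SUCCESS) {                                     \
      fprintf(stderr, "%s failed with rc=%d\n", #call, rc_);       \
      exit(1);                                                     \
    }                                                              \
  } while (0)

int main(void)
{
  int has_parent, size, rank, appnum;
  CHECK_PMI2(PMI2_Init(&has_parent, &size, &rank, &appnum));
  CHECK_PMI2(PMI2_KVS_Put("foo", "bar"));
  CHECK_PMI2(PMI2_KVS_Fence());
  CHECK_PMI2(PMI2_Finalize());
  return 0;
}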

raffenet commented 5 months ago

Here's output from a modified example that puts and then gets the key.

#include <slurm/pmi2.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

int main(void)
{
  int has_parent, size, rank, appnum;
  int pmi_max_val = PMI2_MAX_VALLEN;
  int pmi_max_key = PMI2_MAX_KEYLEN;

  PMI2_Init(&has_parent, &size, &rank, &appnum);
  char *pmi_kvs_name = malloc(pmi_max_val);
  PMI2_Job_GetId(pmi_kvs_name, pmi_max_val);

  char *valbuf = malloc(pmi_max_val + 1);
  memset(valbuf, 'a', pmi_max_val);
  valbuf[pmi_max_val] = '\0';
  assert(strlen(valbuf) <= pmi_max_val);
  printf("vallen = %d, max = %d\n", strlen(valbuf), pmi_max_val);

  PMI2_KVS_Put("foo", valbuf);
  printf("put %s into kvs\n", valbuf);
  PMI2_KVS_Fence();
  int out_len;
  char *outbuf = malloc(pmi_max_val + 1);
  PMI2_KVS_Get(pmi_kvs_name, PMI2_ID_NULL, "foo", outbuf, pmi_max_val + 1, &out_len);
  printf("out_len = %d, strlen = %d\n", out_len, strlen(outbuf));
  printf("got %s from kvs\n", outbuf);

  PMI2_Finalize();

  free(valbuf);
  free(outbuf);
  free(pmi_kvs_name);

  return 0;
}
[raffenet@beboplogin4]~% srun --mpi=pmi2 ./a.out
vallen = 1024, max = 1024
put aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa into kvs
out_len = 1024, strlen = 1024
got aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa from kvs
hzhou commented 5 months ago

[cli_0]: write_line: message string doesn't end in newline: :cmd=put kvsname=28943582.7 key=-allgather-shm-1-0-seg-1/2 value=20D0539CE2...

I suspect PMI2_MAX_VALLEN doesn't account for the size of the overhead, i.e. the cmd=put kvsname=28943582.7 key=-allgather-shm-1-0-seg-1/2 value= prefix. Again, this should be accounted for by the libpmi implementations, since the upper layer is not aware of the internal protocols. I am not against allowing users to override it via the environment, but it should be commented with clear reasons.
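A back-of-the-envelope check of that overhead, using assumed numbers rather than Slurm's actual internal buffer sizes:

#include <stdio.h>
#include <string.h>

int main(void)
{
  /* The wire message carries the command, KVS name, and key in addition to
   * the value itself.  If a server-side buffer were sized only for the value
   * (PMI2_MAX_VALLEN, historically 1024), a maximum-length value plus this
   * prefix would not fit.  The prefix below is copied from the error output. */
  const char *prefix =
    "cmd=put kvsname=28943582.7 key=-allgather-shm-1-0-seg-1/2 value=";
  int max_vallen = 1024;
  printf("prefix = %zu bytes, value = %d bytes, total = %zu bytes\n",
         strlen(prefix), max_vallen, strlen(prefix) + max_vallen);
  return 0;
}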

hzhou commented 5 months ago

Here's output from a modified example that puts and then gets the key.

Yeah, this seems to support that Slurm is able to accommodate a 1024-byte value size -- its internal buffer size must be even larger.

hzhou commented 5 months ago

libpmi.so.0 => /cm/shared/apps/slurm/current/lib64/libpmi.so.0 (0x000015554f378000)

Oh, just realized @JiakunYan was linking with PMI-1 rather than PMI-2. @raffenet We need to test Slurm's PMI-1.
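A PMI-1 analogue of the earlier reproducer might look like the sketch below; it uses the standard PMI-1 calls from Slurm's <slurm/pmi.h> but has not been run on Expanse, so treat it as a starting point rather than a verified test.

#include <slurm/pmi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
  int spawned, size, rank;
  int name_max, val_max;

  PMI_Init(&spawned);
  PMI_Get_size(&size);
  PMI_Get_rank(&rank);
  PMI_KVS_Get_name_length_max(&name_max);
  PMI_KVS_Get_value_length_max(&val_max);

  char *kvsname = malloc(name_max);
  PMI_KVS_Get_my_name(kvsname, name_max);

  /* Put a maximum-length value, then read it back after the commit/barrier. */
  char *valbuf = malloc(val_max);
  memset(valbuf, 'a', val_max - 1);
  valbuf[val_max - 1] = '\0';

  PMI_KVS_Put(kvsname, "foo", valbuf);
  PMI_KVS_Commit(kvsname);
  PMI_Barrier();

  char *outbuf = malloc(val_max);
  PMI_KVS_Get(kvsname, "foo", outbuf, val_max);
  printf("rank %d/%d: val_max = %d, got strlen = %zu\n",
         rank, size, val_max, strlen(outbuf));

  PMI_Finalize();
  free(kvsname);
  free(valbuf);
  free(outbuf);
  return 0;
}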

JiakunYan commented 5 months ago

@raffenet @hzhou https://pm.bsc.es/gitlab/rarias/bscpkgs/-/issues/126 might be helpful in explaining the situation.

I think it is related to the PMI implementation of specific Slurm versions.

raffenet commented 5 months ago

@JiakunYan thanks, that is helpful. I probably need to run on multiple nodes to trigger the problem. Will try again when I have a chance. It would be good, IMO, to submit a ticket with a PMI-only reproducer to Slurm.

JiakunYan commented 4 weeks ago

I am glad to find that the newest MPICH no longer has this issue. Thanks!