open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Segfault on Cray HPE system #12913

angainor opened this issue 2 weeks ago (status: Open)

angainor commented 2 weeks ago

Hi,

I compiled Open MPI v5.0.5 on LUMI (a Cray HPE Slingshot 11 (SS11) system with AMD CPUs and GPUs). I used the PrgEnv-gnu/8.5.0 environment and configured as follows:

./configure --prefix=/users/makrotki/software/openmpi5 --with-ofi=/opt/cray/libfabric/1.15.2.0/
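
For reference, this is roughly how I checked that the build picked up the OFI components and which libfabric providers are visible on a compute node. ompi_info ships with my install prefix; fi_info is the standard libfabric utility, but I have not checked its exact path on LUMI, so treat that part as an assumption:

# list the OFI-related components that were built into this Open MPI
~/software/openmpi5/bin/ompi_info | grep -i ofi

# list the libfabric providers visible on a compute node
# (fi_info comes with libfabric; adjust the path if it is not on $PATH)
srun -n 1 fi_info -l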

I ran some OSU benchmarks and things generally look good; the point-to-point tests yield the same performance as Cray MPI. However, I stumbled upon a segfault in MPI_Init. For this run I allocated a single compute node through Slurm and then ran:

~/software/openmpi5/bin/mpirun -np 2 ./osu_barrier
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: nid007955
  Location: mtl_ofi_component.c:1007
  Error: Function not implemented (38)
--------------------------------------------------------------------------
[nid007955:08519] *** Process received signal ***
[nid007955:08519] Signal: Segmentation fault (11)
[nid007955:08519] Signal code: Address not mapped (1)
[nid007955:08519] Failing at address: 0x140074656e7a
[nid007955:08519] [ 0] /lib64/libpthread.so.0(+0x16910)[0x14f3d4b66910]
[nid007955:08519] [ 1] /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1(+0x3d0a6)[0x14f3cbe4e0a6]
[nid007955:08519] [ 2] /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1(+0x3cfeb)[0x14f3cbe4dfeb]
[nid007955:08519] [ 3] /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1(+0x4d7ba)[0x14f3cbe5e7ba]
[nid007955:08519] [ 4] /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1(fi_fabric+0xa2)[0x14f3cbe2a172]
[nid007955:08519] [ 5] /users/makrotki/software/openmpi5/lib/libopen-pal.so.80(+0xa3db4)[0x14f3cbfb4db4]
[nid007955:08519] [ 6] /users/makrotki/software/openmpi5/lib/libopen-pal.so.80(mca_btl_base_select+0x14d)[0x14f3cbfa1ddd]
[nid007955:08519] [ 7] /users/makrotki/software/openmpi5/lib/libmpi.so.40(mca_bml_r2_component_init+0x12)[0x14f3d503d0c2]
[nid007955:08519] [ 8] /users/makrotki/software/openmpi5/lib/libmpi.so.40(mca_bml_base_init+0x94)[0x14f3d503ae54]
[nid007955:08519] [ 9] /users/makrotki/software/openmpi5/lib/libmpi.so.40(+0x27d34a)[0x14f3d51c634a]
[nid007955:08519] [10] /users/makrotki/software/openmpi5/lib/libmpi.so.40(mca_pml_base_select+0x1ce)[0x14f3d51c287e]
[nid007955:08519] [11] /users/makrotki/software/openmpi5/lib/libmpi.so.40(+0x9a92a)[0x14f3d4fe392a]
[nid007955:08519] [12] /users/makrotki/software/openmpi5/lib/libmpi.so.40(ompi_mpi_instance_init+0x61)[0x14f3d4fe4081]
[nid007955:08519] [13] /users/makrotki/software/openmpi5/lib/libmpi.so.40(ompi_mpi_init+0x96)[0x14f3d4fdb8b6]
[nid007955:08519] [14] /users/makrotki/software/openmpi5/lib/libmpi.so.40(MPI_Init+0x5e)[0x14f3d500d46e]
[nid007955:08519] [15] ./osu_barrier[0x40675d]
[nid007955:08519] [16] ./osu_barrier[0x402810]
[nid007955:08519] [17] /lib64/libc.so.6(__libc_start_main+0xef)[0x14f3d498e24d]
[nid007955:08519] [18] ./osu_barrier[0x402d7a]
[nid007955:08519] *** End of error message ***

I tried with 16 ranks and it sometimes works, sometimes segfaults, but with 2 ranks it always segfaults. Note that I always see this message:

--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: nid007972
  Location: mtl_ofi_component.c:1007
  Error: Function not implemented (38)
--------------------------------------------------------------------------

regardless of how many ranks I use.
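
In case it helps, this is how I plan to gather more detail on why fi_domain fails. mtl_base_verbose and btl_base_verbose are the standard framework verbosity knobs, and FI_LOG_LEVEL is libfabric's own logging variable; I have not yet confirmed that they reveal anything useful here:

# turn up MTL/BTL selection verbosity and libfabric logging
~/software/openmpi5/bin/mpirun -x FI_LOG_LEVEL=warn \
    --mca mtl_base_verbose 100 --mca btl_base_verbose 100 \
    -np 2 ./osu_barrier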

The segfault is gone when I disable the ofi MTL:

~/software/openmpi5/bin/mpirun -mca mtl ^ofi -np 2 ./osu_barrier

# OSU MPI Barrier Latency Test v7.4
# Avg Latency(us)
             0.21
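
One thing I notice in the backtrace is that fi_fabric is reached from libopen-pal during mca_btl_base_select, i.e. through the OFI BTL rather than the MTL, so if the problem comes back I would also try excluding the OFI BTL. This is just the standard MCA component-exclusion syntax, not something I have needed so far:

~/software/openmpi5/bin/mpirun --mca mtl ^ofi --mca btl ^ofi -np 2 ./osu_barrier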

Is this a known problem?

angainor commented 2 weeks ago

I looked around in the documentation and found that I should use --prtemca ras_base_launch_orted_on_hn 1, but that did not help:

mpirun --prtemca ras_base_launch_orted_on_hn 1 -np 2 ~/gpubind_pmix.sh ./osu_bibw D D

--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: nid007961
  Location: mtl_ofi_component.c:1007
  Error: Function not implemented (38)
--------------------------------------------------------------------------
[nid007961:128598:0:128598] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8)
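
If it is useful, the next thing I intend to try is pinning the provider explicitly, on the assumption that the OFI components might be considering a provider other than cxi before failing. mtl_ofi_provider_include and libfabric's FI_PROVIDER environment variable are the usual knobs for this, though I have not verified that they change the outcome here:

~/software/openmpi5/bin/mpirun -x FI_PROVIDER=cxi \
    --mca mtl_ofi_provider_include cxi \
    -np 2 ./osu_barrier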