Open angainor opened 2 weeks ago
So I did look around and read the documentation, and found out that I should use --prtemca ras_base_launch_orted_on_hn 1. But that did not help:
mpirun --prtemca ras_base_launch_orted_on_hn 1 -np 2 ~/gpubind_pmix.sh ./osu_bibw D D
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain). This is highly
unusual; your job may behave unpredictably (and/or abort) after this.
Local host: nid007961
Location: mtl_ofi_component.c:1007
Error: Function not implemented (38)
--------------------------------------------------------------------------
[nid007961:128598:0:128598] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8)
Hi,
I compiled OpenMPI v5.0.5 on LUMI (Cray HPE SS11 system with AMD CPUs and GPUs). I used the PrgEnv-gnu/8.5.0 environment and configured as:
./configure --prefix=/users/makrotki/software/openmpi5 --with-ofi=/opt/cray/libfabric/1.15.2.0/
I ran some OSU benchmarks and generally things look good. Point-to-point tests yield the same performance as Cray MPI. However, I stumbled upon a segfault in MPI_Init. Here, I allocated only 1 compute node through slurm. Then:
I tried with 16 ranks and it sometimes works, sometimes segfaults. But with 2 ranks it always segfaults. Note that I always see this message:
regardless of how many ranks I use.
The segfault is gone when I turn off ofi:
Is this a known problem?