Closed gcormier closed 5 years ago
Can you confirm you are trying to run on a single node? Does the command complete or hang?
You can first try running a non-MPI app:
mpirun -np 1 hostname
If it works, try collecting some more logs
mpirun --mca iof_base_verbose 10 --mca pml_base_verbose 10 --mca odls_base_verbose 10 -np 2 IMB-MPI1
I found that the AWS EFA installer will pull down a few things and put itself in the path, so there was a bit of a nightmare going on. Fixed that up, so now I have my actual binaries being used.
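For anyone hitting the same PATH shadowing, listing every copy of a binary on PATH in lookup order makes it easy to spot. This is a minimal POSIX shell sketch, not something the EFA installer provides; the function name `list_on_path` is made up for illustration.

```shell
# List every executable called NAME on a colon-separated path list,
# in lookup order. The first line printed is the copy the shell would run.
# Usage: list_on_path NAME [PATHLIST]   (PATHLIST defaults to $PATH)
# The body runs in a subshell so the IFS and globbing changes do not leak.
list_on_path() (
    name=$1
    paths=${2:-$PATH}
    IFS=:
    set -f   # disable globbing while the path list is being split
    for dir in $paths; do
        # An empty PATH entry means the current directory.
        [ -n "$dir" ] || dir=.
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
)
```

Running `list_on_path mpirun` (and the same for `IMB-MPI1`) prints one path per line; if an installer directory shows up above your own build, that copy is the one being used.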
I'm now able to get IMB-MPI1 working. Is there a way to force this over EFA, or to verify which communication link it is using?
hpc@ip-172-31-27-107:~$ mpirun -np 2 --hostfile ~/hosts --mca iof_base_verbose 10 --mca pml_base_verbose 10 --mca odls_base_verbose 10 --mca pml ucx --mca btl ^uct IMB-MPI1 pingpong
[ip-172-31-27-107:03695] mca: base: components_register: registering framework odls components
[ip-172-31-27-107:03695] mca: base: components_register: found loaded component default
[ip-172-31-27-107:03695] mca: base: components_register: component default has no register or open function
[ip-172-31-27-107:03695] mca: base: components_register: found loaded component pspawn
[ip-172-31-27-107:03695] mca: base: components_register: component pspawn has no register or open function
[ip-172-31-27-107:03695] mca: base: components_open: opening odls components
[ip-172-31-27-107:03695] mca: base: components_open: found loaded component default
[ip-172-31-27-107:03695] mca: base: components_open: component default open function successful
[ip-172-31-27-107:03695] mca: base: components_open: found loaded component pspawn
[ip-172-31-27-107:03695] mca: base: components_open: component pspawn open function successful
[ip-172-31-27-107:03695] mca:base:select: Auto-selecting odls components
[ip-172-31-27-107:03695] mca:base:select:( odls) Querying component [default]
[ip-172-31-27-107:03695] mca:base:select:( odls) Query of component [default] set priority to 10
[ip-172-31-27-107:03695] mca:base:select:( odls) Querying component [pspawn]
[ip-172-31-27-107:03695] mca:base:select:( odls) Query of component [pspawn] set priority to 1
[ip-172-31-27-107:03695] mca:base:select:( odls) Selected component [default]
[ip-172-31-27-107:03695] mca: base: close: component pspawn closed
[ip-172-31-27-107:03695] mca: base: close: unloading component pspawn
[ip-172-31-27-107:03695] mca: base: components_register: registering framework iof components
[ip-172-31-27-107:03695] mca: base: components_register: found loaded component hnp
[ip-172-31-27-107:03695] mca: base: components_register: component hnp has no register or open function
[ip-172-31-27-107:03695] mca: base: components_register: found loaded component orted
[ip-172-31-27-107:03695] mca: base: components_register: component orted has no register or open function
[ip-172-31-27-107:03695] mca: base: components_register: found loaded component tool
[ip-172-31-27-107:03695] mca: base: components_register: component tool has no register or open function
[ip-172-31-27-107:03695] mca: base: components_open: opening iof components
[ip-172-31-27-107:03695] mca: base: components_open: found loaded component hnp
[ip-172-31-27-107:03695] mca: base: components_open: component hnp open function successful
[ip-172-31-27-107:03695] mca: base: components_open: found loaded component orted
[ip-172-31-27-107:03695] mca: base: components_open: component orted open function successful
[ip-172-31-27-107:03695] mca: base: components_open: found loaded component tool
[ip-172-31-27-107:03695] mca: base: components_open: component tool open function successful
[ip-172-31-27-107:03695] mca:base:select: Auto-selecting iof components
[ip-172-31-27-107:03695] mca:base:select:( iof) Querying component [hnp]
[ip-172-31-27-107:03695] mca:base:select:( iof) Query of component [hnp] set priority to 100
[ip-172-31-27-107:03695] mca:base:select:( iof) Querying component [orted]
[ip-172-31-27-107:03695] mca:base:select:( iof) Querying component [tool]
[ip-172-31-27-107:03695] mca:base:select:( iof) Selected component [hnp]
[ip-172-31-27-107:03695] mca: base: close: component orted closed
[ip-172-31-27-107:03695] mca: base: close: unloading component orted
[ip-172-31-27-107:03695] mca: base: close: component tool closed
[ip-172-31-27-107:03695] mca: base: close: unloading component tool
[ip-172-31-29-142:04539] mca: base: components_register: registering framework odls components
[ip-172-31-29-142:04539] mca: base: components_register: found loaded component default
[ip-172-31-29-142:04539] mca: base: components_register: component default has no register or open function
[ip-172-31-29-142:04539] mca: base: components_register: found loaded component pspawn
[ip-172-31-29-142:04539] mca: base: components_register: component pspawn has no register or open function
[ip-172-31-29-142:04539] mca: base: components_open: opening odls components
[ip-172-31-29-142:04539] mca: base: components_open: found loaded component default
[ip-172-31-29-142:04539] mca: base: components_open: component default open function successful
[ip-172-31-29-142:04539] mca: base: components_open: found loaded component pspawn
[ip-172-31-29-142:04539] mca: base: components_open: component pspawn open function successful
[ip-172-31-29-142:04539] mca:base:select: Auto-selecting odls components
[ip-172-31-29-142:04539] mca:base:select:( odls) Querying component [default]
[ip-172-31-29-142:04539] mca:base:select:( odls) Query of component [default] set priority to 10
[ip-172-31-29-142:04539] mca:base:select:( odls) Querying component [pspawn]
[ip-172-31-29-142:04539] mca:base:select:( odls) Query of component [pspawn] set priority to 1
[ip-172-31-29-142:04539] mca:base:select:( odls) Selected component [default]
[ip-172-31-29-142:04539] mca: base: close: component pspawn closed
[ip-172-31-29-142:04539] mca: base: close: unloading component pspawn
[ip-172-31-29-142:04539] mca: base: components_register: registering framework iof components
[ip-172-31-29-142:04539] mca: base: components_register: found loaded component hnp
[ip-172-31-29-142:04539] mca: base: components_register: component hnp has no register or open function
[ip-172-31-29-142:04539] mca: base: components_register: found loaded component orted
[ip-172-31-29-142:04539] mca: base: components_register: component orted has no register or open function
[ip-172-31-29-142:04539] mca: base: components_register: found loaded component tool
[ip-172-31-29-142:04539] mca: base: components_register: component tool has no register or open function
[ip-172-31-29-142:04539] mca: base: components_open: opening iof components
[ip-172-31-29-142:04539] mca: base: components_open: found loaded component hnp
[ip-172-31-29-142:04539] mca: base: components_open: component hnp open function successful
[ip-172-31-29-142:04539] mca: base: components_open: found loaded component orted
[ip-172-31-29-142:04539] mca: base: components_open: component orted open function successful
[ip-172-31-29-142:04539] mca: base: components_open: found loaded component tool
[ip-172-31-29-142:04539] mca: base: components_open: component tool open function successful
[ip-172-31-29-142:04539] mca:base:select: Auto-selecting iof components
[ip-172-31-29-142:04539] mca:base:select:( iof) Querying component [hnp]
[ip-172-31-29-142:04539] mca:base:select:( iof) Querying component [orted]
[ip-172-31-29-142:04539] mca:base:select:( iof) Query of component [orted] set priority to 80
[ip-172-31-29-142:04539] mca:base:select:( iof) Querying component [tool]
[ip-172-31-29-142:04539] mca:base:select:( iof) Selected component [orted]
[ip-172-31-29-142:04539] mca: base: close: component hnp closed
[ip-172-31-29-142:04539] mca: base: close: unloading component hnp
[ip-172-31-29-142:04539] mca: base: close: component tool closed
[ip-172-31-29-142:04539] mca: base: close: unloading component tool
[ip-172-31-27-107:03695] [[12194,0],0] local:launch
[ip-172-31-27-107:03695] [[12194,0],0] odls:dispatch [[12194,1],0] to thread 0
[ip-172-31-27-107:03695] [[12194,0],0] odls:dispatch [[12194,1],1] to thread 0
[ip-172-31-27-107:03695] [[12194,0],0] odls:launch spawning child [[12194,1],0]
[ip-172-31-29-142:04539] [[12194,0],1] local:launch
[ip-172-31-29-142:04539] [[12194,0],1] local:launch no local procs
[ip-172-31-27-107:03695] [[12194,0],0] odls:launch spawning child [[12194,1],1]
[ip-172-31-27-107:03702] mca: base: components_register: registering framework pml components
[ip-172-31-27-107:03702] mca: base: components_register: found loaded component ucx
[ip-172-31-27-107:03702] mca: base: components_register: component ucx register function successful
[ip-172-31-27-107:03702] mca: base: components_open: opening pml components
[ip-172-31-27-107:03702] mca: base: components_open: found loaded component ucx
[ip-172-31-27-107:03701] mca: base: components_register: registering framework pml components
[ip-172-31-27-107:03701] mca: base: components_register: found loaded component ucx
[ip-172-31-27-107:03701] mca: base: components_register: component ucx register function successful
[ip-172-31-27-107:03701] mca: base: components_open: opening pml components
[ip-172-31-27-107:03701] mca: base: components_open: found loaded component ucx
[ip-172-31-27-107:03702] mca: base: components_open: component ucx open function successful
[ip-172-31-27-107:03701] mca: base: components_open: component ucx open function successful
[ip-172-31-27-107:03702] select: initializing pml component ucx
[ip-172-31-27-107:03701] select: initializing pml component ucx
[ip-172-31-27-107:03702] select: init returned priority 51
[ip-172-31-27-107:03702] selected ucx best priority 51
[ip-172-31-27-107:03702] select: component ucx selected
[ip-172-31-27-107:03701] select: init returned priority 51
[ip-172-31-27-107:03701] selected ucx best priority 51
[ip-172-31-27-107:03701] select: component ucx selected
[ip-172-31-27-107:03701] check:select: modex not reqd
[ip-172-31-27-107:03702] check:select: modex not reqd
#------------------------------------------------------------
# Intel (R) MPI Benchmarks 2018 Update 1, MPI-1 part
#------------------------------------------------------------
# Date : Fri Jul 5 20:24:10 2019
# Machine : x86_64
# System : Linux
# Release : 4.15.0-1043-aws
# Version : #45-Ubuntu SMP Mon Jun 24 14:07:03 UTC 2019
# MPI Version : 3.1
# MPI Thread Environment:
# Calling sequence was:
# IMB-MPI1 pingpong
# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# PingPong
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 0.23 0.00
1 1000 0.23 4.38
2 1000 0.23 8.78
4 1000 0.23 17.55
8 1000 0.23 34.73
16 1000 0.23 69.30
32 1000 0.31 102.05
64 1000 0.28 227.42
128 1000 0.53 240.32
256 1000 0.41 621.92
512 1000 0.50 1029.21
1024 1000 0.61 1673.61
2048 1000 0.82 2511.38
4096 1000 1.12 3668.42
8192 1000 2.24 3664.41
16384 1000 3.15 5204.90
32768 1000 5.07 6457.65
65536 640 8.11 8079.29
131072 320 14.72 8902.27
262144 160 13.51 19406.80
524288 80 32.14 16313.14
1048576 40 74.66 14044.49
2097152 20 190.91 10985.10
4194304 10 410.44 10218.99
# All processes entering MPI_Finalize
[ip-172-31-27-107:03702] mca: base: close: component ucx closed
[ip-172-31-27-107:03702] mca: base: close: unloading component ucx
[ip-172-31-27-107:03701] mca: base: close: component ucx closed
[ip-172-31-27-107:03701] mca: base: close: unloading component ucx
[ip-172-31-27-107:03695] [[12194,0],0] odls:wait_local_proc child process [[12194,1],1] pid 3702 terminated
[ip-172-31-27-107:03695] [[12194,0],0] odls:wait_local_proc child process [[12194,1],0] pid 3701 terminated
[ip-172-31-29-142:04539] mca: base: close: component orted closed
[ip-172-31-29-142:04539] mca: base: close: unloading component orted
[ip-172-31-29-142:04539] mca: base: close: component default closed
[ip-172-31-29-142:04539] mca: base: close: unloading component default
[ip-172-31-27-107:03695] mca: base: close: component hnp closed
[ip-172-31-27-107:03695] mca: base: close: unloading component hnp
[ip-172-31-27-107:03695] mca: base: close: component default closed
[ip-172-31-27-107:03695] mca: base: close: unloading component default
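Most of the verbose output above is framework bookkeeping. The lines that matter for the "which link" question are the `odls:launch spawning child` lines (where the ranks landed) and the `select: component ... selected` lines (which PML was picked). In this run both children were spawned on ip-172-31-27-107 and the other host reported `no local procs`, so this PingPong never left the node. A rough filter, assuming the log format shown above (the function name `summarize_mpirun_log` is made up):

```shell
# Summarize an Open MPI verbose log read from stdin: for each host, how many
# application processes were spawned there and which PML component was selected.
summarize_mpirun_log() {
    awk '
        # Strip the leading "[" and the ":pid]" suffix, leaving the host name.
        function host(tag) {
            sub(/^\[/, "", tag)
            sub(/:.*/, "", tag)
            return tag
        }
        # Daemon lines: "[host:pid] [[job,0],1] local:launch ..."
        / local:launch/ { seen[host($1)] = 1 }
        # Spawn lines: "[host:pid] [[job,0],0] odls:launch spawning child [[job,1],0]"
        /odls:launch spawning child/ { h = host($1); seen[h] = 1; spawned[h]++ }
        # PML lines: "[host:pid] select: component ucx selected"
        / select: component .* selected/ { pml[host($1)] = $4 }
        END {
            for (h in seen)
                printf "%s spawned=%d pml=%s\n", h, spawned[h] + 0, ((h in pml) ? pml[h] : "none")
        }
    ' | sort
}
```

It can be fed directly from a run, e.g. `mpirun ... 2>&1 | summarize_mpirun_log`. A host with `spawned=0` means the benchmark is running on a single node, so the numbers say nothing about the network.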
Perhaps this forced something useful
mpirun -N 2 -hostfile ~/hosts --mca iof_base_verbose 10 --mca pml_base_verbose 10 --mca odls_base_verbose 10 --mca pml ucx --mca btl ^vader,tcp,openib,uct IMB-MPI1 pingpong
[ip-172-31-27-107:04070] mca: base: components_register: registering framework odls components
[ip-172-31-27-107:04070] mca: base: components_register: found loaded component default
[ip-172-31-27-107:04070] mca: base: components_register: component default has no register or open function
[ip-172-31-27-107:04070] mca: base: components_register: found loaded component pspawn
[ip-172-31-27-107:04070] mca: base: components_register: component pspawn has no register or open function
[ip-172-31-27-107:04070] mca: base: components_open: opening odls components
[ip-172-31-27-107:04070] mca: base: components_open: found loaded component default
[ip-172-31-27-107:04070] mca: base: components_open: component default open function successful
[ip-172-31-27-107:04070] mca: base: components_open: found loaded component pspawn
[ip-172-31-27-107:04070] mca: base: components_open: component pspawn open function successful
[ip-172-31-27-107:04070] mca:base:select: Auto-selecting odls components
[ip-172-31-27-107:04070] mca:base:select:( odls) Querying component [default]
[ip-172-31-27-107:04070] mca:base:select:( odls) Query of component [default] set priority to 10
[ip-172-31-27-107:04070] mca:base:select:( odls) Querying component [pspawn]
[ip-172-31-27-107:04070] mca:base:select:( odls) Query of component [pspawn] set priority to 1
[ip-172-31-27-107:04070] mca:base:select:( odls) Selected component [default]
[ip-172-31-27-107:04070] mca: base: close: component pspawn closed
[ip-172-31-27-107:04070] mca: base: close: unloading component pspawn
[ip-172-31-27-107:04070] mca: base: components_register: registering framework iof components
[ip-172-31-27-107:04070] mca: base: components_register: found loaded component hnp
[ip-172-31-27-107:04070] mca: base: components_register: component hnp has no register or open function
[ip-172-31-27-107:04070] mca: base: components_register: found loaded component orted
[ip-172-31-27-107:04070] mca: base: components_register: component orted has no register or open function
[ip-172-31-27-107:04070] mca: base: components_register: found loaded component tool
[ip-172-31-27-107:04070] mca: base: components_register: component tool has no register or open function
[ip-172-31-27-107:04070] mca: base: components_open: opening iof components
[ip-172-31-27-107:04070] mca: base: components_open: found loaded component hnp
[ip-172-31-27-107:04070] mca: base: components_open: component hnp open function successful
[ip-172-31-27-107:04070] mca: base: components_open: found loaded component orted
[ip-172-31-27-107:04070] mca: base: components_open: component orted open function successful
[ip-172-31-27-107:04070] mca: base: components_open: found loaded component tool
[ip-172-31-27-107:04070] mca: base: components_open: component tool open function successful
[ip-172-31-27-107:04070] mca:base:select: Auto-selecting iof components
[ip-172-31-27-107:04070] mca:base:select:( iof) Querying component [hnp]
[ip-172-31-27-107:04070] mca:base:select:( iof) Query of component [hnp] set priority to 100
[ip-172-31-27-107:04070] mca:base:select:( iof) Querying component [orted]
[ip-172-31-27-107:04070] mca:base:select:( iof) Querying component [tool]
[ip-172-31-27-107:04070] mca:base:select:( iof) Selected component [hnp]
[ip-172-31-27-107:04070] mca: base: close: component orted closed
[ip-172-31-27-107:04070] mca: base: close: unloading component orted
[ip-172-31-27-107:04070] mca: base: close: component tool closed
[ip-172-31-27-107:04070] mca: base: close: unloading component tool
[ip-172-31-29-142:05139] mca: base: components_register: registering framework odls components
[ip-172-31-29-142:05139] mca: base: components_register: found loaded component default
[ip-172-31-29-142:05139] mca: base: components_register: component default has no register or open function
[ip-172-31-29-142:05139] mca: base: components_register: found loaded component pspawn
[ip-172-31-29-142:05139] mca: base: components_register: component pspawn has no register or open function
[ip-172-31-29-142:05139] mca: base: components_open: opening odls components
[ip-172-31-29-142:05139] mca: base: components_open: found loaded component default
[ip-172-31-29-142:05139] mca: base: components_open: component default open function successful
[ip-172-31-29-142:05139] mca: base: components_open: found loaded component pspawn
[ip-172-31-29-142:05139] mca: base: components_open: component pspawn open function successful
[ip-172-31-29-142:05139] mca:base:select: Auto-selecting odls components
[ip-172-31-29-142:05139] mca:base:select:( odls) Querying component [default]
[ip-172-31-29-142:05139] mca:base:select:( odls) Query of component [default] set priority to 10
[ip-172-31-29-142:05139] mca:base:select:( odls) Querying component [pspawn]
[ip-172-31-29-142:05139] mca:base:select:( odls) Query of component [pspawn] set priority to 1
[ip-172-31-29-142:05139] mca:base:select:( odls) Selected component [default]
[ip-172-31-29-142:05139] mca: base: close: component pspawn closed
[ip-172-31-29-142:05139] mca: base: close: unloading component pspawn
[ip-172-31-29-142:05139] mca: base: components_register: registering framework iof components
[ip-172-31-29-142:05139] mca: base: components_register: found loaded component hnp
[ip-172-31-29-142:05139] mca: base: components_register: component hnp has no register or open function
[ip-172-31-29-142:05139] mca: base: components_register: found loaded component orted
[ip-172-31-29-142:05139] mca: base: components_register: component orted has no register or open function
[ip-172-31-29-142:05139] mca: base: components_register: found loaded component tool
[ip-172-31-29-142:05139] mca: base: components_register: component tool has no register or open function
[ip-172-31-29-142:05139] mca: base: components_open: opening iof components
[ip-172-31-29-142:05139] mca: base: components_open: found loaded component hnp
[ip-172-31-29-142:05139] mca: base: components_open: component hnp open function successful
[ip-172-31-29-142:05139] mca: base: components_open: found loaded component orted
[ip-172-31-29-142:05139] mca: base: components_open: component orted open function successful
[ip-172-31-29-142:05139] mca: base: components_open: found loaded component tool
[ip-172-31-29-142:05139] mca: base: components_open: component tool open function successful
[ip-172-31-29-142:05139] mca:base:select: Auto-selecting iof components
[ip-172-31-29-142:05139] mca:base:select:( iof) Querying component [hnp]
[ip-172-31-29-142:05139] mca:base:select:( iof) Querying component [orted]
[ip-172-31-29-142:05139] mca:base:select:( iof) Query of component [orted] set priority to 80
[ip-172-31-29-142:05139] mca:base:select:( iof) Querying component [tool]
[ip-172-31-29-142:05139] mca:base:select:( iof) Selected component [orted]
[ip-172-31-29-142:05139] mca: base: close: component hnp closed
[ip-172-31-29-142:05139] mca: base: close: unloading component hnp
[ip-172-31-29-142:05139] mca: base: close: component tool closed
[ip-172-31-29-142:05139] mca: base: close: unloading component tool
[ip-172-31-27-107:04070] [[11819,0],0] local:launch
[ip-172-31-27-107:04070] [[11819,0],0] odls:dispatch [[11819,1],0] to thread 0
[ip-172-31-27-107:04070] [[11819,0],0] odls:dispatch [[11819,1],1] to thread 0
[ip-172-31-27-107:04070] [[11819,0],0] odls:launch spawning child [[11819,1],0]
[ip-172-31-29-142:05139] [[11819,0],1] local:launch
[ip-172-31-29-142:05139] [[11819,0],1] odls:dispatch [[11819,1],2] to thread 0
[ip-172-31-29-142:05139] [[11819,0],1] odls:dispatch [[11819,1],3] to thread 0
[ip-172-31-29-142:05139] [[11819,0],1] odls:launch spawning child [[11819,1],2]
[ip-172-31-27-107:04070] [[11819,0],0] odls:launch spawning child [[11819,1],1]
[ip-172-31-29-142:05139] [[11819,0],1] odls:launch spawning child [[11819,1],3]
[ip-172-31-29-142:05143] mca: base: components_register: registering framework pml components
[ip-172-31-29-142:05143] mca: base: components_register: found loaded component ucx
[ip-172-31-29-142:05143] mca: base: components_register: component ucx register function successful
[ip-172-31-29-142:05143] mca: base: components_open: opening pml components
[ip-172-31-29-142:05143] mca: base: components_open: found loaded component ucx
[ip-172-31-29-142:05144] mca: base: components_register: registering framework pml components
[ip-172-31-29-142:05144] mca: base: components_register: found loaded component ucx
[ip-172-31-29-142:05144] mca: base: components_register: component ucx register function successful
[ip-172-31-29-142:05144] mca: base: components_open: opening pml components
[ip-172-31-29-142:05144] mca: base: components_open: found loaded component ucx
[ip-172-31-27-107:04077] mca: base: components_register: registering framework pml components
[ip-172-31-27-107:04077] mca: base: components_register: found loaded component ucx
[ip-172-31-27-107:04077] mca: base: components_register: component ucx register function successful
[ip-172-31-27-107:04077] mca: base: components_open: opening pml components
[ip-172-31-27-107:04077] mca: base: components_open: found loaded component ucx
[ip-172-31-27-107:04076] mca: base: components_register: registering framework pml components
[ip-172-31-27-107:04076] mca: base: components_register: found loaded component ucx
[ip-172-31-27-107:04076] mca: base: components_register: component ucx register function successful
[ip-172-31-27-107:04076] mca: base: components_open: opening pml components
[ip-172-31-27-107:04076] mca: base: components_open: found loaded component ucx
[ip-172-31-29-142:05143] mca: base: components_open: component ucx open function successful
[ip-172-31-27-107:04077] mca: base: components_open: component ucx open function successful
[ip-172-31-29-142:05144] mca: base: components_open: component ucx open function successful
[ip-172-31-29-142:05143] select: initializing pml component ucx
[ip-172-31-29-142:05144] select: initializing pml component ucx
[ip-172-31-27-107:04077] select: initializing pml component ucx
[ip-172-31-27-107:04077] select: init returned priority 51
[ip-172-31-27-107:04077] selected ucx best priority 51
[ip-172-31-27-107:04077] select: component ucx selected
[ip-172-31-29-142:05143] select: init returned priority 51
[ip-172-31-29-142:05143] selected ucx best priority 51
[ip-172-31-29-142:05143] select: component ucx selected
[ip-172-31-29-142:05144] select: init returned priority 51
[ip-172-31-29-142:05144] selected ucx best priority 51
[ip-172-31-29-142:05144] select: component ucx selected
[ip-172-31-27-107:04076] mca: base: components_open: component ucx open function successful
[ip-172-31-27-107:04076] select: initializing pml component ucx
[ip-172-31-27-107:04076] select: init returned priority 51
[ip-172-31-27-107:04076] selected ucx best priority 51
[ip-172-31-27-107:04076] select: component ucx selected
[ip-172-31-27-107:04077] check:select: modex not reqd
[ip-172-31-27-107:04076] check:select: modex not reqd
[ip-172-31-29-142:05143] check:select: modex not reqd
[ip-172-31-29-142:05144] check:select: modex not reqd
[ip-172-31-27-107:04076] check:select: modex not reqd
[ip-172-31-27-107:04077] check:select: modex not reqd
#------------------------------------------------------------
# Intel (R) MPI Benchmarks 2018 Update 1, MPI-1 part
#------------------------------------------------------------
# Date : Fri Jul 5 20:40:01 2019
# Machine : x86_64
# System : Linux
# Release : 4.15.0-1043-aws
# Version : #45-Ubuntu SMP Mon Jun 24 14:07:03 UTC 2019
# MPI Version : 3.1
# MPI Thread Environment:
# Calling sequence was:
# IMB-MPI1 pingpong
# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# PingPong
[ip-172-31-29-142:05144] check:select: modex not reqd
[ip-172-31-29-142:05143] check:select: modex not reqd
[ip-172-31-29-142:05144] check:select: modex not reqd
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
# ( 2 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 1.32 0.00
1 1000 1.28 0.78
2 1000 1.28 1.56
4 1000 1.31 3.06
8 1000 1.30 6.15
16 1000 1.32 12.15
32 1000 1.51 21.16
64 1000 1.41 45.53
128 1000 1.64 78.20
256 1000 1.89 135.15
512 1000 1.98 258.09
1024 1000 2.18 469.31
2048 1000 2.60 788.44
4096 1000 3.25 1262.04
8192 1000 7.35 1115.04
16384 1000 9.17 1787.11
32768 1000 14.94 2193.34
65536 640 26.18 2503.31
131072 320 49.75 2634.86
[1562359201.861676] [ip-172-31-27-107:4077 :0] cma_ep.c:113 UCX ERROR process_vm_readv delivered 0 instead of 262144, error message Operation not permitted
CTRL-C
^C[ip-172-31-29-142:05139] mca: base: close: component orted closed
[ip-172-31-29-142:05139] mca: base: close: unloading component orted
[ip-172-31-29-142:05139] mca: base: close: component default closed
[ip-172-31-29-142:05139] mca: base: close: unloading component default
[ip-172-31-27-107:04070] mca: base: close: component hnp closed
[ip-172-31-27-107:04070] mca: base: close: unloading component hnp
[ip-172-31-27-107:04070] mca: base: close: component default closed
[ip-172-31-27-107:04070] mca: base: close: unloading component default
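A note on the `process_vm_readv ... Operation not permitted` error above: it is raised from `cma_ep.c`, i.e. UCX's cross-memory-attach path between two processes on the same node. On Linux, `process_vm_readv` is subject to the same permission check as a ptrace attach, and Ubuntu enables the Yama `ptrace_scope` restriction by default. Whether that is the cause here is an assumption; nothing in this thread confirms it. A small sketch to check the setting (the function name is made up, and the optional argument only exists so the path can be overridden):

```shell
# Print the Yama ptrace_scope value and what it means for one process
# attaching to (or reading the memory of) another process of the same user.
# Usage: describe_ptrace_scope [FILE]
describe_ptrace_scope() {
    file=${1:-/proc/sys/kernel/yama/ptrace_scope}
    if [ ! -r "$file" ]; then
        echo "ptrace_scope not available (Yama not enabled?)"
        return 0
    fi
    scope=$(cat "$file")
    case $scope in
        0) echo "ptrace_scope=0: same-user processes may attach to each other" ;;
        1) echo "ptrace_scope=1: a process may only attach to its descendants unless the target opts in" ;;
        2) echo "ptrace_scope=2: only processes with CAP_SYS_PTRACE may attach" ;;
        3) echo "ptrace_scope=3: attaching is disabled" ;;
        *) echo "ptrace_scope=$scope: unrecognized value" ;;
    esac
}
```

If this reports a value of 1 or higher on the node that printed the error, that would at least be consistent with the failure appearing once messages got large enough to use the cross-memory path.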
If I understand correctly, EFA is accessed via libfabric (i.e. not ucx).
To force libfabric (aka ofi), you can run:
mpirun --mca pml cm --mca mtl ofi ...
If that does not work, you can also try:
mpirun --mca pml ob1 --mca btl self,ofi ...
I would expect the first command to give better performance.
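Before forcing the ofi components, it may be worth checking that libfabric actually reports an EFA provider and that this Open MPI build contains ofi components at all. The sketch below uses `fi_info` (shipped with libfabric) and `ompi_info` (shipped with Open MPI). The function name `check_ofi_ready` is made up, and the grep pattern assumes the usual `MCA mtl: ofi` formatting of `ompi_info` output.

```shell
# Report whether libfabric exposes an EFA provider and whether this
# Open MPI build lists an ofi MTL or BTL component.
# Prints one line per tool and returns non-zero if either check fails.
check_ofi_ready() {
    status=0
    if ! command -v fi_info >/dev/null 2>&1; then
        echo "fi_info: not found on PATH"
        status=1
    elif fi_info -p efa >/dev/null 2>&1; then
        echo "fi_info: efa provider available"
    else
        echo "fi_info: no efa provider reported"
        status=1
    fi
    if ! command -v ompi_info >/dev/null 2>&1; then
        echo "ompi_info: not found on PATH"
        status=1
    elif ompi_info 2>/dev/null | grep -Eq 'MCA (mtl|btl): ofi'; then
        echo "ompi_info: ofi component(s) present"
    else
        echo "ompi_info: no ofi mtl/btl component listed"
        status=1
    fi
    return $status
}
```

Run it on every node in the hostfile. If either line reports a problem, the `--mca pml cm --mca mtl ofi` command cannot be expected to reach EFA on that node.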
Another step forward, hopefully with helpful info:
For the first suggestion, the process hangs.
hpc@ip-172-31-23-57:~$ mpirun -np 2 --hostfile ~/hosts --mca pml cm --mca mtl ofi --mca iof_base_verbose 10 --mca pml_base_verbose 10 --mca odls_base_verbose 10 IMB-MPI1
[ip-172-31-23-57:05709] mca: base: components_register: registering framework odls components
[ip-172-31-23-57:05709] mca: base: components_register: found loaded component default
[ip-172-31-23-57:05709] mca: base: components_register: component default has no register or open function
[ip-172-31-23-57:05709] mca: base: components_register: found loaded component pspawn
[ip-172-31-23-57:05709] mca: base: components_register: component pspawn has no register or open function
[ip-172-31-23-57:05709] mca: base: components_open: opening odls components
[ip-172-31-23-57:05709] mca: base: components_open: found loaded component default
[ip-172-31-23-57:05709] mca: base: components_open: component default open function successful
[ip-172-31-23-57:05709] mca: base: components_open: found loaded component pspawn
[ip-172-31-23-57:05709] mca: base: components_open: component pspawn open function successful
[ip-172-31-23-57:05709] mca:base:select: Auto-selecting odls components
[ip-172-31-23-57:05709] mca:base:select:( odls) Querying component [default]
[ip-172-31-23-57:05709] mca:base:select:( odls) Query of component [default] set priority to 10
[ip-172-31-23-57:05709] mca:base:select:( odls) Querying component [pspawn]
[ip-172-31-23-57:05709] mca:base:select:( odls) Query of component [pspawn] set priority to 1
[ip-172-31-23-57:05709] mca:base:select:( odls) Selected component [default]
[ip-172-31-23-57:05709] mca: base: close: component pspawn closed
[ip-172-31-23-57:05709] mca: base: close: unloading component pspawn
[ip-172-31-23-57:05709] mca: base: components_register: registering framework iof components
[ip-172-31-23-57:05709] mca: base: components_register: found loaded component hnp
[ip-172-31-23-57:05709] mca: base: components_register: component hnp has no register or open function
[ip-172-31-23-57:05709] mca: base: components_register: found loaded component orted
[ip-172-31-23-57:05709] mca: base: components_register: component orted has no register or open function
[ip-172-31-23-57:05709] mca: base: components_register: found loaded component tool
[ip-172-31-23-57:05709] mca: base: components_register: component tool has no register or open function
[ip-172-31-23-57:05709] mca: base: components_open: opening iof components
[ip-172-31-23-57:05709] mca: base: components_open: found loaded component hnp
[ip-172-31-23-57:05709] mca: base: components_open: component hnp open function successful
[ip-172-31-23-57:05709] mca: base: components_open: found loaded component orted
[ip-172-31-23-57:05709] mca: base: components_open: component orted open function successful
[ip-172-31-23-57:05709] mca: base: components_open: found loaded component tool
[ip-172-31-23-57:05709] mca: base: components_open: component tool open function successful
[ip-172-31-23-57:05709] mca:base:select: Auto-selecting iof components
[ip-172-31-23-57:05709] mca:base:select:( iof) Querying component [hnp]
[ip-172-31-23-57:05709] mca:base:select:( iof) Query of component [hnp] set priority to 100
[ip-172-31-23-57:05709] mca:base:select:( iof) Querying component [orted]
[ip-172-31-23-57:05709] mca:base:select:( iof) Querying component [tool]
[ip-172-31-23-57:05709] mca:base:select:( iof) Selected component [hnp]
[ip-172-31-23-57:05709] mca: base: close: component orted closed
[ip-172-31-23-57:05709] mca: base: close: unloading component orted
[ip-172-31-23-57:05709] mca: base: close: component tool closed
[ip-172-31-23-57:05709] mca: base: close: unloading component tool
[ip-172-31-18-76:05899] mca: base: components_register: registering framework odls components
[ip-172-31-18-76:05899] mca: base: components_register: found loaded component default
[ip-172-31-18-76:05899] mca: base: components_register: component default has no register or open function
[ip-172-31-18-76:05899] mca: base: components_register: found loaded component pspawn
[ip-172-31-18-76:05899] mca: base: components_register: component pspawn has no register or open function
[ip-172-31-18-76:05899] mca: base: components_open: opening odls components
[ip-172-31-18-76:05899] mca: base: components_open: found loaded component default
[ip-172-31-18-76:05899] mca: base: components_open: component default open function successful
[ip-172-31-18-76:05899] mca: base: components_open: found loaded component pspawn
[ip-172-31-18-76:05899] mca: base: components_open: component pspawn open function successful
[ip-172-31-18-76:05899] mca:base:select: Auto-selecting odls components
[ip-172-31-18-76:05899] mca:base:select:( odls) Querying component [default]
[ip-172-31-18-76:05899] mca:base:select:( odls) Query of component [default] set priority to 10
[ip-172-31-18-76:05899] mca:base:select:( odls) Querying component [pspawn]
[ip-172-31-18-76:05899] mca:base:select:( odls) Query of component [pspawn] set priority to 1
[ip-172-31-18-76:05899] mca:base:select:( odls) Selected component [default]
[ip-172-31-18-76:05899] mca: base: close: component pspawn closed
[ip-172-31-18-76:05899] mca: base: close: unloading component pspawn
[ip-172-31-18-76:05899] mca: base: components_register: registering framework iof components
[ip-172-31-18-76:05899] mca: base: components_register: found loaded component hnp
[ip-172-31-18-76:05899] mca: base: components_register: component hnp has no register or open function
[ip-172-31-18-76:05899] mca: base: components_register: found loaded component orted
[ip-172-31-18-76:05899] mca: base: components_register: component orted has no register or open function
[ip-172-31-18-76:05899] mca: base: components_register: found loaded component tool
[ip-172-31-18-76:05899] mca: base: components_register: component tool has no register or open function
[ip-172-31-18-76:05899] mca: base: components_open: opening iof components
[ip-172-31-18-76:05899] mca: base: components_open: found loaded component hnp
[ip-172-31-18-76:05899] mca: base: components_open: component hnp open function successful
[ip-172-31-18-76:05899] mca: base: components_open: found loaded component orted
[ip-172-31-18-76:05899] mca: base: components_open: component orted open function successful
[ip-172-31-18-76:05899] mca: base: components_open: found loaded component tool
[ip-172-31-18-76:05899] mca: base: components_open: component tool open function successful
[ip-172-31-18-76:05899] mca:base:select: Auto-selecting iof components
[ip-172-31-18-76:05899] mca:base:select:( iof) Querying component [hnp]
[ip-172-31-18-76:05899] mca:base:select:( iof) Querying component [orted]
[ip-172-31-18-76:05899] mca:base:select:( iof) Query of component [orted] set priority to 80
[ip-172-31-18-76:05899] mca:base:select:( iof) Querying component [tool]
[ip-172-31-18-76:05899] mca:base:select:( iof) Selected component [orted]
[ip-172-31-18-76:05899] mca: base: close: component hnp closed
[ip-172-31-18-76:05899] mca: base: close: unloading component hnp
[ip-172-31-18-76:05899] mca: base: close: component tool closed
[ip-172-31-18-76:05899] mca: base: close: unloading component tool
[ip-172-31-23-57:05709] [[2541,0],0] local:launch
[ip-172-31-23-57:05709] [[2541,0],0] odls:dispatch [[2541,1],0] to thread 0
[ip-172-31-23-57:05709] [[2541,0],0] odls:dispatch [[2541,1],1] to thread 0
[ip-172-31-23-57:05709] [[2541,0],0] odls:launch spawning child [[2541,1],0]
[ip-172-31-18-76:05899] [[2541,0],1] local:launch
[ip-172-31-18-76:05899] [[2541,0],1] local:launch no local procs
[ip-172-31-23-57:05709] [[2541,0],0] odls:launch spawning child [[2541,1],1]
[ip-172-31-23-57:05716] mca: base: components_register: registering framework pml components
[ip-172-31-23-57:05716] mca: base: components_register: found loaded component cm
[ip-172-31-23-57:05716] mca: base: components_register: component cm register function successful
[ip-172-31-23-57:05716] mca: base: components_open: opening pml components
[ip-172-31-23-57:05716] mca: base: components_open: found loaded component cm
[ip-172-31-23-57:05716] mca: base: components_open: component cm open function successful
[ip-172-31-23-57:05716] select: initializing pml component cm
[ip-172-31-23-57:05715] mca: base: components_register: registering framework pml components
[ip-172-31-23-57:05715] mca: base: components_register: found loaded component cm
[ip-172-31-23-57:05715] mca: base: components_register: component cm register function successful
[ip-172-31-23-57:05715] mca: base: components_open: opening pml components
[ip-172-31-23-57:05715] mca: base: components_open: found loaded component cm
[ip-172-31-23-57:05715] mca: base: components_open: component cm open function successful
[ip-172-31-23-57:05715] select: initializing pml component cm
[ip-172-31-23-57:05716] select: init returned priority 25
[ip-172-31-23-57:05716] selected cm best priority 25
[ip-172-31-23-57:05716] select: component cm selected
[ip-172-31-23-57:05715] select: init returned priority 25
[ip-172-31-23-57:05715] selected cm best priority 25
[ip-172-31-23-57:05715] select: component cm selected
[ip-172-31-23-57:05716] check:select: modex not reqd
[ip-172-31-23-57:05715] check:select: modex not reqd
^C
[ip-172-31-18-76:05899] mca: base: close: component orted closed
[ip-172-31-18-76:05899] mca: base: close: unloading component orted
[ip-172-31-18-76:05899] mca: base: close: component default closed
[ip-172-31-18-76:05899] mca: base: close: unloading component default
[ip-172-31-23-57:05709] mca: base: close: component hnp closed
[ip-172-31-23-57:05709] mca: base: close: unloading component hnp
[ip-172-31-23-57:05709] mca: base: close: component default closed
[ip-172-31-23-57:05709] mca: base: close: unloading component default
Second command errors out:
hpc@ip-172-31-23-57:~$ mpirun -np 2 --hostfile ~/hosts --mca pml ob1 --mca btl self,ofi --mca iof_base_verbose 10 --mca pml_base_verbose 10 --mca odls_base_verbose 10 IMB-MPI1
[ip-172-31-23-57:05749] mca: base: components_register: registering framework odls components
[ip-172-31-23-57:05749] mca: base: components_register: found loaded component default
[ip-172-31-23-57:05749] mca: base: components_register: component default has no register or open function
[ip-172-31-23-57:05749] mca: base: components_register: found loaded component pspawn
[ip-172-31-23-57:05749] mca: base: components_register: component pspawn has no register or open function
[ip-172-31-23-57:05749] mca: base: components_open: opening odls components
[ip-172-31-23-57:05749] mca: base: components_open: found loaded component default
[ip-172-31-23-57:05749] mca: base: components_open: component default open function successful
[ip-172-31-23-57:05749] mca: base: components_open: found loaded component pspawn
[ip-172-31-23-57:05749] mca: base: components_open: component pspawn open function successful
[ip-172-31-23-57:05749] mca:base:select: Auto-selecting odls components
[ip-172-31-23-57:05749] mca:base:select:( odls) Querying component [default]
[ip-172-31-23-57:05749] mca:base:select:( odls) Query of component [default] set priority to 10
[ip-172-31-23-57:05749] mca:base:select:( odls) Querying component [pspawn]
[ip-172-31-23-57:05749] mca:base:select:( odls) Query of component [pspawn] set priority to 1
[ip-172-31-23-57:05749] mca:base:select:( odls) Selected component [default]
[ip-172-31-23-57:05749] mca: base: close: component pspawn closed
[ip-172-31-23-57:05749] mca: base: close: unloading component pspawn
[ip-172-31-23-57:05749] mca: base: components_register: registering framework iof components
[ip-172-31-23-57:05749] mca: base: components_register: found loaded component hnp
[ip-172-31-23-57:05749] mca: base: components_register: component hnp has no register or open function
[ip-172-31-23-57:05749] mca: base: components_register: found loaded component orted
[ip-172-31-23-57:05749] mca: base: components_register: component orted has no register or open function
[ip-172-31-23-57:05749] mca: base: components_register: found loaded component tool
[ip-172-31-23-57:05749] mca: base: components_register: component tool has no register or open function
[ip-172-31-23-57:05749] mca: base: components_open: opening iof components
[ip-172-31-23-57:05749] mca: base: components_open: found loaded component hnp
[ip-172-31-23-57:05749] mca: base: components_open: component hnp open function successful
[ip-172-31-23-57:05749] mca: base: components_open: found loaded component orted
[ip-172-31-23-57:05749] mca: base: components_open: component orted open function successful
[ip-172-31-23-57:05749] mca: base: components_open: found loaded component tool
[ip-172-31-23-57:05749] mca: base: components_open: component tool open function successful
[ip-172-31-23-57:05749] mca:base:select: Auto-selecting iof components
[ip-172-31-23-57:05749] mca:base:select:( iof) Querying component [hnp]
[ip-172-31-23-57:05749] mca:base:select:( iof) Query of component [hnp] set priority to 100
[ip-172-31-23-57:05749] mca:base:select:( iof) Querying component [orted]
[ip-172-31-23-57:05749] mca:base:select:( iof) Querying component [tool]
[ip-172-31-23-57:05749] mca:base:select:( iof) Selected component [hnp]
[ip-172-31-23-57:05749] mca: base: close: component orted closed
[ip-172-31-23-57:05749] mca: base: close: unloading component orted
[ip-172-31-23-57:05749] mca: base: close: component tool closed
[ip-172-31-23-57:05749] mca: base: close: unloading component tool
[ip-172-31-18-76:05992] mca: base: components_register: registering framework odls components
[ip-172-31-18-76:05992] mca: base: components_register: found loaded component default
[ip-172-31-18-76:05992] mca: base: components_register: component default has no register or open function
[ip-172-31-18-76:05992] mca: base: components_register: found loaded component pspawn
[ip-172-31-18-76:05992] mca: base: components_register: component pspawn has no register or open function
[ip-172-31-18-76:05992] mca: base: components_open: opening odls components
[ip-172-31-18-76:05992] mca: base: components_open: found loaded component default
[ip-172-31-18-76:05992] mca: base: components_open: component default open function successful
[ip-172-31-18-76:05992] mca: base: components_open: found loaded component pspawn
[ip-172-31-18-76:05992] mca: base: components_open: component pspawn open function successful
[ip-172-31-18-76:05992] mca:base:select: Auto-selecting odls components
[ip-172-31-18-76:05992] mca:base:select:( odls) Querying component [default]
[ip-172-31-18-76:05992] mca:base:select:( odls) Query of component [default] set priority to 10
[ip-172-31-18-76:05992] mca:base:select:( odls) Querying component [pspawn]
[ip-172-31-18-76:05992] mca:base:select:( odls) Query of component [pspawn] set priority to 1
[ip-172-31-18-76:05992] mca:base:select:( odls) Selected component [default]
[ip-172-31-18-76:05992] mca: base: close: component pspawn closed
[ip-172-31-18-76:05992] mca: base: close: unloading component pspawn
[ip-172-31-18-76:05992] mca: base: components_register: registering framework iof components
[ip-172-31-18-76:05992] mca: base: components_register: found loaded component hnp
[ip-172-31-18-76:05992] mca: base: components_register: component hnp has no register or open function
[ip-172-31-18-76:05992] mca: base: components_register: found loaded component orted
[ip-172-31-18-76:05992] mca: base: components_register: component orted has no register or open function
[ip-172-31-18-76:05992] mca: base: components_register: found loaded component tool
[ip-172-31-18-76:05992] mca: base: components_register: component tool has no register or open function
[ip-172-31-18-76:05992] mca: base: components_open: opening iof components
[ip-172-31-18-76:05992] mca: base: components_open: found loaded component hnp
[ip-172-31-18-76:05992] mca: base: components_open: component hnp open function successful
[ip-172-31-18-76:05992] mca: base: components_open: found loaded component orted
[ip-172-31-18-76:05992] mca: base: components_open: component orted open function successful
[ip-172-31-18-76:05992] mca: base: components_open: found loaded component tool
[ip-172-31-18-76:05992] mca: base: components_open: component tool open function successful
[ip-172-31-18-76:05992] mca:base:select: Auto-selecting iof components
[ip-172-31-18-76:05992] mca:base:select:( iof) Querying component [hnp]
[ip-172-31-18-76:05992] mca:base:select:( iof) Querying component [orted]
[ip-172-31-18-76:05992] mca:base:select:( iof) Query of component [orted] set priority to 80
[ip-172-31-18-76:05992] mca:base:select:( iof) Querying component [tool]
[ip-172-31-18-76:05992] mca:base:select:( iof) Selected component [orted]
[ip-172-31-18-76:05992] mca: base: close: component hnp closed
[ip-172-31-18-76:05992] mca: base: close: unloading component hnp
[ip-172-31-18-76:05992] mca: base: close: component tool closed
[ip-172-31-18-76:05992] mca: base: close: unloading component tool
[ip-172-31-23-57:05749] [[2517,0],0] local:launch
[ip-172-31-23-57:05749] [[2517,0],0] odls:dispatch [[2517,1],0] to thread 0
[ip-172-31-23-57:05749] [[2517,0],0] odls:dispatch [[2517,1],1] to thread 0
[ip-172-31-23-57:05749] [[2517,0],0] odls:launch spawning child [[2517,1],0]
[ip-172-31-18-76:05992] [[2517,0],1] local:launch
[ip-172-31-18-76:05992] [[2517,0],1] local:launch no local procs
[ip-172-31-23-57:05749] [[2517,0],0] odls:launch spawning child [[2517,1],1]
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.
Host: ip-172-31-23-57
Framework: btl
Component: ofi
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
mca_bml_base_open() failed
--> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
[ip-172-31-23-57:05756] *** An error occurred in MPI_Init
[ip-172-31-23-57:05756] *** reported by process [164954113,1]
[ip-172-31-23-57:05756] *** on a NULL communicator
[ip-172-31-23-57:05756] *** Unknown error
[ip-172-31-23-57:05756] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-172-31-23-57:05756] *** and potentially your MPI job)
[ip-172-31-18-76:05992] mca: base: close: component orted closed
[ip-172-31-18-76:05992] mca: base: close: unloading component orted
[ip-172-31-18-76:05992] mca: base: close: component default closed
[ip-172-31-18-76:05992] mca: base: close: unloading component default
[ip-172-31-23-57:05749] 1 more process has sent help message help-mca-base.txt / find-available:not-valid
[ip-172-31-23-57:05749] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[ip-172-31-23-57:05749] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[ip-172-31-23-57:05749] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
[ip-172-31-23-57:05749] mca: base: close: component hnp closed
[ip-172-31-23-57:05749] mca: base: close: unloading component hnp
[ip-172-31-23-57:05749] mca: base: close: component default closed
[ip-172-31-23-57:05749] mca: base: close: unloading component default
Here is the output from the configure prior to compiling OpenMPI:
./configure --prefix=/usr --enable-static --enable-shared --with-cuda=/usr/include --with-libfabric=/opt/amazon/efa
Open MPI configuration:
-----------------------
Version: 4.0.1
Build MPI C bindings: yes
Build MPI C++ bindings (deprecated): no
Build MPI Fortran bindings: mpif.h, use mpi, use mpi_f08
MPI Build Java bindings (experimental): no
Build Open SHMEM support: yes
Debug build: no
Platform file: (none)
Miscellaneous
-----------------------
CUDA support: yes
HWLOC support: internal
Libevent support: internal
PMIx support: Internal
Transports
-----------------------
Cisco usNIC: no
Cray uGNI (Gemini/Aries): no
Intel Omnipath (PSM2): no
Intel TrueScale (PSM): no
Mellanox MXM: no
Open UCX: yes
OpenFabrics OFI Libfabric: yes
OpenFabrics Verbs: yes
Portals4: no
Shared memory/copy in+copy out: yes
Shared memory/Linux CMA: yes
Shared memory/Linux KNEM: no
Shared memory/XPMEM: no
TCP: yes
Resource Managers
-----------------------
Cray Alps: no
Grid Engine: no
LSF: no
Moab: no
Slurm: yes
ssh/rsh: yes
Torque: no
OMPIO File Systems
-----------------------
Generic Unix FS: yes
Lustre: no
PVFS2/OrangeFS: no
btl/ofi is in the master branch but not in the v4.0.x branch, so the second command cannot work.
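A quick way to check whether a given component was built into the local Open MPI install is to look for it in the ompi_info listing. The helper below is only a sketch: the name ompi_has_component is made up for illustration, and it assumes ompi_info prints component lines of the form "MCA btl: self (MCA v2.1.0, ...)".

```shell
# Hypothetical helper: reads `ompi_info` output on stdin and succeeds
# if the given framework/component pair appears in the listing.
# Assumes component lines look like "MCA btl: self (MCA v2.1.0, ...)".
ompi_has_component() {
    framework=$1
    component=$2
    grep -Eq "MCA[[:space:]]+${framework}:[[:space:]]+${component}([[:space:]]|\$)"
}
```

Usage would be something like: ompi_info | ompi_has_component btl ofi && echo "btl/ofi is available"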
What if you run the following?
mpirun --mca pml cm --mca mtl ofi --mca pml_base_verbose 10 --mca mtl_base_verbose 10 ...
Hopefully, this will provide some more helpful logs.
Thanks for your continued help :)
Here was the output:
hpc@ip-172-31-22-242:~$ mpirun -np 2 --hostfile ~/hosts --mca pml cm --mca mtl ofi --mca pml_base_verbose 10 --mca mtl_base_verbose 10 IMB-MPI1
[ip-172-31-22-242:05353] mca: base: components_register: registering framework pml components
[ip-172-31-22-242:05353] mca: base: components_register: found loaded component cm
[ip-172-31-22-242:05353] mca: base: components_register: component cm register function successful
[ip-172-31-22-242:05353] mca: base: components_open: opening pml components
[ip-172-31-22-242:05353] mca: base: components_open: found loaded component cm
[ip-172-31-22-242:05353] mca: base: components_register: registering framework mtl components
[ip-172-31-22-242:05353] mca: base: components_register: found loaded component ofi
[ip-172-31-22-242:05353] mca: base: components_register: component ofi register function successful
[ip-172-31-22-242:05353] mca: base: components_open: opening mtl components
[ip-172-31-22-242:05353] mca: base: components_open: found loaded component ofi
[ip-172-31-22-242:05353] mca: base: components_open: component ofi open function successful
[ip-172-31-22-242:05353] mca: base: components_open: component cm open function successful
[ip-172-31-22-242:05353] select: initializing pml component cm
[ip-172-31-22-242:05353] mca:base:select: Auto-selecting mtl components
[ip-172-31-22-242:05353] mca:base:select:( mtl) Querying component [ofi]
[ip-172-31-22-242:05353] mca:base:select:( mtl) Query of component [ofi] set priority to 25
[ip-172-31-22-242:05353] mca:base:select:( mtl) Selected component [ofi]
[ip-172-31-22-242:05353] select: initializing mtl component ofi
[ip-172-31-22-242:05353] mtl_ofi_component.c:315: mtl:ofi:provider_include = "(null)"
[ip-172-31-22-242:05353] mtl_ofi_component.c:318: mtl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream"
[ip-172-31-22-242:05353] mtl_ofi_component.c:336: mtl:ofi: "tcp;ofi_rxm" in exclude list
[ip-172-31-22-242:05353] mtl_ofi_component.c:336: mtl:ofi: "tcp;ofi_rxm" in exclude list
[ip-172-31-22-242:05353] mtl_ofi_component.c:336: mtl:ofi: "tcp;ofi_rxm" in exclude list
[ip-172-31-22-242:05353] mtl_ofi_component.c:336: mtl:ofi: "tcp;ofi_rxm" in exclude list
[ip-172-31-22-242:05353] mtl_ofi_component.c:347: mtl:ofi:prov: efa;ofi_rxr
[ip-172-31-22-242:05352] mca: base: components_register: registering framework pml components
[ip-172-31-22-242:05352] mca: base: components_register: found loaded component cm
[ip-172-31-22-242:05352] mca: base: components_register: component cm register function successful
[ip-172-31-22-242:05352] mca: base: components_open: opening pml components
[ip-172-31-22-242:05352] mca: base: components_open: found loaded component cm
[ip-172-31-22-242:05352] mca: base: components_register: registering framework mtl components
[ip-172-31-22-242:05352] mca: base: components_register: found loaded component ofi
[ip-172-31-22-242:05352] mca: base: components_register: component ofi register function successful
[ip-172-31-22-242:05352] mca: base: components_open: opening mtl components
[ip-172-31-22-242:05352] mca: base: components_open: found loaded component ofi
[ip-172-31-22-242:05352] mca: base: components_open: component ofi open function successful
[ip-172-31-22-242:05352] mca: base: components_open: component cm open function successful
[ip-172-31-22-242:05352] select: initializing pml component cm
[ip-172-31-22-242:05352] mca:base:select: Auto-selecting mtl components
[ip-172-31-22-242:05352] mca:base:select:( mtl) Querying component [ofi]
[ip-172-31-22-242:05352] mca:base:select:( mtl) Query of component [ofi] set priority to 25
[ip-172-31-22-242:05352] mca:base:select:( mtl) Selected component [ofi]
[ip-172-31-22-242:05352] select: initializing mtl component ofi
[ip-172-31-22-242:05352] mtl_ofi_component.c:315: mtl:ofi:provider_include = "(null)"
[ip-172-31-22-242:05352] mtl_ofi_component.c:318: mtl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream"
[ip-172-31-22-242:05352] mtl_ofi_component.c:336: mtl:ofi: "tcp;ofi_rxm" in exclude list
[ip-172-31-22-242:05352] mtl_ofi_component.c:336: mtl:ofi: "tcp;ofi_rxm" in exclude list
[ip-172-31-22-242:05352] mtl_ofi_component.c:336: mtl:ofi: "tcp;ofi_rxm" in exclude list
[ip-172-31-22-242:05352] mtl_ofi_component.c:336: mtl:ofi: "tcp;ofi_rxm" in exclude list
[ip-172-31-22-242:05352] mtl_ofi_component.c:347: mtl:ofi:prov: efa;ofi_rxr
[ip-172-31-22-242:05353] select: init returned success
[ip-172-31-22-242:05353] select: component ofi selected
[ip-172-31-22-242:05353] select: init returned priority 25
[ip-172-31-22-242:05353] selected cm best priority 25
[ip-172-31-22-242:05353] select: component cm selected
[ip-172-31-22-242:05352] select: init returned success
[ip-172-31-22-242:05352] select: component ofi selected
[ip-172-31-22-242:05352] select: init returned priority 25
[ip-172-31-22-242:05352] selected cm best priority 25
[ip-172-31-22-242:05352] select: component cm selected
[ip-172-31-22-242:05353] check:select: modex not reqd
[ip-172-31-22-242:05352] check:select: modex not reqd
^C
From your logs, it looks like the EFA provider is getting picked up just fine:
[ip-172-31-22-242:05352] mtl_ofi_component.c:347: mtl:ofi:prov: efa;ofi_rxr
[ip-172-31-22-242:05353] select: init returned success
[ip-172-31-22-242:05353] select: component ofi selected
Can you verify that your security groups are configured for EFA use correctly? See the documentation here. More specifically:
An EFA requires a security group that allows all inbound and outbound traffic to and from the security group itself.
Make sure you have both an ingress and an egress rule allowing traffic to itself (having 0.0.0.0/0 will not suffice; you need an explicit rule to allow traffic to itself with sg-xxxxxx).
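For anyone setting this up from the command line, the self-referencing rules could be sketched as below. This is an assumption-laden example, not taken from the thread: the function name efa_sg_self_rules is hypothetical, the security group ID is a placeholder, and it only prints the AWS CLI commands (using the --ip-permissions shorthand with UserIdGroupPairs) so they can be reviewed before running.

```shell
# Hypothetical helper: prints (does not run) the AWS CLI commands that
# add an all-traffic ingress and egress rule from a security group to
# itself. $1 is the security group ID (placeholder such as sg-xxxxxx).
efa_sg_self_rules() {
    sg=$1
    for direction in ingress egress; do
        printf "aws ec2 authorize-security-group-%s --group-id %s --ip-permissions 'IpProtocol=-1,UserIdGroupPairs=[{GroupId=%s}]'\n" \
            "$direction" "$sg" "$sg"
    done
}
```

Running efa_sg_self_rules sg-xxxxxx prints one command per direction; check them against the current AWS documentation before executing.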
@rajachan , I had the ingress rule set but not the egress! Adding the SG explicitly definitely resolved connectivity.
I realize benchmarking MPI is probably an art in itself, but I have the following results, which seem to indicate that something is perhaps still not quite right?
mpirun -N 36 --hostfile ~/hosts --mca pml_base_verbose 10 --mca mtl_base_verbose 10 --mca pml cm --mca mtl ofi IMB-MPI1 sendrecv
#-----------------------------------------------------------------------------
# Benchmarking Sendrecv
# #processes = 72
#-----------------------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] Mbytes/sec
0 1000 30.34 30.48 30.41 0.00
1 1000 38.86 39.82 39.41 0.05
2 1000 39.12 40.10 39.67 0.10
4 1000 39.19 40.05 39.66 0.20
8 1000 38.96 39.86 39.49 0.40
16 1000 38.82 39.68 39.29 0.81
32 1000 39.18 39.72 39.45 1.61
64 1000 38.74 39.64 39.24 3.23
128 1000 39.09 39.82 39.45 6.43
256 1000 39.14 40.05 39.64 12.78
512 1000 39.42 40.21 39.84 25.47
1024 1000 39.74 40.64 40.25 50.39
2048 1000 40.64 41.57 41.17 98.53
4096 1000 41.77 42.67 42.26 191.99
8192 1000 48.21 48.72 48.48 336.29
16384 1000 115.72 116.49 116.14 281.30
32768 1000 176.62 177.37 177.02 369.48
65536 640 302.35 304.22 303.33 430.85
131072 320 536.86 556.79 550.28 470.81
262144 160 1106.44 1212.06 1177.02 432.56
524288 80 1988.14 2303.71 2209.96 455.17
1048576 40 3339.83 4635.12 4242.20 452.45
2097152 20 7152.87 7997.41 7596.22 524.46
4194304 10 13889.11 17184.22 15639.12 488.16
mpirun -N 36 --hostfile ~/hosts --mca pml_base_verbose 10 --mca mtl_base_verbose 10 IMB-MPI1 sendrecv
#-----------------------------------------------------------------------------
# Benchmarking Sendrecv
# #processes = 72
#-----------------------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] Mbytes/sec
0 1000 3.11 3.39 3.23 0.00
1 1000 3.23 3.50 3.35 0.57
2 1000 3.12 3.39 3.23 1.18
4 1000 3.06 3.34 3.18 2.40
8 1000 3.03 3.32 3.17 4.82
16 1000 3.13 3.40 3.24 9.41
32 1000 3.27 3.64 3.41 17.59
64 1000 3.31 3.56 3.45 35.91
128 1000 3.72 4.02 3.85 63.70
256 1000 4.12 4.41 4.24 116.21
512 1000 4.07 4.31 4.22 237.46
1024 1000 4.48 4.69 4.59 437.03
2048 1000 5.36 5.57 5.48 735.79
4096 1000 7.18 7.45 7.28 1099.24
8192 1000 14.09 14.52 14.30 1128.26
16384 1000 25.11 25.74 25.49 1272.94
32768 1000 47.02 48.23 47.76 1358.72
65536 640 87.31 91.65 90.05 1430.07
131072 320 160.43 178.24 170.95 1470.73
262144 160 233.71 281.05 254.06 1865.46
524288 80 391.63 521.15 445.00 2012.04
1048576 40 650.94 1052.45 839.28 1992.63
2097152 20 883.78 2354.40 1578.68 1781.48
4194304 10 2147.71 4961.45 2956.92 1690.76
Full output for forcing OFI : https://gist.github.com/gcormier/fcc71c7500b3fc443d83aa4e235291e7
Full output for autoselect : https://gist.github.com/gcormier/549e0f9c868e5600733d641aa2e2e342
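To put a number on the gap between the two tables above, the per-size t_avg ratio can be computed from saved IMB output. The sketch below assumes each result row has the six numeric columns shown in the tables (bytes, repetitions, t_min, t_max, t_avg, Mbytes/sec); the function name imb_tavg_ratio and the file arguments are made up for illustration.

```shell
# Hypothetical helper: for each message size present in both files,
# prints "<bytes> <t_avg of file 2 divided by t_avg of file 1>".
# $1: IMB output used as the baseline, $2: IMB output to compare.
# Only rows with six fields and numeric bytes/repetitions are used,
# so header and comment lines are skipped.
imb_tavg_ratio() {
    awk 'NF == 6 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {
             if (FNR == NR)
                 base[$1] = $5
             else if (($1 in base) && base[$1] > 0)
                 printf "%d %.1f\n", $1, $5 / base[$1]
         }' "$1" "$2"
}
```

With the autoselect run as the baseline and the forced-OFI run as the second file, the 1-byte row works out to 39.41 / 3.35, or roughly 11.8 times slower.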
It looks like everything is working correctly. You're running on one instance. EFA (today) does not have shared memory support, but using ENA (ie, TCP) does have shared memory support. While we will add shared memory support to EFA's Libfabric provider in the future, I don't have a hard timeline I can share.
You're running on one instance.
My intent above was to test between the two instances - 36 processes on each ( to match 36 physical cores of c5n.18xlarge) - did I do something wrong?
You're launching 36 ranks (-N 36), and each c5n.18xlarge has 36 cores, so (assuming that you didn't specify a slots count in the hosts file) Open MPI will pack all the ranks onto one instance. If you want to run on two instances, either -N 72 or -npernode 36 would work.
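Since placement rules can differ between Open MPI versions and hostfile settings, one way to check where ranks actually land is to launch hostname with the same placement flags and count the lines per host. A minimal sketch, with a made-up helper name:

```shell
# Hypothetical helper: reads one hostname per line on stdin (as printed
# by `mpirun ... hostname`) and prints "<host>: <rank count>" per host.
ranks_per_host() {
    sort | uniq -c | awk '{ print $2 ": " $1 }'
}
```

For example: mpirun -N 36 --hostfile ~/hosts hostname | ranks_per_host. If only one host shows up in the output, all ranks were placed on a single instance.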
Note that even in this case, you'll see EFA perform worse than ENA for a small number of instances, because of the impact of shared memory compared to using the NIC for on-instance communication.
Okay, thanks. I might play around with a bit more tuning, but otherwise I think this answers most of what I need to know and I can close the issue. Is there a good place to stay up to date on the latest EFA developments other than looking for AWS blog postings?
@gcormier Hi, I am running two c5n.18xl instances with efa+libfabric+openmpi installed through aws-efa-installer. I failed to run IMB-MPI1 across the two instances with this command:
mpirun -np 2 --hostfile ~/hosts -x PATH -x LD_LIBRARY_PATH --mca pml cm --mca mtl ofi --mca pml_base_verbose 10 --mca mtl_base_verbose 10 IMB-MPI1
I am sure both instances have the required software installed and can connect to each other, but I got the following error:
bash: orted: command not found
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
I have checked the .bashrc file: PATH and LD_LIBRARY_PATH are both set there, and orted can be invoked from the command line.
Do you have any suggestions?
Can you check to see if the version of Open MPI installed by the EFA installer is in your path ($ which mpirun should point to /opt/amazon/openmpi/bin/mpirun)? Can you also share the contents of /opt/amazon/efa_installed_packages?
Just from that error, it looks like orte is unable to find the orted binary. Setting the -prefix mpirun parameter and pointing to the Open MPI install path should do it. If you are indeed using Open MPI from /opt/amazon/openmpi, I am a bit confused why this did not work out of the box.
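As a rough sketch of the prefix suggestion above: the install prefix is the directory two levels above the mpirun binary, so it can be derived from whatever mpirun is first in the path. The helper name ompi_prefix_from is made up for illustration.

```shell
# Hypothetical helper: given the full path to mpirun, prints the
# install prefix (the parent of the bin/ directory holding it).
ompi_prefix_from() {
    dirname "$(dirname "$1")"
}
```

A launch could then look like: mpirun --prefix "$(ompi_prefix_from "$(command -v mpirun)")" -np 2 --hostfile ~/hosts IMB-MPI1. One possible cause of the original error, which is a guess and not confirmed in this thread, is that a non-interactive ssh session does not pick up the PATH set in .bashrc; running ssh <host> which orted from the launching node would show whether that is the case.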
Looking back at this thread, it is also worth pointing out that the EFA provider now supports shared memory communication between endpoints on the same instance. Updating to the latest EFA installer (v1.8.3 as of this comment) should give you intra-instance communication performance improvements.
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html
Hi @rajachan, thanks for your quick comments. Adding the prefix option to mpirun works for me.
One thing still confuses me, though: how do I determine whether the EFA module has been enabled?
After checking the output log, there are several lines relevant to efa/ofi:
[ip-172-31-9-118:05639] selected cm best priority 25
[ip-172-31-9-118:05639] select: component cm selected
[ip-172-31-9-118:05679] mtl_ofi_component.c:315: mtl:ofi:provider_include = "(null)"
[ip-172-31-9-118:05679] mtl_ofi_component.c:318: mtl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream"
[ip-172-31-9-118:05679] mtl_ofi_component.c:347: mtl:ofi:prov: efa
I ask because the mtl:ofi:provider_include = "(null)" line here is confusing.
That's expected. mtl:ofi:provider_include is just pointing out that you did not provide an explicit provider include list at runtime. The mtl:ofi:provider_exclude lists the providers excluded by default in the OFI MTL, and mtl:ofi:prov: efa is pointing out that EFA did get selected.
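To pull just the selected provider out of a long verbose log, the mtl:ofi:prov: lines can be filtered as sketched below. The helper name ofi_selected_providers is made up for illustration, and it assumes the log format shown in this thread (with --mca mtl_base_verbose 10).

```shell
# Hypothetical helper: reads mpirun verbose output on stdin and prints
# each distinct provider reported on a "mtl:ofi:prov:" line.
# provider_include / provider_exclude lines do not match the pattern.
ofi_selected_providers() {
    sed -n 's/.*mtl:ofi:prov: *//p' | sort -u
}
```

For example: mpirun -np 2 --hostfile ~/hosts --mca pml cm --mca mtl ofi --mca mtl_base_verbose 10 IMB-MPI1 2>&1 | ofi_selected_providers. Seeing efa (or efa;ofi_rxr, as in the earlier logs) in the result indicates the EFA provider was chosen.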
Glad the prefix flag worked, but I am still curious why you had to provide it in the first place. Are you using Open MPI from /opt/amazon or do you have a custom install?
I am using the one installed by the aws-efa-installer; Open MPI is located in /opt/amazon/openmpi.
(Note this might relate to #6723 , but I wanted to start a new thread. Right now on Azure, I have a ticket open as IB doesn't seem to be showing up on the instances so I figured I'd tackle AWS.)
I have OpenMPI 4.0.1 with UCX 1.5.1 (all manually built).
The fabric adapter is online
I am not receiving any output from this command
hpc@ip-172-31-21-128:~/fvcom/_run$ mpirun IMB-MPI1
Any thoughts on how to begin debugging this?