gcormier closed this issue 5 years ago
Packages I am installing on a new image
- cmake
- git
- makedepf90
- gfortran
- gcc
- libnetcdf-dev
- libnetcdff-dev
- netcdf-bin
- openmpi-bin
- openmpi-common
- libopenmpi-dev
- libhdf5-openmpi-dev
- patch
- htop
- iptraf-ng
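For anyone reproducing the image, the list above can be turned into a single install line. This is only a sketch: it assumes an apt-based system (Ubuntu 18.04, per the background section) and prints the command for review rather than running it. The variable names are made up for illustration.

```shell
# Package names are copied from the list above.
packages="cmake git makedepf90 gfortran gcc libnetcdf-dev libnetcdff-dev netcdf-bin openmpi-bin openmpi-common libopenmpi-dev libhdf5-openmpi-dev patch htop iptraf-ng"

# Build the command as a string and print it (dry run);
# run the printed line yourself once it looks right.
install_cmd="sudo apt-get install -y $packages"
echo "$install_cmd"
```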
Compiling IMB
git clone https://github.com/intel/opa-mpi-apps/
cd opa-mpi-apps/MpiApps/apps/imb/src
make CC=mpicc
Any thoughts or ideas? Do you need additional information?
It would appear the task is running, just no output.
hpc@hpc-fvcom-vm1:~/fvcom/_run$ mpirun --debug-daemons -np 4 ./IMB-MPI1
[hpc-fvcom-vm1:27694] [[56012,0],0] orted_cmd: received add_local_procs
MPIR_being_debugged = 0
MPIR_debug_state = 1
MPIR_partial_attach_ok = 1
MPIR_i_am_starter = 0
MPIR_forward_output = 0
MPIR_proctable_size = 4
MPIR_proctable:
(i, host, exe, pid) = (0, hpc-fvcom-vm1, /home/hpc/fvcom/_run/./IMB-MPI1, 27699)
(i, host, exe, pid) = (1, hpc-fvcom-vm1, /home/hpc/fvcom/_run/./IMB-MPI1, 27700)
(i, host, exe, pid) = (2, hpc-fvcom-vm1, /home/hpc/fvcom/_run/./IMB-MPI1, 27701)
(i, host, exe, pid) = (3, hpc-fvcom-vm1, /home/hpc/fvcom/_run/./IMB-MPI1, 27704)
MPIR_executable_path: NULL
MPIR_server_arguments: NULL
This works and displays output as expected.
mpirun --mca btl self,tcp ./IMB-MPI1
So it would appear I need to switch gears now to debugging IB functionality.
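One way to narrow this down is to try each BTL combination separately and see which one hangs. The helper below is a rough sketch, not a tested procedure: the function name, the 60-second timeout, and the process count of 2 are assumptions, and it only prints the commands so they can be run one at a time.

```shell
# Dry-run helper: prints one mpirun command per BTL combination.
# "timeout 60" is an assumed guard so a hanging transport does not block the shell.
print_btl_trials() {
    bench="$1"
    for btl in self,tcp self,vader self,openib; do
        echo "timeout 60 mpirun -np 2 --mca btl $btl $bench"
    done
}

print_btl_trials ./IMB-MPI1
```

If only the `self,openib` line stalls, that points at the InfiniBand path rather than at the benchmark or the launcher.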
hpc@hpc-fvcom-vm1:~/fvcom/_run$ mpirun --debug-daemons -np 8 --host 10.10.1.4:4,10.10.1.5:4 --mca btl_openib_verbose 9 --mca btl self,vader,openib,tcp ./IMB-MPI1
Daemon [[1802,0],1] checking in as pid 41834 on host hpc-fvcom-vm2
[hpc-fvcom-vm2:41834] [[1802,0],1] orted: up and running - waiting for commands!
[hpc-fvcom-vm1:45544] [[1802,0],0] orted_cmd: received add_local_procs
[hpc-fvcom-vm2:41834] [[1802,0],1] orted_cmd: received tree_spawn
[hpc-fvcom-vm2:41834] [[1802,0],1] orted_cmd: received add_local_procs
MPIR_being_debugged = 0
MPIR_debug_state = 1
MPIR_partial_attach_ok = 1
MPIR_i_am_starter = 0
MPIR_forward_output = 0
MPIR_proctable_size = 8
MPIR_proctable:
(i, host, exe, pid) = (0, hpc-fvcom-vm1, /home/hpc/fvcom/_run/./IMB-MPI1, 45554)
(i, host, exe, pid) = (1, hpc-fvcom-vm1, /home/hpc/fvcom/_run/./IMB-MPI1, 45555)
(i, host, exe, pid) = (2, hpc-fvcom-vm1, /home/hpc/fvcom/_run/./IMB-MPI1, 45556)
(i, host, exe, pid) = (3, hpc-fvcom-vm1, /home/hpc/fvcom/_run/./IMB-MPI1, 45559)
(i, host, exe, pid) = (4, 10.10.1.5, /home/hpc/fvcom/_run/./IMB-MPI1, 41838)
(i, host, exe, pid) = (5, 10.10.1.5, /home/hpc/fvcom/_run/./IMB-MPI1, 41839)
(i, host, exe, pid) = (6, 10.10.1.5, /home/hpc/fvcom/_run/./IMB-MPI1, 41840)
(i, host, exe, pid) = (7, 10.10.1.5, /home/hpc/fvcom/_run/./IMB-MPI1, 41843)
MPIR_executable_path: NULL
MPIR_server_arguments: NULL
[hpc-fvcom-vm1][[1802,1],0][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 4120
[hpc-fvcom-vm1][[1802,1],0][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: Mellanox ConnectX5
[hpc-fvcom-vm1][[1802,1],0][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x0000, part ID 0
[hpc-fvcom-vm1][[1802,1],0][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: default
[hpc-fvcom-vm1][[1802,1],1][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 4120
[hpc-fvcom-vm1][[1802,1],1][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: Mellanox ConnectX5
[hpc-fvcom-vm1][[1802,1],1][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x0000, part ID 0
[hpc-fvcom-vm1][[1802,1],1][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: default
[hpc-fvcom-vm1][[1802,1],2][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 4120
[hpc-fvcom-vm1][[1802,1],2][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: Mellanox ConnectX5
[hpc-fvcom-vm1][[1802,1],2][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x0000, part ID 0
[hpc-fvcom-vm1][[1802,1],2][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: default
[hpc-fvcom-vm1][[1802,1],3][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 4120
[hpc-fvcom-vm1][[1802,1],3][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: Mellanox ConnectX5
[hpc-fvcom-vm1][[1802,1],3][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x0000, part ID 0
[hpc-fvcom-vm1][[1802,1],3][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: default
[hpc-fvcom-vm2][[1802,1],4][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 4120
[hpc-fvcom-vm2][[1802,1],4][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: Mellanox ConnectX5
[hpc-fvcom-vm2][[1802,1],4][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x0000, part ID 0
[hpc-fvcom-vm2][[1802,1],4][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: default
[hpc-fvcom-vm2][[1802,1],5][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 4120
[hpc-fvcom-vm2][[1802,1],5][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: Mellanox ConnectX5
[hpc-fvcom-vm2][[1802,1],5][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x0000, part ID 0
[hpc-fvcom-vm2][[1802,1],5][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: default
[hpc-fvcom-vm2][[1802,1],6][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 4120
[hpc-fvcom-vm2][[1802,1],6][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: Mellanox ConnectX5
[hpc-fvcom-vm2][[1802,1],6][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x0000, part ID 0
[hpc-fvcom-vm2][[1802,1],6][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: default
[hpc-fvcom-vm2][[1802,1],7][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 4120
[hpc-fvcom-vm2][[1802,1],7][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: Mellanox ConnectX5
[hpc-fvcom-vm2][[1802,1],7][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x0000, part ID 0
[hpc-fvcom-vm2][[1802,1],7][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: default
[hpc-fvcom-vm1][[1802,1],0][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 4120
[hpc-fvcom-vm1][[1802,1],0][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: Mellanox ConnectX5
[hpc-fvcom-vm1][[1802,1],1][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 4120
[hpc-fvcom-vm1][[1802,1],1][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: Mellanox ConnectX5
[hpc-fvcom-vm1][[1802,1],1][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 4120
[hpc-fvcom-vm1][[1802,1],1][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: Mellanox ConnectX5
[hpc-fvcom-vm1][[1802,1],2][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 4120
[hpc-fvcom-vm1][[1802,1],2][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: Mellanox ConnectX5
[hpc-fvcom-vm1][[1802,1],2][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 4120
[hpc-fvcom-vm1][[1802,1],2][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: Mellanox ConnectX5
[hpc-fvcom-vm1][[1802,1],3][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 4120
[hpc-fvcom-vm1][[1802,1],3][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: Mellanox ConnectX5
[hpc-fvcom-vm1][[1802,1],3][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 4120
[hpc-fvcom-vm1][[1802,1],3][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: Mellanox ConnectX5
[hpc-fvcom-vm1][[1802,1],2][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 4120
[hpc-fvcom-vm1][[1802,1],2][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: Mellanox ConnectX5
[hpc-fvcom-vm1][[1802,1],3][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 4120
[hpc-fvcom-vm1][[1802,1],3][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: Mellanox ConnectX5
[hpc-fvcom-vm1][[1802,1],3][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 4120
[hpc-fvcom-vm1][[1802,1],3][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: Mellanox ConnectX5
[hpc-fvcom-vm2][[1802,1],5][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 4120
[hpc-fvcom-vm2][[1802,1],5][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: Mellanox ConnectX5
[hpc-fvcom-vm2][[1802,1],5][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 4120
[hpc-fvcom-vm2][[1802,1],5][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: Mellanox ConnectX5
[hpc-fvcom-vm2][[1802,1],6][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 4120
[hpc-fvcom-vm2][[1802,1],6][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: Mellanox ConnectX5
[hpc-fvcom-vm2][[1802,1],6][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 4120
[hpc-fvcom-vm2][[1802,1],6][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: Mellanox ConnectX5
[hpc-fvcom-vm2][[1802,1],6][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 4120
[hpc-fvcom-vm2][[1802,1],6][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: Mellanox ConnectX5
[hpc-fvcom-vm2][[1802,1],4][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 4120
[hpc-fvcom-vm2][[1802,1],4][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: Mellanox ConnectX5
[hpc-fvcom-vm2][[1802,1],7][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 4120
[hpc-fvcom-vm2][[1802,1],7][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: Mellanox ConnectX5
[hpc-fvcom-vm2][[1802,1],7][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 4120
[hpc-fvcom-vm2][[1802,1],7][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: Mellanox ConnectX5
[hpc-fvcom-vm2][[1802,1],7][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 4120
[hpc-fvcom-vm2][[1802,1],7][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: Mellanox ConnectX5
[hpc-fvcom-vm2][[1802,1],7][btl_openib_ini.c:173:opal_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 4120
[hpc-fvcom-vm2][[1802,1],7][btl_openib_ini.c:192:opal_btl_openib_ini_query] Found corresponding INI values: Mellanox ConnectX5
*hangs here*
I'm afraid I know nothing about the Azure platform, but have you tried updating to Open MPI v4.x with Open UCX / the ucx PML (vs. the ob1 PML + the openib BTL)?
I can give that a shot when I have a few cycles! I will report back.
@gcormier Please use UCX PML on Azure HPC platforms. It is well tested and gives best performance.
@jladd-mlnx I'll give it a shot. Sorry for the delay - IB devices weren't appearing on hc44rs, but I just tried provisioning one and it seems to be back to normal so I can spend some time on this now.
Very nice! Working!
Some useful things below in case others stumble on this. Most can be found at https://github.com/gcormier/hpc-fvcom/tree/master/azure
mpirun -npernode 44 -mca pml ucx --mca btl ^vader,tcp,openib -x UCX_IB_PKEY=$UCX_IB_PKEY --hostfile ~/hosts IMB-MPI1 sendrecv
cat << EOF | sudo tee -a /etc/security/limits.conf
* hard memlock unlimited
* soft memlock unlimited
* hard nofile 65535
* soft nofile 65535
EOF
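The limits.conf change above only applies to new login sessions, so it is worth checking what the current shell actually sees after logging back in. A small sketch, assuming a shell whose `ulimit` supports `-l` and `-n` (bash does); the function name is made up:

```shell
# Print the soft and hard limits that apply to the current shell.
# After the limits.conf change and a fresh login, memlock should read
# "unlimited" and nofile should read 65535.
show_limits() {
    echo "memlock soft=$(ulimit -S -l) hard=$(ulimit -H -l)"
    echo "nofile  soft=$(ulimit -S -n) hard=$(ulimit -H -n)"
}

show_limits
```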
#!/bin/bash
high_key=`sort -r /sys/class/infiniband/mlx5_0/ports/1/pkeys/* | head -1`
modified_key=$(printf '0x%04X\n' "$((high_key ^ 0x8000))")
echo "Setting UCX_IB_PKEY to $modified_key"
export UCX_IB_PKEY=$modified_key
echo Updating /etc/profile.d/ucx_pkey.sh
echo "export UCX_IB_PKEY=$modified_key" | sudo tee -a /etc/profile.d/ucx_pkey.sh
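To make the arithmetic in the script above concrete: it takes the highest partition key and flips its top bit with XOR. Since the highest key has that bit set (in InfiniBand the top bit of a P_Key marks full membership), the result is the key with the bit cleared. The sketch below isolates that one step; the function name and the sample value 0x800b are illustrative and not taken from this thread.

```shell
# Flip the top bit of a 16-bit partition key and print it as 0xNNNN,
# mirroring the printf/XOR line in the script above.
clear_pkey_msb() {
    printf '0x%04X\n' "$(( $1 ^ 0x8000 ))"
}

clear_pkey_msb 0x800b   # prints 0x000B
```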
@gcormier Great!!
@jladd-mlnx Should this be added to the OMPI FAQ as well?
As someone who knows next to nothing about this world, I would have found it useful to have a "zero to hero" script that takes a fresh instance and runs pingpong over IB on Azure.
Such a script could be built by combining https://github.com/gcormier/hpc-fvcom/blob/master/azure/packer-ubuntu.sh with the tips above (a few logouts are required along the way).
Background information
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
2.1.1
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Ubuntu 18.04LTS default repository
Please describe the system on which you are running
ibv_devinfo
hca_id: mlx5_0
    transport: InfiniBand (0)
    fw_ver: 16.23.1020
    node_guid: 0015:5dff:fe33:ff5f
    sys_image_guid: 9803:9b03:000c:6d1e
    vendor_id: 0x02c9
    vendor_part_id: 4120
    hw_ver: 0x0
    board_id: MT_0000000010
    phys_port_cnt: 1
        port: 1
            state: PORT_ACTIVE (4)
            max_mtu: 4096 (5)
            active_mtu: 4096 (5)
            sm_lid: 1
            port_lid: 780
            port_lmc: 0x00
            link_layer: InfiniBand

[Mellanox ConnectX5]
vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3,0x119f
vendor_part_id = 4119,4120,4121
use_eager_rdma = 1
mtu = 4096
max_inline_data = 256