[Open] jedbrown opened this issue 4 years ago
Hi @jedbrown, we fixed several IPC issues in the ROCm 3.7 release. Can you give it a try? It's recommended to uninstall any older ROCm version before installing 3.7.
I've been using ROCm 3.7 successfully for a few days, but unfortunately this UCX error is still present with fresh rebuilds of ucx, ompi, and the OSU benchmark.
```
$ $OMPI_DIR/bin/mpiexec -n 2 --mca btl '^openib' mpi/pt2pt/osu_bw -d rocm D D
# OSU MPI-ROCM Bandwidth Test v5.3.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
1                       0.64
2                       1.37
4                       2.75
8                       5.56
16                     10.96
32                     11.12
64                     11.82
128                    12.70
256                    19.80
512                    40.31
1024                   69.53
2048                  113.14
4096                  155.10
8192                  184.66
[1598594643.109801] [noether:2043685:0] rocm_ipc_md.c:69 UCX ERROR Failed to create ipc for 0x7fd9d7e00000/4000
[1598594643.109821] [noether:2043685:0] rocm_ipc_md.c:69 UCX ERROR Failed to create ipc for 0x7fd9d7e00000/4000
[1598594643.109826] [noether:2043685:0] rocm_ipc_md.c:69 UCX ERROR Failed to create ipc for 0x7fd9d7e00000/4000
[1598594643.109829] [noether:2043685:0] rocm_ipc_md.c:69 UCX ERROR Failed to create ipc for 0x7fd9d7e00000/4000
[1598594643.109834] [noether:2043685:0] rocm_ipc_md.c:69 UCX ERROR Failed to create ipc for 0x7fd9d7e00000/4000
[...]
```
I'm still seeing this issue with ROCm-4.0 and a fresh build of today's ucx (1d22f7486ef4202da30ee811a95ad394b862b9a1) and ompi (8ff2277b7e48b899341f69a9f3f9c9ee7cecf476).
```
$ $OMPI_DIR/bin/mpiexec -n 2 --mca btl '^openib' mpi/pt2pt/osu_bw -d rocm D D
[noether:1378737] pmix_mca_base_component_repository_open: unable to open mca_ptl_tcp: perhaps a missing symbol, or compiled for a different version of OpenPMIx (ignored)
[noether:1378737] pmix_mca_base_component_repository_open: unable to open mca_ptl_usock: perhaps a missing symbol, or compiled for a different version of OpenPMIx (ignored)
[noether:1378758] pmix_mca_base_component_repository_open: unable to open mca_ptl_tcp: perhaps a missing symbol, or compiled for a different version of OpenPMIx (ignored)
[noether:1378759] pmix_mca_base_component_repository_open: unable to open mca_ptl_tcp: perhaps a missing symbol, or compiled for a different version of OpenPMIx (ignored)
[noether:1378758] pmix_mca_base_component_repository_open: unable to open mca_ptl_usock: perhaps a missing symbol, or compiled for a different version of OpenPMIx (ignored)
[noether:1378759] pmix_mca_base_component_repository_open: unable to open mca_ptl_usock: perhaps a missing symbol, or compiled for a different version of OpenPMIx (ignored)
# OSU MPI-ROCM Bandwidth Test v5.3.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
1                       0.53
2                       0.62
4                       1.24
8                       2.52
16                      5.14
32                     10.28
64                     27.05
128                    13.13
256                    32.39
512                    23.62
1024                   23.42
2048                   23.28
4096                   23.27
8192                   23.23
[1608394399.158674] [noether:1378758:0] rocm_ipc_md.c:70 UCX ERROR Failed to create ipc for 0x7fc8a7e00000/4000
[1608394399.158695] [noether:1378758:0] rocm_ipc_md.c:70 UCX ERROR Failed to create ipc for 0x7fc8a7e00000/4000
[1608394399.158699] [noether:1378758:0] rocm_ipc_md.c:70 UCX ERROR Failed to create ipc for 0x7fc8a7e00000/4000
[1608394399.158702] [noether:1378758:0] rocm_ipc_md.c:70 UCX ERROR Failed to create ipc for 0x7fc8a7e00000/4000
[...]
```
The same issue is also present with today's MPICH `main` (dac05cf7f9ec1a59e2d917f3da80fc943f378872):
```
$ $MPICH_DIR/bin/mpiexec -n 2 mpi/pt2pt/osu_bw -d rocm D D
# OSU MPI-ROCM Bandwidth Test v5.3.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
1                       0.51
2                       0.64
4                       1.24
8                       2.69
16                      5.46
32                     11.48
64                     28.43
128                    14.47
256                    34.43
512                    26.37
1024                   26.14
2048                   25.80
4096                   25.81
8192                   25.71
[1608692521.150727] [noether:1682552:0] rocm_ipc_md.c:70 UCX ERROR Failed to create ipc for 0x7f657e600000/4000
[1608692521.150754] [noether:1682552:0] rocm_ipc_md.c:70 UCX ERROR Failed to create ipc for 0x7f657e600000/4000
[1608692521.150758] [noether:1682552:0] rocm_ipc_md.c:70 UCX ERROR Failed to create ipc for 0x7f657e600000/4000
[1608692521.150760] [noether:1682552:0] rocm_ipc_md.c:70 UCX ERROR Failed to create ipc for 0x7f657e600000/4000
```
Hi @jedbrown, I believe this is a ROCr issue rather than a UCX one. I would like to send you some test programs to debug it further. Is the email @jedbrown.org good for reaching you?
Yes, thanks.
The root cause was confirmed to be in ROCr support for the Radeon VII, not in UCX. An internal issue has been filed to resolve this.
@jedbrown, ROCr supports IPC on the Radeon VII. The trouble here is that you appear to be using the upstream amdgpu driver, which does not support IPC on any device. For IPC support you will need to install our DKMS amdgpu driver package. Unfortunately, Debian is not a supported OS (Ubuntu is, however), so our DKMS package may not install against your kernel.
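If it's unclear which amdgpu driver is loaded, one heuristic check is to look at where the module file lives. This is a sketch under the assumption that the DKMS package installs the module under a `dkms`/`updates` path, while the in-tree driver ships inside the kernel tree; it is not an official diagnostic:

```shell
# Classify the amdgpu kernel module by its install path (heuristic).
# Systems without the module (or without modinfo) report "not-found".
if command -v modinfo >/dev/null 2>&1 && modpath=$(modinfo -n amdgpu 2>/dev/null); then
    case "$modpath" in
        */dkms/*|*/updates/*) driver_kind="dkms" ;;     # likely the ROCm DKMS build
        *)                    driver_kind="in-tree" ;;  # likely the upstream driver
    esac
else
    driver_kind="not-found"
fi
echo "amdgpu driver: $driver_kind"
```

If this prints `in-tree`, the IPC path exercised by `rocm_ipc_md.c` would be unsupported per the comment above.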
Thanks. Is the support going upstream? For various reasons, I'm not going to switch distributions, but we're currently on Linux 5.10 and follow the usual upgrades.
Has there been any update on this?
@simonbyrne, can you please try the UCX 1.13.0-rc1 release? We fixed an issue with IPC creation (we didn't test on ROCm 3.x, but we did on ROCm 4.5 and ROCm 5.x).
Thanks, 1.13.0-rc1 did fix my issue (I'm not sure which ROCm version it is using).
Describe the bug
I've been following these instructions for ROCm-aware MPI on a Zen2 server node with a Radeon VII and ROCm-3.5.0. The large-BAR test passed, and the builds all went smoothly, but the OSU test fails with the UCX errors shown above.
The relevant function is the IPC-creation routine at rocm_ipc_md.c:69 in the error output above.
Steps to Reproduce
Install latest versions per https://github.com/openucx/ucx/wiki/Build-and-run-ROCM-UCX-OpenMPI and try the intranode test.
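The intranode reproduction can be sketched as a small script. The `OMPI_DIR` default and the `osu_bw` path here are assumptions mirroring the invocations earlier in this thread, not fixed locations:

```shell
# Illustrative repro script: run the OSU bandwidth test device-to-device
# with two ranks on one node, as in the commands quoted in this issue.
OMPI_DIR=${OMPI_DIR:-$HOME/ompi-install}   # assumed Open MPI install prefix
OSU_BW=${OSU_BW:-mpi/pt2pt/osu_bw}         # assumed path to the built benchmark
if [ -x "$OMPI_DIR/bin/mpiexec" ] && [ -x "$OSU_BW" ]; then
    "$OMPI_DIR/bin/mpiexec" -n 2 --mca btl '^openib' "$OSU_BW" -d rocm D D
    repro_status="ran"
else
    repro_status="prerequisites missing (build ucx/ompi/osu per the wiki first)"
fi
echo "repro: $repro_status"
```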
Setup and versions
```
Linux noether 5.7.0-1-amd64 #1 SMP Debian 5.7.6-1 (2020-06-24) x86_64 GNU/Linux
```
rocminfo