openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

rocm_ipc_md.c:69 UCX ERROR Failed to create ipc #5485

Open jedbrown opened 4 years ago

jedbrown commented 4 years ago

Describe the bug

I've been following these instructions for ROCm-aware MPI on a Zen2 server node with a Radeon VII and ROCm-3.5.0. The large BAR test passed and the builds all went smoothly, but the OSU bandwidth test starts printing UCX IPC errors partway through the run:

$ $OMPI_DIR/bin/mpirun -n 2 --mca btl '^openib' mpi/pt2pt/osu_bw -d rocm D D
[1595887875.771550] [noether:3621053:0]         parser.c:1626 UCX  WARN  unused env variable: UCX_DIR (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1595887875.771658] [noether:3621052:0]         parser.c:1626 UCX  WARN  unused env variable: UCX_DIR (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
# OSU MPI-ROCM Bandwidth Test v5.3.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
1                       0.67
2                       1.32
4                       2.66
8                       5.28
16                     10.73
32                     11.04
64                     11.86
128                    12.51
256                    19.71
512                    38.49
1024                   68.15
2048                  121.26
4096                  158.25
8192                  167.94
[1595887877.188205] [noether:3621052:0]    rocm_ipc_md.c:69   UCX  ERROR Failed to create ipc for 0x7f99efe00000/4000
[1595887877.188222] [noether:3621052:0]    rocm_ipc_md.c:69   UCX  ERROR Failed to create ipc for 0x7f99efe00000/4000
[1595887877.188226] [noether:3621052:0]    rocm_ipc_md.c:69   UCX  ERROR Failed to create ipc for 0x7f99efe00000/4000
[1595887877.188230] [noether:3621052:0]    rocm_ipc_md.c:69   UCX  ERROR Failed to create ipc for 0x7f99efe00000/4000
[1595887877.188233] [noether:3621052:0]    rocm_ipc_md.c:69   UCX  ERROR Failed to create ipc for 0x7f99efe00000/4000
[1595887877.188235] [noether:3621052:0]    rocm_ipc_md.c:69   UCX  ERROR Failed to create ipc for 0x7f99efe00000/4000
[1595887877.188238] [noether:3621052:0]    rocm_ipc_md.c:69   UCX  ERROR Failed to create ipc for 0x7f99efe00000/4000
[...many screens of similar output...]
65536                 205.33
[...]

The relevant function is

static hsa_status_t uct_rocm_ipc_pack_key(void *address, size_t length,
                                          uct_rocm_ipc_key_t *key)
{
    hsa_status_t status;
    hsa_agent_t agent;
    void *base_ptr;
    size_t size;

    status = uct_rocm_base_get_ptr_info(address, length, &base_ptr, &size, &agent);
    if (status != HSA_STATUS_SUCCESS) {
        ucs_error("pack none ROCM ptr %p/%lx", address, length);
        return status;
    }

    status = hsa_amd_ipc_memory_create(base_ptr, size, &key->ipc);
    if (status != HSA_STATUS_SUCCESS) {
        ucs_error("Failed to create ipc for %p/%lx", address, length);
        return status;
    }

    key->address = (uintptr_t)base_ptr;
    key->length = size;
    key->dev_num = uct_rocm_base_get_dev_num(agent);

    return HSA_STATUS_SUCCESS;
}

Steps to Reproduce

Install latest versions per https://github.com/openucx/ucx/wiki/Build-and-run-ROCM-UCX-OpenMPI and try the intranode test.

# UCT version=1.10.0 revision bae84af
# configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-optimizations --prefix=/projects/ucx/ucx --with-rocm=/opt/rocm --without-knem --without-cuda

Setup and versions

souravzzz commented 4 years ago

Hi @jedbrown, we fixed several IPC issues in the ROCm 3.7 release. Can you give it a try? We recommend uninstalling any older ROCm version before installing 3.7.

jedbrown commented 4 years ago

I've been using ROCM-3.7 successfully for a few days, but unfortunately, this UCX error is still present with fresh rebuilds of ucx, ompi, and the osu benchmark.

$ $OMPI_DIR/bin/mpiexec -n 2 --mca btl '^openib' mpi/pt2pt/osu_bw -d rocm D D
# OSU MPI-ROCM Bandwidth Test v5.3.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
1                       0.64
2                       1.37
4                       2.75
8                       5.56
16                     10.96
32                     11.12
64                     11.82
128                    12.70
256                    19.80
512                    40.31
1024                   69.53
2048                  113.14
4096                  155.10
8192                  184.66
[1598594643.109801] [noether:2043685:0]    rocm_ipc_md.c:69   UCX  ERROR Failed to create ipc for 0x7fd9d7e00000/4000
[1598594643.109821] [noether:2043685:0]    rocm_ipc_md.c:69   UCX  ERROR Failed to create ipc for 0x7fd9d7e00000/4000
[1598594643.109826] [noether:2043685:0]    rocm_ipc_md.c:69   UCX  ERROR Failed to create ipc for 0x7fd9d7e00000/4000
[1598594643.109829] [noether:2043685:0]    rocm_ipc_md.c:69   UCX  ERROR Failed to create ipc for 0x7fd9d7e00000/4000
[1598594643.109834] [noether:2043685:0]    rocm_ipc_md.c:69   UCX  ERROR Failed to create ipc for 0x7fd9d7e00000/4000
[...]

jedbrown commented 3 years ago

I'm still seeing this issue with ROCm-4.0 and a fresh build of today's ucx (1d22f7486ef4202da30ee811a95ad394b862b9a1) and ompi (8ff2277b7e48b899341f69a9f3f9c9ee7cecf476).

$ $OMPI_DIR/bin/mpiexec -n 2 --mca btl '^openib'  mpi/pt2pt/osu_bw -d rocm D D
[noether:1378737] pmix_mca_base_component_repository_open: unable to open mca_ptl_tcp: perhaps a missing symbol, or compiled for a different version of OpenPMIx (ignored)
[noether:1378737] pmix_mca_base_component_repository_open: unable to open mca_ptl_usock: perhaps a missing symbol, or compiled for a different version of OpenPMIx (ignored)
[noether:1378758] pmix_mca_base_component_repository_open: unable to open mca_ptl_tcp: perhaps a missing symbol, or compiled for a different version of OpenPMIx (ignored)
[noether:1378759] pmix_mca_base_component_repository_open: unable to open mca_ptl_tcp: perhaps a missing symbol, or compiled for a different version of OpenPMIx (ignored)
[noether:1378758] pmix_mca_base_component_repository_open: unable to open mca_ptl_usock: perhaps a missing symbol, or compiled for a different version of OpenPMIx (ignored)
[noether:1378759] pmix_mca_base_component_repository_open: unable to open mca_ptl_usock: perhaps a missing symbol, or compiled for a different version of OpenPMIx (ignored)
# OSU MPI-ROCM Bandwidth Test v5.3.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
1                       0.53
2                       0.62
4                       1.24
8                       2.52
16                      5.14
32                     10.28
64                     27.05
128                    13.13
256                    32.39
512                    23.62
1024                   23.42
2048                   23.28
4096                   23.27
8192                   23.23
[1608394399.158674] [noether:1378758:0]    rocm_ipc_md.c:70   UCX  ERROR Failed to create ipc for 0x7fc8a7e00000/4000
[1608394399.158695] [noether:1378758:0]    rocm_ipc_md.c:70   UCX  ERROR Failed to create ipc for 0x7fc8a7e00000/4000
[1608394399.158699] [noether:1378758:0]    rocm_ipc_md.c:70   UCX  ERROR Failed to create ipc for 0x7fc8a7e00000/4000
[1608394399.158702] [noether:1378758:0]    rocm_ipc_md.c:70   UCX  ERROR Failed to create ipc for 0x7fc8a7e00000/4000
[...]

jedbrown commented 3 years ago

Same issue is also present with today's MPICH 'main' (dac05cf7f9ec1a59e2d917f3da80fc943f378872)

$ $MPICH_DIR/bin/mpiexec -n 2 mpi/pt2pt/osu_bw -d rocm D D
# OSU MPI-ROCM Bandwidth Test v5.3.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
1                       0.51
2                       0.64
4                       1.24
8                       2.69
16                      5.46
32                     11.48
64                     28.43
128                    14.47
256                    34.43
512                    26.37
1024                   26.14
2048                   25.80
4096                   25.81
8192                   25.71
[1608692521.150727] [noether:1682552:0]    rocm_ipc_md.c:70   UCX  ERROR Failed to create ipc for 0x7f657e600000/4000
[1608692521.150754] [noether:1682552:0]    rocm_ipc_md.c:70   UCX  ERROR Failed to create ipc for 0x7f657e600000/4000
[1608692521.150758] [noether:1682552:0]    rocm_ipc_md.c:70   UCX  ERROR Failed to create ipc for 0x7f657e600000/4000
[1608692521.150760] [noether:1682552:0]    rocm_ipc_md.c:70   UCX  ERROR Failed to create ipc for 0x7f657e600000/4000

souravzzz commented 3 years ago

Hi @jedbrown, I believe it's a ROCR issue and not UCX. I would like to send you some test programs to debug it further. Is the email @jedbrown.org good to reach?

jedbrown commented 3 years ago

Yes, thanks.

souravzzz commented 3 years ago

The root cause was confirmed to be with ROCR support for Radeon VII and not UCX. An internal issue has been raised to resolve this.

skeelyamd commented 3 years ago

@jedbrown, ROCr supports IPC on Radeon VII. The trouble here seems to be that you are using the upstream amdgpu driver, which does not support IPC on any device. For IPC support you will need to install our DKMS amdgpu driver package. Unfortunately, Debian is not a supported OS (Ubuntu is, however), so our DKMS package may not install against your kernel.

jedbrown commented 3 years ago

Thanks, is the support going upstream? For various reasons, I'm not going to switch distributions, but we're currently on Linux 5.10 and follow the usual upgrades.

simonbyrne commented 2 years ago

Has there been any update on this?

edgargabriel commented 2 years ago

@simonbyrne, can you please try the UCX 1.13.0-rc1 release? We fixed an issue with IPC creation. We didn't test the fix on ROCm 3.x, but we did on ROCm 4.5 and ROCm 5.x.

simonbyrne commented 2 years ago

Thanks, 1.13-rc1 did fix my issue (I'm not sure which ROCm version it is using).