openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

OpenMPI stray message on each packet #5708

Open paboyle opened 4 years ago

paboyle commented 4 years ago

Describe the bug

Followed instructions on https://github.com/openucx/ucx/wiki/Build-and-run-ROCM-UCX-OpenMPI

Quad MI-50 and Rome node.

Getting repeat messages like

[ga001.alces.network:114497] Read -1, expected 10616832, errno = 14

With every MPI packet sent from device memory (errno 14 is EFAULT). No such message if sending from host memory. The program runs normally otherwise, as far as I can tell.

Advice welcome.

Steps to Reproduce

Running www.github.com/paboyle/Grid. Tricky to configure, though; the HIP support is experimental.

$OMPI_DIR/bin/mpirun -np 2 -mca btl '^openib' Benchmark_comms --grid 16.16.16.64 --mpi 1.1.1.2

# UCT version=1.8.0 revision c30b7da
# configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-optimizations --prefix=/cosma/home/dr002/dc-boyl1/install/ucx --with-rocm=/opt/rocm --without-knem --without-cuda

Setup and versions

cat /etc/redhat-release 
CentOS Linux release 7.7.1908 (Core)
rocm-smi --showdriverversion      
========================ROCm System Management Interface========================
Driver version: 5.6.12
==============================End of ROCm SMI Log ==============================
rocm-smi --showproductname
========================ROCm System Management Interface========================
================================================================================
GPU[1]      : Card series:      Vega 20
GPU[1]      : Card vendor:      Advanced Micro Devices, Inc. [AMD/ATI]
GPU[1]      : Card SKU:     D16317
GPU[2]      : Card series:      Vega 20
GPU[2]      : Card vendor:      Advanced Micro Devices, Inc. [AMD/ATI]
GPU[2]      : Card SKU:     D16317
GPU[3]      : Card series:      Vega 20
GPU[3]      : Card vendor:      Advanced Micro Devices, Inc. [AMD/ATI]
GPU[3]      : Card SKU:     D16317
================================================================================
==============================End of ROCm SMI Log ==============================

Additional information (depending on the issue)

                Open MPI: 4.0.3rc4

souravzzz commented 4 years ago

Hi @paboyle, is this test running on a single node or multiple nodes? Can you please provide the output of the following two commands?

$ ./ucx_perftest -c 1 & sleep 1 && ./ucx_perftest -c 2 localhost -t tag_bw -m rocm -s 8 -n 1000
$ ./ucx_perftest -c 1 & sleep 1 && ./ucx_perftest -c 2 localhost -t tag_bw -m rocm -s 1048576 -n 1000

paboyle commented 4 years ago

Hi, thanks!

[dc-boyl1@ga001 bin]$ ./ucx_perftest -c 1 & sleep 1 && ./ucx_perftest -c 2 localhost -t tag_bw -m rocm -s 8 -n 1000
[1] 55605
Waiting for connection...
+------------------------------------------------------------------------------------------+
| API:          protocol layer                                                             |
| Test:         tag match bandwidth                                                        |
| Data layout:  (automatic)                                                                |
| Send memory:  rocm                                                                       |
| Recv memory:  rocm                                                                       |
| Message size: 8                                                                          |
+------------------------------------------------------------------------------------------+
+--------------+-----------------------------+---------------------+-----------------------+
|              |       latency (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall |  average |  overall |   average |   overall |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
[1600367248.484529] [ga001:55608:0]         parser.c:1600 UCX  WARN  unused env variable: UCX_DIR (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1600367248.484540] [ga001:55605:0]         parser.c:1600 UCX  WARN  unused env variable: UCX_DIR (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
          1000     0.000     1.863     1.863       4.10       4.10      536768      536768

and

[dc-boyl1@ga001 bin]$  ./ucx_perftest -c 1 & sleep 1 && ./ucx_perftest -c 2 localhost -t tag_bw -m rocm -s 1048576 -n 1000
[2] 55678
Waiting for connection...
[1]   Done                    ./ucx_perftest -c 1
+------------------------------------------------------------------------------------------+
| API:          protocol layer                                                             |
| Test:         tag match bandwidth                                                        |
| Data layout:  (automatic)                                                                |
| Send memory:  rocm                                                                       |
| Recv memory:  rocm                                                                       |
| Message size: 1048576                                                                    |
+------------------------------------------------------------------------------------------+
+--------------+-----------------------------+---------------------+-----------------------+
|              |       latency (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall |  average |  overall |   average |   overall |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
[1600367305.221433] [ga001:55678:0]         parser.c:1600 UCX  WARN  unused env variable: UCX_DIR (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1600367305.221448] [ga001:55681:0]         parser.c:1600 UCX  WARN  unused env variable: UCX_DIR (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
          1000     0.000   364.840   364.840    2740.93    2740.93        2741        2741
[ga001:55678:0:55678] rocm_ipc_cache.c:56   Fatal: failed to unmap addr:0x2ad1c7600000

/cosma/home/dr002/dc-boyl1/ucx/contrib/../src/ucs/type/class.c: [ ucs_class_call_cleanup_chain() ]
      ...
       49     /* Call remaining destructors */
       50     while (c != NULL) {
       51         c->cleanup(obj);
==>    52         c = c->superclass;
       53     }
       54 }
       55 

[dc-boyl1@ga001 bin]$
==== backtrace (tid:  55678) ====
 0 0x0000000000050e3e ucs_debug_print_backtrace()  /cosma/home/dr002/dc-boyl1/ucx/contrib/../src/ucs/debug/debug.c:625
 1 0x000000000005cfc6 ucs_class_call_cleanup_chain()  /cosma/home/dr002/dc-boyl1/ucx/contrib/../src/ucs/type/class.c:52
 2 0x00000000000054a8 uct_rocm_ipc_ep_t_delete()  /cosma/home/dr002/dc-boyl1/ucx/contrib/../src/uct/rocm/ipc/rocm_ipc_ep.c:40
 3 0x0000000000016ae8 ucp_ep_cleanup_lanes()  /cosma/home/dr002/dc-boyl1/ucx/contrib/../src/ucp/core/ucp_ep.c:765
 4 0x0000000000016b49 ucp_ep_destroy_internal()  /cosma/home/dr002/dc-boyl1/ucx/contrib/../src/ucp/core/ucp_ep.c:731
 5 0x0000000000021377 ucp_worker_destroy_eps()  /cosma/home/dr002/dc-boyl1/ucx/contrib/../src/ucp/core/ucp_worker.c:1855
 6 0x0000000000021377 ucp_worker_destroy()  /cosma/home/dr002/dc-boyl1/ucx/contrib/../src/ucp/core/ucp_worker.c:1876
 7 0x0000000000407db2 ucp_perf_cleanup()  /cosma/home/dr002/dc-boyl1/ucx/contrib/../src/tools/perf/lib/libperf.c:1573
 8 0x000000000040a5e9 ucx_perf_run()  /cosma/home/dr002/dc-boyl1/ucx/contrib/../src/tools/perf/lib/libperf.c:1669
 9 0x00000000004061e4 run_test_recurs()  /cosma/home/dr002/dc-boyl1/ucx/contrib/../src/tools/perf/perftest.c:1464
10 0x00000000004044ee run_test()  /cosma/home/dr002/dc-boyl1/ucx/contrib/../src/tools/perf/perftest.c:1513
11 0x0000000000022505 __libc_start_main()  ???:0
12 0x0000000000404a3c _start()  ???:0
=================================

The second one looks fatal, but that is a really nice backtrace - someone worked hard to make it print the source code around the crash site. I've never seen anyone do that before.
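
For orientation on the fatal itself: UCX's rocm_ipc transport caches IPC mappings of peer device buffers, and the crash is in tearing one of those mappings down. A rough sketch of the underlying HSA IPC lifecycle that cache manages, based on ROCm's hsa_ext_amd.h; error handling trimmed, and assuming hsa_init() has already run and dev_ptr points to a device allocation (not UCX's actual code):

#include <stddef.h>
#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>

/* Sketch of the export/attach/detach cycle that a ROCm IPC cache
 * performs for each remote buffer. */
static hsa_status_t ipc_round_trip(void *dev_ptr, size_t len)
{
    hsa_amd_ipc_memory_t handle;
    void *mapped = NULL;
    hsa_status_t st;

    /* Exporting process: create an IPC handle for the buffer. */
    st = hsa_amd_ipc_memory_create(dev_ptr, len, &handle);
    if (st != HSA_STATUS_SUCCESS) {
        return st;
    }

    /* Importing process (the handle normally travels to a peer over
     * a socket or MPI): map the exported buffer into this process. */
    st = hsa_amd_ipc_memory_attach(&handle, len, 0, NULL, &mapped);
    if (st != HSA_STATUS_SUCCESS) {
        return st;
    }

    /* Teardown: this detach is the "unmap" that rocm_ipc_cache.c:56
     * reports as fatal when the runtime rejects it. */
    return hsa_amd_ipc_memory_detach(mapped);
}

If the detach fails inside the runtime, the cache cannot safely drop or reuse the mapping, which would explain the hard abort during endpoint cleanup and fits the note further down that this was fixed in the ROCm 3.7 runtime.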

paboyle commented 4 years ago

P.S. It's still single node. Two AMD MI-50 GPUs connected to a Rome socket over PCIe Gen4. We have two nodes, and going multi-node is the next step once we are happy with single node.

paboyle commented 4 years ago

multinode

souravzzz commented 4 years ago

@paboyle Thanks for the stack trace. Which version of the ROCm stack are you using ($ apt show rocm-dkms)? This failure (Fatal: failed to unmap addr) looks like something we fixed in ROCm 3.7. Can you please try 3.7 if you are on an older version?

However, this looks different from the original issue you posted. Is the application trying to send 10MB+ messages (10616832 bytes)?

paboyle commented 4 years ago

Hi,

it's on rocm-3.5, but there's no apt since it's Red Hat. I don't manage the system, so I need to interact with the admins to get this changed.

rpm -q -a | grep rocm
hip-base-3.5.20214.5355_rocm_rel_3.5_30_a2917cdc-1.x86_64
hip-doc-3.5.20214.5355_rocm_rel_3.5_30_a2917cdc-1.x86_64
rocm-clang-ocl-0.5.0.51_rocm_rel_3.5_30_74b3b81-1.x86_64
rocm-device-libs-1.0.0.585_rocm_rel_3.5_30_e6d1be0-1.x86_64
rocm-cmake-0.3.0.153_rocm_rel_3.5_30_1d1caa5-1.x86_64
rocm-opencl-2.0.20191-1.x86_64
rocm-utils-3.5.0_30-1.x86_64
rocm-smi-1.0.0_201_rocm_rel_3.5_30_gcdfbef4-1.x86_64
hsa-ext-rocr-dev-1.1.30500.0_rocm_rel_3.5_30_def83d8a-1.x86_64
rocm-debug-agent-1.0.0.30500_rocm_rel_3.5_30-1.x86_64
rocm-dev-3.5.0_30-1.x86_64
rocm-validation-suite-3.0.0-1.x86_64
hsa-rocr-dev-1.1.30500.0_rocm_rel_3.5_30_def83d8a-1.x86_64
rocm-opencl-devel-2.0.20191-1.x86_64
hip-rocclr-3.5.20214.5355_rocm_rel_3.5_30_a2917cdc-1.x86_64
rocm-smi-lib64-2.3.0.8.rocm_rel_3.5_30_2143bc3-1.x86_64
rocm-libs-3.7.0_20-1.x86_64
rocm-gdb-9.1_rocm_rel_3.5_30-1.x86_64
rocminfo-1.30500.0-1.x86_64
comgr-1.6.0.143_rocm_rel_3.5_30_e24e8c1-1.x86_64
hip-samples-3.5.20214.5355_rocm_rel_3.5_30_a2917cdc-1.x86_64
rocm-dbgapi-0.21.2_rocm_rel_3.5_30-1.x86_64

paboyle commented 4 years ago

Hi - yes, at times the packets exceed 10 MB.

souravzzz commented 4 years ago

@paboyle These two issues look relevant; can you take a look?
https://github.com/horovod/horovod/issues/503
https://github.com/horovod/horovod/issues/243#issuecomment-381295853
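
Both of those threads point at Open MPI's shared-memory (vader) BTL, which uses Linux cross-memory attach, i.e. the process_vm_readv syscall, for large single-copy transfers; that syscall cannot pin GPU device mappings and fails with -1 and errno 14 (EFAULT). A minimal hypothetical reproducer of that failure shape, not taken from this thread, assuming a ROCm install and compiling with hipcc:

#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>
#include <hip/hip_runtime.h>

int main(void)
{
    size_t len = 1 << 20;                      /* 1 MiB stand-in for a halo buffer */
    void  *dev_buf  = NULL;
    char  *host_buf = (char *)malloc(len);

    if (hipMalloc(&dev_buf, len) != hipSuccess) {
        fprintf(stderr, "hipMalloc failed\n");
        return 1;
    }

    struct iovec local;                        /* destination: ordinary host memory */
    struct iovec remote;                       /* source: the device allocation */
    local.iov_base  = host_buf;  local.iov_len  = len;
    remote.iov_base = dev_buf;   remote.iov_len = len;

    /* CMA read within our own process: with a host-memory source this
     * succeeds; with a device pointer the kernel cannot pin the pages
     * and is expected to return -1 with errno 14 (EFAULT) -- the same
     * shape as "Read -1, expected 10616832, errno = 14". */
    ssize_t n = process_vm_readv(getpid(), &local, 1, &remote, 1, 0);
    printf("Read %zd, expected %zu, errno = %d (%s)\n",
           n, len, (n < 0) ? errno : 0, (n < 0) ? strerror(errno) : "ok");

    hipFree(dev_buf);
    free(host_buf);
    return 0;
}

With a host-memory source the same call succeeds, which matches the observation below that the message only appears for device buffers.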

paboyle commented 4 years ago

Looks like the same message, but oddly it only happens when I use MPI on device memory, not host memory, in my code. I don't believe we're virtualised or using Docker. It's a regular Linux cluster node with direct login and ssh access.
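
If the vader CMA path is indeed the source, a commonly suggested workaround (untested here, and separate from the UCX ROCm path) is to disable the BTL's single-copy mechanism, e.g.:

$OMPI_DIR/bin/mpirun -np 2 -mca btl '^openib' -mca btl_vader_single_copy_mechanism none Benchmark_comms --grid 16.16.16.64 --mpi 1.1.1.2

Shared-memory traffic then falls back to copy-in/copy-out, which should silence the stray reads at some cost in host-to-host bandwidth.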