paboyle opened this issue 4 years ago
Hi @paboyle, is this test running on a single node or on multiple nodes? Can you please provide the output of the following two commands?
$ ./ucx_perftest -c 1 & sleep 1 && ./ucx_perftest -c 2 localhost -t tag_bw -m rocm -s 8 -n 1000
$ ./ucx_perftest -c 1 & sleep 1 && ./ucx_perftest -c 2 localhost -t tag_bw -m rocm -s 1048576 -n 1000
Hi, thanks!
[dc-boyl1@ga001 bin]$ ./ucx_perftest -c 1 & sleep 1 && ./ucx_perftest -c 2 localhost -t tag_bw -m rocm -s 8 -n 1000
[1] 55605
Waiting for connection...
+------------------------------------------------------------------------------------------+
| API:          protocol layer                                                              |
| Test:         tag match bandwidth                                                         |
| Data layout:  (automatic)                                                                 |
| Send memory:  rocm                                                                        |
| Recv memory:  rocm                                                                        |
| Message size: 8                                                                           |
+------------------------------------------------------------------------------------------+
+--------------+-----------------------------+---------------------+-----------------------+
|              |       latency (usec)        |  bandwidth (MB/s)   | message rate (msg/s)  |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall |  average |  overall |   average |   overall |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
[1600367248.484529] [ga001:55608:0] parser.c:1600 UCX WARN unused env variable: UCX_DIR (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1600367248.484540] [ga001:55605:0] parser.c:1600 UCX WARN unused env variable: UCX_DIR (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
1000 0.000 1.863 1.863 4.10 4.10 536768 536768
and
[dc-boyl1@ga001 bin]$ ./ucx_perftest -c 1 & sleep 1 && ./ucx_perftest -c 2 localhost -t tag_bw -m rocm -s 1048576 -n 1000
[2] 55678
Waiting for connection...
[1] Done ./ucx_perftest -c 1
+------------------------------------------------------------------------------------------+
| API:          protocol layer                                                              |
| Test:         tag match bandwidth                                                         |
| Data layout:  (automatic)                                                                 |
| Send memory:  rocm                                                                        |
| Recv memory:  rocm                                                                        |
| Message size: 1048576                                                                     |
+------------------------------------------------------------------------------------------+
+--------------+-----------------------------+---------------------+-----------------------+
|              |       latency (usec)        |  bandwidth (MB/s)   | message rate (msg/s)  |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall |  average |  overall |   average |   overall |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
[1600367305.221433] [ga001:55678:0] parser.c:1600 UCX WARN unused env variable: UCX_DIR (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1600367305.221448] [ga001:55681:0] parser.c:1600 UCX WARN unused env variable: UCX_DIR (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
1000 0.000 364.840 364.840 2740.93 2740.93 2741 2741
[ga001:55678:0:55678] rocm_ipc_cache.c:56 Fatal: failed to unmap addr:0x2ad1c7600000
/cosma/home/dr002/dc-boyl1/ucx/contrib/../src/ucs/type/class.c: [ ucs_class_call_cleanup_chain() ]
...
49 /* Call remaining destructors */
50 while (c != NULL) {
51 c->cleanup(obj);
==> 52 c = c->superclass;
53 }
54 }
55
==== backtrace (tid: 55678) ====
0 0x0000000000050e3e ucs_debug_print_backtrace() /cosma/home/dr002/dc-boyl1/ucx/contrib/../src/ucs/debug/debug.c:625
1 0x000000000005cfc6 ucs_class_call_cleanup_chain() /cosma/home/dr002/dc-boyl1/ucx/contrib/../src/ucs/type/class.c:52
2 0x00000000000054a8 uct_rocm_ipc_ep_t_delete() /cosma/home/dr002/dc-boyl1/ucx/contrib/../src/uct/rocm/ipc/rocm_ipc_ep.c:40
3 0x0000000000016ae8 ucp_ep_cleanup_lanes() /cosma/home/dr002/dc-boyl1/ucx/contrib/../src/ucp/core/ucp_ep.c:765
4 0x0000000000016b49 ucp_ep_destroy_internal() /cosma/home/dr002/dc-boyl1/ucx/contrib/../src/ucp/core/ucp_ep.c:731
5 0x0000000000021377 ucp_worker_destroy_eps() /cosma/home/dr002/dc-boyl1/ucx/contrib/../src/ucp/core/ucp_worker.c:1855
6 0x0000000000021377 ucp_worker_destroy() /cosma/home/dr002/dc-boyl1/ucx/contrib/../src/ucp/core/ucp_worker.c:1876
7 0x0000000000407db2 ucp_perf_cleanup() /cosma/home/dr002/dc-boyl1/ucx/contrib/../src/tools/perf/lib/libperf.c:1573
8 0x000000000040a5e9 ucx_perf_run() /cosma/home/dr002/dc-boyl1/ucx/contrib/../src/tools/perf/lib/libperf.c:1669
9 0x00000000004061e4 run_test_recurs() /cosma/home/dr002/dc-boyl1/ucx/contrib/../src/tools/perf/perftest.c:1464
10 0x00000000004044ee run_test() /cosma/home/dr002/dc-boyl1/ucx/contrib/../src/tools/perf/perftest.c:1513
11 0x0000000000022505 __libc_start_main() ???:0
12 0x0000000000404a3c _start() ???:0
=================================
The second one looks fatal, but that is a really nice backtrace: someone worked hard to get that source-code-printing flight log. I've never seen anyone do that before.
P.S. It's still single node: two AMD MI50 GPUs connected to a Rome socket over PCIe Gen4. We have two nodes, and going multinode is the next step once we're happy with single node.
@paboyle Thanks for the stack trace. Which version of the ROCm stack are you using ($ apt show rocm-dkms)? This failure ("Fatal: failed to unmap addr") looks like something we fixed in ROCm 3.7. Can you please try that version if you are on an older one?
However, this looks different from the original issue you posted. Is the application trying to send 10MB+ messages (10616832 bytes)?
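If the upgrade takes a while to arrange, one interim check worth trying (a sketch, assuming the crash really is in the rocm_ipc transport named in the backtrace; the ^ prefix negates the UCX_TLS list) is to exclude that transport and see whether the failure disappears:

# Diagnostic only: rerun the failing large-message test without rocm_ipc
$ UCX_TLS=^rocm_ipc ./ucx_perftest -c 1 & sleep 1 && \
  UCX_TLS=^rocm_ipc ./ucx_perftest -c 2 localhost -t tag_bw -m rocm -s 1048576 -n 1000

Note that excluding rocm_ipc costs intra-node GPU-to-GPU bandwidth, so this would be a diagnostic rather than a fix.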
Hi,
it's on rocm-3.5, but there's no apt since the system runs Red Hat. I don't manage the system, so I'll need to work with the admins to get this changed.
rpm -q -a | grep rocm
hip-base-3.5.20214.5355_rocm_rel_3.5_30_a2917cdc-1.x86_64
hip-doc-3.5.20214.5355_rocm_rel_3.5_30_a2917cdc-1.x86_64
rocm-clang-ocl-0.5.0.51_rocm_rel_3.5_30_74b3b81-1.x86_64
rocm-device-libs-1.0.0.585_rocm_rel_3.5_30_e6d1be0-1.x86_64
rocm-cmake-0.3.0.153_rocm_rel_3.5_30_1d1caa5-1.x86_64
rocm-opencl-2.0.20191-1.x86_64
rocm-utils-3.5.0_30-1.x86_64
rocm-smi-1.0.0_201_rocm_rel_3.5_30_gcdfbef4-1.x86_64
hsa-ext-rocr-dev-1.1.30500.0_rocm_rel_3.5_30_def83d8a-1.x86_64
rocm-debug-agent-1.0.0.30500_rocm_rel_3.5_30-1.x86_64
rocm-dev-3.5.0_30-1.x86_64
rocm-validation-suite-3.0.0-1.x86_64
hsa-rocr-dev-1.1.30500.0_rocm_rel_3.5_30_def83d8a-1.x86_64
rocm-opencl-devel-2.0.20191-1.x86_64
hip-rocclr-3.5.20214.5355_rocm_rel_3.5_30_a2917cdc-1.x86_64
rocm-smi-lib64-2.3.0.8.rocm_rel_3.5_30_2143bc3-1.x86_64
rocm-libs-3.7.0_20-1.x86_64
rocm-gdb-9.1_rocm_rel_3.5_30-1.x86_64
rocminfo-1.30500.0-1.x86_64
comgr-1.6.0.143_rocm_rel_3.5_30_e24e8c1-1.x86_64
hip-samples-3.5.20214.5355_rocm_rel_3.5_30_a2917cdc-1.x86_64
rocm-dbgapi-0.21.2_rocm_rel_3.5_30-1.x86_64
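For what it's worth, the version can also be read straight out of the install tree (assuming the default /opt/rocm prefix that ROCm uses):

# ROCm records its release in the install tree (default prefix assumed)
$ cat /opt/rocm/.info/version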
Hi - yes, at times the packets exceed 10 MB.
@paboyle These two issues look relevant; can you take a look?
https://github.com/horovod/horovod/issues/503
https://github.com/horovod/horovod/issues/243#issuecomment-381295853
Looks like the same message, but oddly it only happens when I use MPI on device memory, not on host memory, in my code. I don't believe we're virtualised or using Docker; it's a regular Linux cluster node with direct ssh login.
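A possible way to isolate this outside the application (a sketch reusing the perftest pair from earlier in this thread; 10616832 bytes is the size mentioned above, and -m host / -m rocm select the memory type):

# Same message size from host memory, then from ROCm device memory
$ ./ucx_perftest -c 1 & sleep 1 && ./ucx_perftest -c 2 localhost -t tag_bw -m host -s 10616832 -n 1000
$ ./ucx_perftest -c 1 & sleep 1 && ./ucx_perftest -c 2 localhost -t tag_bw -m rocm -s 10616832 -n 1000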
Describe the bug
Followed the instructions on https://github.com/openucx/ucx/wiki/Build-and-run-ROCM-UCX-OpenMPI on a quad MI50 + Rome node.
Getting repeated messages with every MPI packet sent from device memory; no such message appears when sending from host memory. The programme otherwise runs normally as far as I can tell.
Advice welcome.
Steps to Reproduce
Running www.github.com/paboyle/Grid. It's tricky to configure, though, and the HIP support is experimental.
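For reference, the UCX side of the setup followed the wiki linked above; roughly (a hedged sketch: the install prefix is illustrative, and --with-rocm is UCX's standard ROCm configure option):

# Approximate UCX build per the Build-and-run-ROCM-UCX-OpenMPI wiki
$ git clone https://github.com/openucx/ucx.git && cd ucx
$ ./autogen.sh
$ ./contrib/configure-release --prefix=$HOME/ucx-rocm --with-rocm=/opt/rocm
$ make -j8 && make install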