Open and-1 opened 1 year ago
I tried to reproduce the issue, but it works on my systems. I used an MI100 system with UCX 1.14.1, ROCm 5.4.x, Open MPI v5.0.x (the branch, not an RC, but the difference should be minimal), and OSU micro-benchmarks 5.9. All my tests work.
root@ixt-sjc2-07:/home/egabriel/osu-micro-benchmarks-5.9/mpi/pt2pt# mpirun -x UCX_RNDV_SCHEME=put_zcopy -x UCX_TLS=sm,self,rocm --allow-run-as-root --mca pml ucx -np 2 ./osu_bw D D
# OSU MPI-ROCM Bandwidth Test v5.9
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
1 0.60
2 0.73
4 1.47
8 2.88
16 5.80
32 11.73
64 31.47
128 16.16
256 40.16
512 29.75
1024 29.54
2048 234.41
4096 364.46
8192 608.75
16384 1508.76
32768 2912.71
65536 5385.83
131072 9415.25
262144 15098.03
524288 21557.33
1048576 27462.94
2097152 31711.97
4194304 34328.98
So we probably need to identify the difference between your system and mine. First, could you please provide the PATH and LD_LIBRARY_PATH environment variables that you used while running the tests? Second, can you please confirm that the large BAR test shown at https://github.com/openucx/ucx/wiki/Build-and-run-ROCm-UCX-OpenMPI#Sanity-Check-for-Large-BAR-setting works correctly on your system?
By the way, I also noticed some unusual errors in your ucx_info output, indicating that something is not entirely right on the platform.
# Transport: tcp
# Device: int0
# Type: network
# System device: <unknown>
#
# capabilities:
# bandwidth: 11.32/ppn + 0.00 MB/sec
# latency: 10960 nsec
# overhead: 50000 nsec
# [1690299590.847174] [localhost.i:967330:0] sys.c:140 UCX ERROR mremap(oldptr=0x7fb7b7ee0000 oldsize=4096, newsize=8192) failed: Cannot allocate memory
[1690299590.847384] [localhost.i:967330:0] sys.c:140 UCX ERROR
This is an error that I have not seen so far, so we should probably try to clarify where this is coming from.
@and-1 can you pls run the ucx_info command after setting the following environment variables:
export UCX_LOG_LEVEL_TRIGGER=error
export UCX_HANDLE_ERRORS=bt
This should print a backtrace showing the origin of the mremap error.
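A minimal sketch of one way to capture that output reproducibly, assuming ucx_info is on PATH. The two variables are the ones named above; the helper names and the output file name are hypothetical and only serve the illustration.

```python
import os
import shutil
import subprocess

# The two variables requested in the comment above; nothing else is changed.
UCX_DEBUG_ENV = {
    "UCX_LOG_LEVEL_TRIGGER": "error",
    "UCX_HANDLE_ERRORS": "bt",
}


def build_env(base=None):
    """Return a copy of the environment with the UCX debug variables set."""
    env = dict(os.environ if base is None else base)
    env.update(UCX_DEBUG_ENV)
    return env


def run_ucx_info(output_path="ucx_info_bt.txt"):
    """Run `ucx_info -d` and save stdout and stderr to output_path.

    output_path is a hypothetical file name chosen for this sketch.
    """
    exe = shutil.which("ucx_info")
    if exe is None:
        raise FileNotFoundError("ucx_info not found on PATH")
    result = subprocess.run(
        [exe, "-d"],
        env=build_env(),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # UCX errors may go to stderr; keep them in order
        text=True,
        check=False,
    )
    with open(output_path, "w") as fh:
        fh.write(result.stdout)
    return result.returncode
```

Calling run_ucx_info() writes the combined output to a file that can be attached to the ticket. Merging stderr into stdout keeps the error lines next to the transport they were printed for.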
Also, is it possible vm.max_map_count is too small on your system?
@edgargabriel of course:
env | grep -e ^PATH -e ^LD_LIBRARY_PATH
LD_LIBRARY_PATH=/root/ompi_for_gpu/ompi/lib:/root/ompi_for_gpu/ucx/lib:/opt/rocm/lib
PATH=/root/ompi_for_gpu/ompi/bin:/root/.local/bin:/root/bin:/home/and-1/.local/bin:/home/and-1/bin:/usr/share/Modules/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/puppetlabs/bin:/root/ompi_for_gpu/ucx/bin/:/opt/rocm-5.4.3/llvm/bin/
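Since the goal is to spot differences between the two systems, a small sketch like the following could list which directories on LD_LIBRARY_PATH contain a given library, in search order. The function name and the default library prefixes are assumptions for illustration, and the check only covers LD_LIBRARY_PATH: the dynamic loader also consults RPATH/RUNPATH entries and the ld.so cache, which this sketch ignores.

```python
import os


def find_library_copies(lib_prefix, search_path):
    """Return (directory, [file names]) for every directory in a
    colon-separated search path that holds a file starting with lib_prefix.
    Directories are reported in search order."""
    hits = []
    for directory in search_path.split(":"):
        if not directory or not os.path.isdir(directory):
            continue
        names = sorted(n for n in os.listdir(directory) if n.startswith(lib_prefix))
        if names:
            hits.append((directory, names))
    return hits


def report(lib_prefixes=("libucp.so", "libmpi.so", "libamdhip64.so")):
    """Print where each library prefix is found on LD_LIBRARY_PATH."""
    search_path = os.environ.get("LD_LIBRARY_PATH", "")
    for prefix in lib_prefixes:
        hits = find_library_copies(prefix, search_path)
        if not hits:
            print(f"{prefix}: not found on LD_LIBRARY_PATH")
        for directory, names in hits:
            print(f"{prefix}: {directory} -> {', '.join(names)}")
```

If report() shows the same library in more than one directory, the first directory listed is the one searched first, which is a common source of mismatched UCX or ROCm builds.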
The large BAR test executed without errors:
./check_large_bar
address buf 0x7fcc9c400000
Buf[0] = -1094795586
Buf[0] = 1
@yosefe setting these environment variables doesn't affect the ucx_info -d output; the result is the same as the one I attached before. Maybe the output with UCX_MEM_LOG_LEVEL=debug and UCX_LOG_LEVEL=data will help you understand the problem:
ucx_info.txt
vm.max_map_count has the default value:
cat /proc/sys/vm/max_map_count
262144
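The limit on its own does not show how close a process gets to it. As a rough sketch, the number of mappings a process currently holds can be compared against vm.max_map_count by counting the lines of /proc/<pid>/maps, where each line describes one mapping. The function names are hypothetical, and the sketch assumes a Linux /proc filesystem.

```python
def read_max_map_count(path="/proc/sys/vm/max_map_count"):
    """Read the per-process limit on the number of memory mappings."""
    with open(path) as fh:
        return int(fh.read().strip())


def count_mappings(pid="self", proc_root="/proc"):
    """Count the mappings of a process: one line per mapping in maps."""
    with open(f"{proc_root}/{pid}/maps") as fh:
        return sum(1 for _ in fh)


def mapping_headroom(pid="self", proc_root="/proc",
                     limit_path="/proc/sys/vm/max_map_count"):
    """Return (used, limit, remaining) for the given process."""
    limit = read_max_map_count(limit_path)
    used = count_mappings(pid, proc_root)
    return used, limit, limit - used
```

Running mapping_headroom(pid) against the PID of a stalled ucx_info or osu_bw process would show whether the mapping count is anywhere near the limit. A count far below 262144 would suggest that the mremap failure has a different cause.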
@and-1 maybe https://github.com/openucx/ucx/pull/8822 can fix the issue?
Unfortunately not, the result is the same.
Are you executing in a Docker container? If yes, maybe you could try the same test using e.g. an Ubuntu 20.04 or 22.04 container?
AlmaLinux is not a distribution officially supported by ROCm. I understand from some webpages that it is supposedly binary compatible with RHEL, but you never know. In addition, if AlmaLinux 9.2 is equivalent to RHEL 9.2, this might also be an issue, since the ROCm 5.4.x series is only validated up to RHEL 9.1.
I have some updates on this ticket. I managed to get my hands on a RHEL 9.2 system with ROCm 5.6.0 and ran some tests.
The OSU bandwidth tests for ROCm devices still finished correctly for me on RHEL 9.2 (note, this is a different GPU than yours, notably without XGMI links):
mpirun --mca pml ucx -np 2 ./osu_bw d d
[1691157609.444279] sys.c:140 UCX ERROR mremap(oldptr=0x7fa0b3d49000 oldsize=4096, newsize=8192) failed: Cannot allocate memory
[1691157609.444745] sys.c:140 UCX ERROR mremap(oldptr=0x7fa0b3d48000 oldsize=4096, newsize=8192) failed: Cannot allocate memory
[1691157609.444956] sys.c:140 UCX ERROR mremap(oldptr=0x7fa0b3d42000 oldsize=4096, newsize=8192) failed: Cannot allocate memory
[1691157609.445294] sys.c:140 UCX ERROR mremap(oldptr=0x7fa0b3d02000 oldsize=4096, newsize=8192) failed: Cannot allocate memory
[1691157609.445502] sys.c:140 UCX ERROR mremap(oldptr=0x7fa0b3cce000 oldsize=4096, newsize=8192) failed: Cannot allocate memory
[1691157609.445828] sys.c:140 UCX ERROR mremap(oldptr=0x7fa0b3cc4000 oldsize=4096, newsize=8192) failed: Cannot allocate memory
[1691157609.445920] sys.c:140 UCX ERROR mremap(oldptr=0x7fa0b3cb3000 oldsize=4096, newsize=8192) failed: Cannot allocate memory
[1691157609.446094] sys.c:140 UCX ERROR mremap(oldptr=0x7fa0b3ca9000 oldsize=4096, newsize=8192) failed: Cannot allocate memory
[1691157609.446844] sys.c:140 UCX ERROR mremap(oldptr=0x7fa0b3c8a000 oldsize=4096, newsize=8192) failed: Cannot allocate memory
[1691157609.447199] sys.c:140 UCX ERROR mremap(oldptr=0x7f1bffcc9000 oldsize=4096, newsize=8192) failed: Cannot allocate memory
[1691157609.447418] sys.c:140 UCX ERROR mremap(oldptr=0x7f1bffcc8000 oldsize=4096, newsize=8192) failed: Cannot allocate memory
[1691157609.447618] sys.c:140 UCX ERROR mremap(oldptr=0x7f1bffcc7000 oldsize=4096, newsize=8192) failed: Cannot allocate memory
[1691157609.447964] sys.c:140 UCX ERROR mremap(oldptr=0x7f1bffc4f000 oldsize=4096, newsize=8192) failed: Cannot allocate memory
[1691157609.447975] sys.c:140 UCX ERROR mremap(oldptr=0x7fa0b3c68000 oldsize=4096, newsize=8192) failed: Cannot allocate memory
[1691157609.448302] sys.c:140 UCX ERROR mremap(oldptr=0x7f1bffc45000 oldsize=4096, newsize=8192) failed: Cannot allocate memory
[1691157609.448398] sys.c:140 UCX ERROR mremap(oldptr=0x7f1bffc34000 oldsize=4096, newsize=8192) failed: Cannot allocate memory
[1691157609.448569] sys.c:140 UCX ERROR mremap(oldptr=0x7f1bffc2a000 oldsize=4096, newsize=8192) failed: Cannot allocate memory
[1691157609.449267] sys.c:140 UCX ERROR mremap(oldptr=0x7f1bffc0b000 oldsize=4096, newsize=8192) failed: Cannot allocate memory
[1691157609.450210] sys.c:140 UCX ERROR mremap(oldptr=0x7f1bff9e0000 oldsize=4096, newsize=8192) failed: Cannot allocate memory
# OSU MPI Bandwidth Test v7.0
# Size Bandwidth (MB/s)
1 7.05
2 15.79
4 31.60
8 64.50
16 121.32
32 245.13
64 509.42
128 547.53
256 990.40
512 1875.25
1024 3231.27
2048 5252.97
4096 6574.84
8192 9764.14
16384 12589.82
32768 17356.26
65536 21059.10
131072 24763.85
262144 26406.89
524288 26726.06
1048576 13986.17
2097152 13099.02
4194304 15000.84
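The run above prints the same mremap failure many times with different addresses. A small sketch like the following could condense such a log into one entry per distinct error, which makes it easier to compare logs from different systems. The regular expression is an assumption based only on the log lines shown in this thread, not on a documented UCX log format, and the function name is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as
#   [1691157609.444279] sys.c:140 UCX ERROR mremap(oldptr=0x... ) failed: ...
# The optional [host:pid:tid] field seen in other logs in this thread is
# also accepted. This pattern is inferred from the thread, not from UCX docs.
UCX_ERROR_RE = re.compile(
    r"\[(?P<ts>\d+\.\d+)\]\s+(?:\[[^\]]*\]\s+)?(?P<src>\S+:\d+)\s+UCX\s+ERROR\s*(?P<msg>.*)"
)
HEX_RE = re.compile(r"0x[0-9a-fA-F]+")


def summarize_ucx_errors(lines):
    """Count UCX ERROR lines per (source location, message).

    Hex addresses are replaced by '0x?' so that failures differing only
    in the pointer value fall into the same bucket.
    """
    counts = Counter()
    for line in lines:
        match = UCX_ERROR_RE.search(line)
        if match:
            msg = HEX_RE.sub("0x?", match.group("msg")).strip()
            counts[(match.group("src"), msg)] += 1
    return counts
```

Feeding the captured output line by line into summarize_ucx_errors() returns a Counter keyed by source location and normalized message, so non-error lines such as the bandwidth table are ignored.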
@edgargabriel can you pls set UCX_LOG_LEVEL_TRIGGER=error, to track the backtrace of the problematic mremap?
@yosefe, the variable didn't provide any insights, but I got a backtrace from a debugger run. It seems that the issue originates in the ucm/rocm code (ucm_rocmmem_install in rocmmem.c). I will need to investigate.
Thread 1 "a.out" hit Breakpoint 1, ucm_sys_realloc (ptr=0x7ffff7a0b008, size=4096) at ../../../src/ucm/util/sys.c:140
140 ucm_error("mremap(oldptr=%p oldsize=%zu, newsize=%zu) failed: %m",
Missing separate debuginfos, use: dnf debuginfo-install comgr-2.5.0.50700-crdb.3325.el9.x86_64 elfutils-libelf-0.188-3.el9.x86_64 glibc-2.34-60.el9.x86_64 hip-runtime-amd-5.6.31101.50700-crdb.3325.el9.x86_64 hsa-rocr-1.9.0.50700-crdb.3325.el9.x86_64 libgcc-11.3.1-4.3.el9.x86_64 libstdc++-11.3.1-4.3.el9.x86_64 libxml2-2.9.13-3.el9_2.1.x86_64 libzstd-1.5.1-2.el9.x86_64 ncurses-libs-6.2-8.20210508.el9.x86_64 numactl-libs-2.0.14-9.el9.x86_64 xz-libs-5.2.5-8.el9_0.x86_64
(gdb) where
#0 ucm_sys_realloc (ptr=0x7ffff7a0b008, size=4096) at ../../../src/ucm/util/sys.c:140
#1 0x00007ffff7f9d3ee in kh_resize_ucm_dl_symbol_hash (h=0x7ffff7fbf318, new_n_buckets=512) at ../../../src/ucm/util/reloc.c:61
#2 0x00007ffff7f9d849 in kh_put_ucm_dl_symbol_hash (h=0x7ffff7fbf318, key=0x7ffff7aa19bc "uct_tcp_cm_send_event", ret=0x7fffffffc840)
at ../../../src/ucm/util/reloc.c:61
#3 0x00007ffff7f9f108 in ucm_dl_populate_symbols (dl_info=0x7ffff7fbf318, dlpi_addr=140737348481024, table=0x7ffff7aaa500,
table_size=7728, strtab=0x7ffff7a9f540, symtab=0x7ffff7a9bcd0, dl_name=0x7ffff7fc6860 "/home/taccuser/edgar/UCX/lib/libuct.so.0")
at ../../../src/ucm/util/reloc.c:281
#4 0x00007ffff7f9f5e5 in ucm_reloc_dl_info_get (phdr_info=0x7fffffffcae0,
dl_name=0x7ffff7fc6860 "/home/taccuser/edgar/UCX/lib/libuct.so.0", dl_info_p=0x7fffffffca80) at ../../../src/ucm/util/reloc.c:375
#5 0x00007ffff7f9f9f1 in ucm_reloc_phdr_iterator (phdr_info=0x7fffffffcae0, size=64, data=0x7fffffffcb80)
at ../../../src/ucm/util/reloc.c:465
#6 0x00007ffff79961a4 in dl_iterate_phdr () from /lib64/libc.so.6
#7 0x00007ffff7f9fb44 in ucm_reloc_apply_patch (patch=0x7ffff7fb1e20 <ucm_dlopen_reloc_patches>, libucm_base_addr=0)
at ../../../src/ucm/util/reloc.c:502
#8 0x00007ffff7fa037b in ucm_reloc_install_dl_hooks () at ../../../src/ucm/util/reloc.c:716
#9 0x00007ffff7fa046d in ucm_reloc_modify (patch=0x7ffff73330c0 <patches>) at ../../../src/ucm/util/reloc.c:748
#10 0x00007ffff7330a3b in ucm_rocmmem_install (events=2097152) at ../../../../src/ucm/rocm/rocmmem.c:156
#11 0x00007ffff7f97123 in ucm_event_install (events=2097152) at ../../../src/ucm/event/event.c:553
#12 0x00007ffff7f971f7 in ucm_set_event_handler (events=2097152, priority=1000, cb=0x7ffff7a5e01a <ucs_rcache_unmapped_callback>,
arg=0x72ab60) at ../../../src/ucm/event/event.c:596
#13 0x00007ffff7a60323 in ucs_rcache_t_init (self=0x72ab60, _myclass=0x7ffff7a95800 <ucs_rcache_t_class>, _init_count=0x7fffffffcde8,
params=0x7fffffffce60, name=0x7fffebb959bc "rocm_copy", stats_parent=0x0) at ../../../src/ucs/memory/rcache.c:1266
#14 0x00007ffff7a60672 in ucs_rcache_create (arg0=0x7fffffffce60, arg1=0x7fffebb959bc "rocm_copy", arg2=0x0, obj_p=0x747500)
at ../../../src/ucs/memory/rcache.c:1326
#15 0x00007fffebb8ec13 in uct_rocm_copy_md_open (component=0x7fffebb9a620 <uct_rocm_copy_component>, md_name=0x7fffffffcf80 "rocm_cpy",
config=0x73e150, md_p=0x7fffffffcf20) at ../../../../src/uct/rocm/copy/rocm_copy_md.c:439
#16 0x00007ffff7aae5dd in uct_md_open (component=0x7fffebb9a620 <uct_rocm_copy_component>, md_name=0x7fffffffcf80 "rocm_cpy",
config=0x73e150, md_p=0x73b750) at ../../../src/uct/base/uct_md.c:95
#17 0x00007ffff743240d in ucp_fill_tl_md (context=0x739110, cmpt_index=4 '\004', md_rsc=0x7fffffffcf80, tl_md=0x73b750)
at ../../../src/ucp/core/ucp_context.c:1293
#18 0x00007ffff7432f20 in ucp_add_component_resources (context=0x739110, cmpt_index=4 '\004', avail_devices=0x7fffffffd110,
avail_tls=0x7fffffffd0e0, dev_cfg_masks=0x7fffffffd1c0, tl_cfg_mask=0x7fffffffd1b8, config=0x6da650, aux_tls=0x7fffffffd0a0)
at ../../../src/ucp/core/ucp_context.c:1488
#19 0x00007ffff7433c55 in ucp_fill_resources (context=0x739110, config=0x6da650) at ../../../src/ucp/core/ucp_context.c:1721
#20 0x00007ffff74353ac in ucp_init_version (api_major_version=1, api_minor_version=16, params=0x7fffffffd2e0, config=0x6da650,
context_p=0x7ffff7f174c0 <ompi_pml_ucx+192>) at ../../../src/ucp/core/ucp_context.c:2165
#21 0x00007ffff7e28bea in mca_pml_ucx_open () from /home/taccuser/edgar/OpenMPI/lib/libmpi.so.40
#22 0x00007ffff7b317a2 in mca_base_framework_components_open () from /home/taccuser/edgar/OpenMPI/lib/libopen-pal.so.80
#23 0x00007ffff7e245f7 in mca_pml_base_open () from /home/taccuser/edgar/OpenMPI/lib/libmpi.so.40
#24 0x00007ffff7b322e1 in mca_base_framework_open () from /home/taccuser/edgar/OpenMPI/lib/libopen-pal.so.80
#25 0x00007ffff7c95473 in ompi_mpi_instance_init_common () from /home/taccuser/edgar/OpenMPI/lib/libmpi.so.40
#26 0x00007ffff7c961a4 in ompi_mpi_instance_init () from /home/taccuser/edgar/OpenMPI/lib/libmpi.so.40
#27 0x00007ffff7c88d78 in ompi_mpi_init () from /home/taccuser/edgar/OpenMPI/lib/libmpi.so.40
#28 0x00007ffff7cbd33e in PMPI_Init () from /home/taccuser/edgar/OpenMPI/lib/libmpi.so.40
#29 0x0000000000401196 in main ()
Describe the bug
I'm trying to run the OSU tests following the documentation, but I'm experiencing a few errors like:
Steps to Reproduce
Setup and versions
AlmaLinux release 9.2 (Turquoise Kodkod)
Linux localhost.i 5.14.0-284.11.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Tue May 9 05:49:00 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
MLNX_OFED_LINUX-23.04-1.1.3.0 (OFED-23.04-1.1.3)
Additional information (depending on the issue)
ucx-info.txt config.log