openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

OSU works with errors on server with AMD gpu #9237

Open and-1 opened 1 year ago

and-1 commented 1 year ago

Describe the bug

I'm trying to run the OSU tests following the documentation, but I'm experiencing a few errors like:

$OMPI_DIR/bin/mpirun -np 2 -x UCX_TLS=sm,self,rocm --allow-run-as-root -x HIP_VISIBLE_DEVICES=0,1 -x UCX_RNDV_SCHEME=put_zcopy --mca pml ucx mpi/pt2pt/osu_bw -d rocm D D
[1690292071.482590] [localhost.i:777081:0]              sys.c:140  UCX  ERROR mremap(oldptr=0x7fca85e8d000 oldsize=4096, newsize=8192) failed: Cannot allocate memory
[1690292071.483656] [localhost.i:777081:0]              sys.c:140  UCX  ERROR mremap(oldptr=0x7fca85e5f000 oldsize=4096, newsize=8192) failed: Cannot allocate memory
[1690292071.486078] [localhost.i:777081:0]              sys.c:140  UCX  ERROR mremap(oldptr=0x7fca85e31000 oldsize=4096, newsize=8192) failed: Cannot allocate memory
[1690292071.486479] [localhost.i:777081:0]              sys.c:140  UCX  ERROR mremap(oldptr=0x7fca85e1d000 oldsize=4096, newsize=8192) failed: Cannot allocate memory
[1690292071.574608] [localhost:777081:0]          parser.c:1989 UCX  WARN  unused environment variable: UCX_DIR (maybe: UCX_TLS?)
[1690292071.574608] [localhost:777081:0]          parser.c:1989 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1690292071.584505] [localhost.i:777082:0]              sys.c:140  UCX  ERROR mremap(oldptr=0x7fbd01bbd000 oldsize=4096, newsize=8192) failed: Cannot allocate memory
[1690292071.584925] [localhost.i:777082:0]              sys.c:140  UCX  ERROR mremap(oldptr=0x7fbd01ba9000 oldsize=4096, newsize=8192) failed: Cannot allocate memory
[1690292071.649853] [localhost:777082:0]          parser.c:1989 UCX  WARN  unused environment variable: UCX_DIR (maybe: UCX_TLS?)
[1690292071.649853] [localhost:777082:0]          parser.c:1989 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
# OSU MPI-ROCM Bandwidth Test v5.9
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
1                       0.66
2                       1.32
4                       2.65
8                       5.28
16                     10.62
32                     21.19
64                     37.94
128                   104.95
256                    81.43
512                   144.44
1024                  165.48
2048                  236.77
4096                  376.21
8192                  624.40
[1690292072.322674] [localhost:777081:0]     rocm_ipc_md.c:78   UCX  ERROR Failed to create ipc for 0x7fc27d400000/4000
[1690292072.322695] [localhost:777081:0]     rocm_ipc_md.c:78   UCX  ERROR Failed to create ipc for 0x7fc27d400000/4000
[...]

Steps to Reproduce

$OMPI_DIR/bin/mpirun -np 2 -x UCX_TLS=sm,self,rocm --allow-run-as-root -x HIP_VISIBLE_DEVICES=0,1 -x UCX_RNDV_SCHEME=put_zcopy --mca pml ucx mpi/pt2pt/osu_bw -d rocm D D
# Library version: 1.14.1
# Library path: /root/ompi_for_gpu/ucx/lib/libucs.so.0
# API headers version: 1.14.1
# Git branch '', revision 04897a0
# Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check -prefix=/root/ompi_for_gpu/ucx --with-rocm=/opt/rocm --without-cuda -enable-optimizations -disable-assertions --disable-params-check -without-java --enable-logging

Setup and versions

Additional information (depending on the issue)

ompi_info
                 Package: Open MPI and-1@localhost.i Distribution
                Open MPI: 5.0.0rc12
  Open MPI repo revision: v5.0.0rc12-59-g5f5ab938bd
   Open MPI release date: Unreleased developer copy
                 MPI API: 3.1.0
            Ident string: 5.0.0rc12
                  Prefix: /root/ompi_for_gpu/ompi
 Configured architecture: x86_64-pc-linux-gnu
           Configured by: and-1
           Configured on: Tue Jul 25 13:10:03 UTC 2023
          Configure host: localhost.i
  Configure command line: '--prefix=/root/ompi_for_gpu/ompi'
                          '--with-ucx=/root/ompi_for_gpu/ucx'
                          '--with-rocm=/opt/rocm'
                          '--enable-mca-no-build=btl-uct'
                          '--enable-mpi1-compatibility' 'CC=clang'
                          'CXX=clang++' 'FC=flang'

ucx-info.txt config.log

edgargabriel commented 1 year ago

I tried to reproduce the issue, but it works on my systems. I used an MI100 system with UCX 1.14.1, ROCm 5.4.x, Open MPI v5.0.x (the branch, not an RC, but the difference should be minimal), and OSU 5.9. All my tests work.

root@ixt-sjc2-07:/home/egabriel/osu-micro-benchmarks-5.9/mpi/pt2pt# mpirun -x UCX_RNDV_SCHEME=put_zcopy -x UCX_TLS=sm,self,rocm --allow-run-as-root --mca pml ucx -np 2 ./osu_bw D D
# OSU MPI-ROCM Bandwidth Test v5.9
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
1                       0.60
2                       0.73
4                       1.47
8                       2.88
16                      5.80
32                     11.73
64                     31.47
128                    16.16
256                    40.16
512                    29.75
1024                   29.54
2048                  234.41
4096                  364.46
8192                  608.75
16384                1508.76
32768                2912.71
65536                5385.83
131072               9415.25
262144              15098.03
524288              21557.33
1048576             27462.94
2097152             31711.97
4194304             34328.98

So we probably need to identify the difference between your system and mine. First, could you please provide the PATH and LD_LIBRARY_PATH environment variables that you used while running the tests? Second, can you please confirm that the large BAR test shown here https://github.com/openucx/ucx/wiki/Build-and-run-ROCm-UCX-OpenMPI#Sanity-Check-for-Large-BAR-setting works correctly on your system?
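One quick way to compare two systems on this point is to see which LD_LIBRARY_PATH directory first provides the UCX libraries. The following is only a sketch: the helper name `first_lib_dir` is made up for illustration and is not part of UCX or Open MPI. It walks LD_LIBRARY_PATH only and ignores RPATH/RUNPATH, the ld.so cache, and the default system directories, so running `ldd` on the benchmark binary remains the authoritative check.

```shell
# Hypothetical helper (not part of UCX or Open MPI): print the first
# LD_LIBRARY_PATH directory that contains the given library file name.
# Returns 1 if no directory on LD_LIBRARY_PATH contains it.
first_lib_dir() {
    lib="$1"
    old_ifs="$IFS"
    IFS=':'
    # Unquoted on purpose so the value is split on ':'.
    for dir in $LD_LIBRARY_PATH; do
        if [ -n "$dir" ] && [ -e "${dir}/${lib}" ]; then
            IFS="$old_ifs"
            echo "$dir"
            return 0
        fi
    done
    IFS="$old_ifs"
    return 1
}

first_lib_dir libucs.so.0 || echo "libucs.so.0 not found on LD_LIBRARY_PATH"
```

If the printed directory is not the UCX install prefix that Open MPI was configured against, the run may be picking up a different UCX build than intended.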

By the way, I also noticed some unusual errors in your ucx_info output, indicating that something is not entirely right on the platform.

#      Transport: tcp
#         Device: int0
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 11.32/ppn + 0.00 MB/sec
#              latency: 10960 nsec
#             overhead: 50000 nsec
#          [1690299590.847174] [localhost.i:967330:0]              sys.c:140  UCX  ERROR mremap(oldptr=0x7fb7b7ee0000 oldsize=4096, newsize=8192) failed: Cannot allocate memory
[1690299590.847384] [localhost.i:967330:0]              sys.c:140  UCX  ERROR 

This is an error that I have not seen so far, so we should probably try to clarify where this is coming from.

yosefe commented 1 year ago

@and-1 can you pls run the ucx_info command after setting the following environment variables:

export UCX_LOG_LEVEL_TRIGGER=error
export UCX_HANDLE_ERRORS=bt

This will print a backtrace with the origin of the mremap error. Also, is it possible that vm.max_map_count is too small on your system?
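To test the vm.max_map_count hypothesis, it can help to compare the number of mappings a process currently holds with the kernel limit. This is a minimal sketch, assuming Linux with /proc mounted; the helper name `map_usage` is hypothetical and not part of UCX.

```shell
# Hypothetical helper (not part of UCX): print how many memory mappings
# a process holds next to the per-process limit vm.max_map_count.
map_usage() {
    pid="${1:-$$}"
    # Each line of /proc/<pid>/maps describes one mapping.
    count=$(wc -l < "/proc/${pid}/maps") || return 1
    limit=$(cat /proc/sys/vm/max_map_count) || return 1
    # $((count)) strips any whitespace padding that wc may emit.
    echo "pid=${pid} mappings=$((count)) max_map_count=${limit}"
}

# Report for the current shell; pass another PID to inspect that process.
map_usage "$$"
```

To inspect a running benchmark rank instead of the shell, pass its PID, for example `map_usage "$(pgrep -n osu_bw)"`. A mapping count far below the limit suggests that the limit is not the cause of the failure.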

and-1 commented 1 year ago

@edgargabriel of course:

env | grep -e ^PATH -e ^LD_LIBRARY_PATH
LD_LIBRARY_PATH=/root/ompi_for_gpu/ompi/lib:/root/ompi_for_gpu/ucx/lib:/opt/rocm/lib
PATH=/root/ompi_for_gpu/ompi/bin:/root/.local/bin:/root/bin:/home/and-1/.local/bin:/home/and-1/bin:/usr/share/Modules/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/puppetlabs/bin:/root/ompi_for_gpu/ucx/bin/:/opt/rocm-5.4.3/llvm/bin/

The large BAR test executed without errors:

./check_large_bar
address buf 0x7fcc9c400000
Buf[0] = -1094795586
Buf[0] = 1

@yosefe setting these variables doesn't affect the ucx_info -d output; the result is the same as what I attached before. Maybe the output with UCX_MEM_LOG_LEVEL=debug and UCX_LOG_LEVEL=data will help you understand the problem: ucx_info.txt

vm.max_map_count has the default value:

cat /proc/sys/vm/max_map_count
262144

yosefe commented 1 year ago

@and-1 maybe https://github.com/openucx/ucx/pull/8822 can fix the issue?

and-1 commented 1 year ago

Unfortunately not, the result is the same.

edgargabriel commented 1 year ago

Are you executing in a docker container? If yes, maybe you could try the same test using e.g. an Ubuntu 20.04 or 22.04 container?

AlmaLinux is not a distribution officially supported by ROCm. I understand from some webpages that it is supposedly binary compatible with RHEL, but you never know. In addition, if AlmaLinux 9.2 is equivalent to RHEL 9.2, this might also be an issue, since the ROCm 5.4.x series is only validated up to RHEL 9.1.

edgargabriel commented 1 year ago

I have some updates on this ticket. I managed to get my hands on a RHEL 9.2 system with ROCm 5.6.0 and ran some tests.

yosefe commented 1 year ago

@edgargabriel can you pls set UCX_LOG_LEVEL_TRIGGER=error, to track the backtrace of the problematic mremap?

edgargabriel commented 1 year ago

@yosefe, the variable didn't provide any insights, but I got a backtrace from a debugger run. It seems that the issue is initiated from ucm/rocm/rocmmem (ucm_rocmmem_install, frame #10). I will need to investigate.

Thread 1 "a.out" hit Breakpoint 1, ucm_sys_realloc (ptr=0x7ffff7a0b008, size=4096) at ../../../src/ucm/util/sys.c:140
140             ucm_error("mremap(oldptr=%p oldsize=%zu, newsize=%zu) failed: %m",
Missing separate debuginfos, use: dnf debuginfo-install comgr-2.5.0.50700-crdb.3325.el9.x86_64 elfutils-libelf-0.188-3.el9.x86_64 glibc-2.34-60.el9.x86_64 hip-runtime-amd-5.6.31101.50700-crdb.3325.el9.x86_64 hsa-rocr-1.9.0.50700-crdb.3325.el9.x86_64 libgcc-11.3.1-4.3.el9.x86_64 libstdc++-11.3.1-4.3.el9.x86_64 libxml2-2.9.13-3.el9_2.1.x86_64 libzstd-1.5.1-2.el9.x86_64 ncurses-libs-6.2-8.20210508.el9.x86_64 numactl-libs-2.0.14-9.el9.x86_64 xz-libs-5.2.5-8.el9_0.x86_64
(gdb) where
#0  ucm_sys_realloc (ptr=0x7ffff7a0b008, size=4096) at ../../../src/ucm/util/sys.c:140
#1  0x00007ffff7f9d3ee in kh_resize_ucm_dl_symbol_hash (h=0x7ffff7fbf318, new_n_buckets=512) at ../../../src/ucm/util/reloc.c:61
#2  0x00007ffff7f9d849 in kh_put_ucm_dl_symbol_hash (h=0x7ffff7fbf318, key=0x7ffff7aa19bc "uct_tcp_cm_send_event", ret=0x7fffffffc840)
    at ../../../src/ucm/util/reloc.c:61
#3  0x00007ffff7f9f108 in ucm_dl_populate_symbols (dl_info=0x7ffff7fbf318, dlpi_addr=140737348481024, table=0x7ffff7aaa500,
    table_size=7728, strtab=0x7ffff7a9f540, symtab=0x7ffff7a9bcd0, dl_name=0x7ffff7fc6860 "/home/taccuser/edgar/UCX/lib/libuct.so.0")
    at ../../../src/ucm/util/reloc.c:281
#4  0x00007ffff7f9f5e5 in ucm_reloc_dl_info_get (phdr_info=0x7fffffffcae0,
    dl_name=0x7ffff7fc6860 "/home/taccuser/edgar/UCX/lib/libuct.so.0", dl_info_p=0x7fffffffca80) at ../../../src/ucm/util/reloc.c:375
#5  0x00007ffff7f9f9f1 in ucm_reloc_phdr_iterator (phdr_info=0x7fffffffcae0, size=64, data=0x7fffffffcb80)
    at ../../../src/ucm/util/reloc.c:465
#6  0x00007ffff79961a4 in dl_iterate_phdr () from /lib64/libc.so.6
#7  0x00007ffff7f9fb44 in ucm_reloc_apply_patch (patch=0x7ffff7fb1e20 <ucm_dlopen_reloc_patches>, libucm_base_addr=0)
    at ../../../src/ucm/util/reloc.c:502
#8  0x00007ffff7fa037b in ucm_reloc_install_dl_hooks () at ../../../src/ucm/util/reloc.c:716
#9  0x00007ffff7fa046d in ucm_reloc_modify (patch=0x7ffff73330c0 <patches>) at ../../../src/ucm/util/reloc.c:748
#10 0x00007ffff7330a3b in ucm_rocmmem_install (events=2097152) at ../../../../src/ucm/rocm/rocmmem.c:156
#11 0x00007ffff7f97123 in ucm_event_install (events=2097152) at ../../../src/ucm/event/event.c:553
#12 0x00007ffff7f971f7 in ucm_set_event_handler (events=2097152, priority=1000, cb=0x7ffff7a5e01a <ucs_rcache_unmapped_callback>,
    arg=0x72ab60) at ../../../src/ucm/event/event.c:596
#13 0x00007ffff7a60323 in ucs_rcache_t_init (self=0x72ab60, _myclass=0x7ffff7a95800 <ucs_rcache_t_class>, _init_count=0x7fffffffcde8,
    params=0x7fffffffce60, name=0x7fffebb959bc "rocm_copy", stats_parent=0x0) at ../../../src/ucs/memory/rcache.c:1266
#14 0x00007ffff7a60672 in ucs_rcache_create (arg0=0x7fffffffce60, arg1=0x7fffebb959bc "rocm_copy", arg2=0x0, obj_p=0x747500)
    at ../../../src/ucs/memory/rcache.c:1326
#15 0x00007fffebb8ec13 in uct_rocm_copy_md_open (component=0x7fffebb9a620 <uct_rocm_copy_component>, md_name=0x7fffffffcf80 "rocm_cpy",
    config=0x73e150, md_p=0x7fffffffcf20) at ../../../../src/uct/rocm/copy/rocm_copy_md.c:439
#16 0x00007ffff7aae5dd in uct_md_open (component=0x7fffebb9a620 <uct_rocm_copy_component>, md_name=0x7fffffffcf80 "rocm_cpy",
    config=0x73e150, md_p=0x73b750) at ../../../src/uct/base/uct_md.c:95
#17 0x00007ffff743240d in ucp_fill_tl_md (context=0x739110, cmpt_index=4 '\004', md_rsc=0x7fffffffcf80, tl_md=0x73b750)
    at ../../../src/ucp/core/ucp_context.c:1293
#18 0x00007ffff7432f20 in ucp_add_component_resources (context=0x739110, cmpt_index=4 '\004', avail_devices=0x7fffffffd110,
    avail_tls=0x7fffffffd0e0, dev_cfg_masks=0x7fffffffd1c0, tl_cfg_mask=0x7fffffffd1b8, config=0x6da650, aux_tls=0x7fffffffd0a0)
    at ../../../src/ucp/core/ucp_context.c:1488
#19 0x00007ffff7433c55 in ucp_fill_resources (context=0x739110, config=0x6da650) at ../../../src/ucp/core/ucp_context.c:1721
#20 0x00007ffff74353ac in ucp_init_version (api_major_version=1, api_minor_version=16, params=0x7fffffffd2e0, config=0x6da650,
    context_p=0x7ffff7f174c0 <ompi_pml_ucx+192>) at ../../../src/ucp/core/ucp_context.c:2165
#21 0x00007ffff7e28bea in mca_pml_ucx_open () from /home/taccuser/edgar/OpenMPI/lib/libmpi.so.40
#22 0x00007ffff7b317a2 in mca_base_framework_components_open () from /home/taccuser/edgar/OpenMPI/lib/libopen-pal.so.80
#23 0x00007ffff7e245f7 in mca_pml_base_open () from /home/taccuser/edgar/OpenMPI/lib/libmpi.so.40
#24 0x00007ffff7b322e1 in mca_base_framework_open () from /home/taccuser/edgar/OpenMPI/lib/libopen-pal.so.80
#25 0x00007ffff7c95473 in ompi_mpi_instance_init_common () from /home/taccuser/edgar/OpenMPI/lib/libmpi.so.40
#26 0x00007ffff7c961a4 in ompi_mpi_instance_init () from /home/taccuser/edgar/OpenMPI/lib/libmpi.so.40
#27 0x00007ffff7c88d78 in ompi_mpi_init () from /home/taccuser/edgar/OpenMPI/lib/libmpi.so.40
#28 0x00007ffff7cbd33e in PMPI_Init () from /home/taccuser/edgar/OpenMPI/lib/libmpi.so.40
#29 0x0000000000401196 in main ()