RNDV Transfer from Host Memory to CUDA Managed Memory

J-StrawHat commented 3 months ago

Describe the bug

Is it possible to use Rendezvous protocol to transfer data from host memory on a node (without GPUs) to CUDA managed memory on another node (with GPUs)?

Steps to Reproduce

Run the example code: examples/ucp_client_server.c

Server (with GPUs)

$ export CUDA_VISIBLE_DEVICES=0
$ ./ucp_client_server -c am -s 2097152 -m cuda-managed
server is listening on IP 0.0.0.0 port 13337
Waiting for connection...
Server received a connection request from client at address 192.168.0.215:52902
error handling callback was invoked with status -25 (Connection reset by remote peer)
unable to receive UCX message (Connection reset by remote peer)
server failed on iteration #1

Client (no GPUs visible)

$ export CUDA_VISIBLE_DEVICES=""
$ ./ucp_client_server -a 192.168.0.215 -c am -s 2097152
[1719309408.296035] [xfusion5:2129788:0]  proto_reconfig.c:48   UCX  ERROR cannot find remote protocol for: client_server intra-node cfg#3 | rndv_send from host memory to cuda-managed
unable to send UCX message (Request canceled)
client failed on iteration #1
[1719309408.313073] [xfusion5:2129788:0]           mpool.c:54   UCX  WARN  object 0x55c3775b7600 was not returned to mpool ucp_rkeys
[1719309408.329625] [xfusion5:2129788:0]          rcache.c:701  UCX  WARN  ucp_rcache: destroying inuse region 0x55c376c4f7e0 [0x55c3773162c0..0x55c3775162c0] g- rw ref 1 md[4]=mlx5_0

UCX version used (release v1.16.0) + UCX configure flags

$ ucx_info -v
Library version: 1.16.0
Library path: /home/xxx/lib/ucx-1.16.0/install/lib/libucs.so.0
API headers version: 1.16.0
Git branch '', revision 
Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-gtest --enable-examples --enable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check --without-xpmem --without-java --with-cuda=/usr/local/cuda-11.7 --with-gdrcopy --prefix=/home/xxx/lib/ucx-1.16.0/install

UCX_TLS was not set.

Setup and versions

OS version + CPU architecture
- Ubuntu 20.04.6 LTS
- Linux xfusion5 5.4.0-174-generic #193-Ubuntu SMP Thu Mar 7 14:29:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

For RDMA/IB/RoCE related issues:

Driver version:
- MLNX_OFED_LINUX-24.01-0.3.3.1

HW information:

hca_id: mlx5_0
transport:                      InfiniBand (0)
fw_ver:                         16.35.3502
node_guid:                      506b:4b03:0028:4fa8
sys_image_guid:                 506b:4b03:0028:4fa8
vendor_id:                      0x02c9
vendor_part_id:                 4119
hw_ver:                         0x0
board_id:                       MT_0000000011
phys_port_cnt:                  1
port:   1
                state:                  PORT_ACTIVE (4)
                max_mtu:                4096 (5)
                active_mtu:             1024 (3)
                sm_lid:                 0
                port_lid:               0
                port_lmc:               0x00
                link_layer:             Ethernet

For GPU related issues:
- GPU type: Tesla V100-PCIE-32GB
- CUDA Toolkit: cuda_11.7.r11.7/compiler.31442593_0
- NVIDIA Drivers Version: 550.54.15

yosefe commented 3 months ago

@J-StrawHat this asymmetric configuration is currently not supported: the client does not support CUDA memory, so it cannot work out the right response to the RTR message. We aim to improve this in future releases.

J-StrawHat commented 3 months ago

Thank you for your response. I would also like to ask about best practices for handling this asymmetric configuration (host memory -> CUDA managed memory) in the current release. I tested the Stream API and it seems to support this configuration. Any further recommendations or insights would be greatly appreciated.

yosefe commented 3 months ago

I'd suggest trying to set UCX_RNDV_SCHEME=get_zcopy or UCX_RNDV_THRESH=inf

J-StrawHat commented 3 months ago

Thank you so much

yosefe commented 3 months ago

@J-StrawHat just to clarify, did either of the suggestions help, and if so, which one?

J-StrawHat commented 3 months ago

After conducting several tests, I found that setting UCX_RNDV_SCHEME=get_zcopy still results in the same error. However, setting UCX_RNDV_THRESH=inf allows the program to run correctly. With that setting, the AM transfer also shows lower latency than the Stream API for large data transfers.
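
For anyone hitting the same error, here is a minimal sketch of the workaround as a shell setup. It assumes the variable is exported on both the server and the client before launching; the thread does not say which side needs it, so setting it on both is an assumption. The example commands are the ones from the report above, shown as comments.

```shell
# Workaround sketch: an infinite rendezvous threshold means no message size
# ever reaches it, which should keep UCX off the RNDV protocol.
# Assumption: exported on BOTH sides (the thread does not specify which).
export UCX_RNDV_THRESH=inf

# Server (GPU visible), same command as in the report:
#   CUDA_VISIBLE_DEVICES=0 ./ucp_client_server -c am -s 2097152 -m cuda-managed
# Client (no GPU visible), same command as in the report:
#   CUDA_VISIBLE_DEVICES="" ./ucp_client_server -a 192.168.0.215 -c am -s 2097152
```

Note that UCX_RNDV_SCHEME=get_zcopy did not help in my tests, so only the threshold setting is shown.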
