openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

Question: does UCX support FPGA to AMDGPU (ROCm) p2p transfer? #9598

Open littlewu2508 opened 10 months ago

littlewu2508 commented 10 months ago

Hello, I'm a researcher trying to achieve p2p data transfer from an FPGA (Xilinx Alveo U50) to an AMDGPU. I read https://rocm.docs.amd.com/en/latest/how-to/gpu-enabled-mpi.html and found that UCX is probably the direction to look into. There is also an existing implementation for FPGA to NVIDIA GPU transfer at https://github.com/RC4ML/FpgaNIC, using https://github.com/NVIDIA/gdrcopy, and I noticed that the UCX documentation mentions rocm_gdr support.

However, beyond these I can't find more clues about how FPGA-AMDGPU p2p can be implemented. Does anyone know about this?

edgargabriel commented 10 months ago

@littlewu2508 the UCX software stack is not tested with the Alveo cards at the moment, and hence it is not officially supported. I suspect that at least a TCP connection should be possible with the Alveo cards, but I cannot give any hints or advice since we do not test this scenario.

As a side note, the rocm_gdr component has been removed from UCX starting with version 1.14.0, since it is not truly required for AMD GPUs. AMD GPUs are typically set up with large BAR support, which allows the CPU to map the entire GPU address space into the host address space, so a regular memcpy works on GPU memory even without gdr_copy. (We might need to check and update the UCX documentation.)
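The "map it, then copy" idea can be sketched in a few lines. This is only an illustration and not UCX or ROCm code: it maps a region into the process address space and copies bytes into it with a plain slice assignment, which is the Python analogue of memcpy. A temporary file stands in for the device region so that the sketch runs anywhere. On Linux a PCIe BAR is exposed as a file such as /sys/bus/pci/devices/<BDF>/resource<N>, but whether mapping that file gives usable access to GPU memory on a particular system is an assumption, and the helper names below are hypothetical.

```python
import mmap
import os
import tempfile


def make_standin_region(size: int = 4096) -> str:
    """Create a zero-filled temporary file that stands in for a mappable region.

    Hypothetical helper: on real hardware the path would be a PCIe BAR
    resource file, which this sketch does not assume is available.
    """
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"\x00" * size)
        return f.name


def copy_into_mapped_region(path: str, payload: bytes, offset: int = 0) -> bytes:
    """Map `path` read/write, copy `payload` to `offset`, and read it back."""
    fd = os.open(path, os.O_RDWR)
    try:
        size = os.fstat(fd).st_size
        if offset < 0 or offset + len(payload) > size:
            raise ValueError("payload does not fit in the mapped region")
        # Shared read/write mapping of the whole region.
        with mmap.mmap(fd, size) as region:
            # Plain copy into the mapping, no special copy library involved.
            region[offset:offset + len(payload)] = payload
            return bytes(region[offset:offset + len(payload)])
    finally:
        os.close(fd)


if __name__ == "__main__":
    path = make_standin_region()
    try:
        print(copy_into_mapped_region(path, b"fpga-data", offset=128))
    finally:
        os.remove(path)
```

Once a region is mapped, the copy itself needs nothing beyond ordinary loads and stores; the hard part on real hardware is obtaining a valid mapping, which the stand-in file sidesteps.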

littlewu2508 commented 10 months ago

> @littlewu2508 the UCX software stack is not tested with the Alveo cards at the moment, and hence it is not officially supported. I suspect that at least a TCP connection should be possible with the Alveo cards, but I cannot give any hints or advice since we do not test this scenario.

Thank you very much! I have a question about the TCP connection. Since I'm focusing on p2p transfer over the PCIe bus, I don't see what role TCP can play here.

> As a side note, the rocm_gdr component has been removed from UCX starting with version 1.14.0, since it is not truly required for AMD GPUs. AMD GPUs are typically set up with large BAR support, which allows the CPU to map the entire GPU address space into the host address space, so a regular memcpy works on GPU memory even without gdr_copy. (We might need to check and update the UCX documentation.)

That's very helpful, and it points to another approach: memcpy via large BAR support. I found https://xilinx.github.io/XRT/2022.1/html/p2p.html and it seems that Alveo cards also support this for P2P (including with third-party PCIe devices).

edgargabriel commented 10 months ago

> Thank you very much! I have a question about the TCP connection. Since I'm focusing on p2p transfer over the PCIe bus, I don't see what role TCP can play here.

Unfortunately, I do not know enough about the Alveo cards. My comment regarding TCP was not about the GPU to NIC transfer, but about data transfer between two nodes with Alveo cards (I am not sure whether the Alveo cards support verbs).