openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

How to change single copy VIA xpmem execution to the sender process #10019

Open arun-chandran-edarath opened 1 month ago

arun-chandran-edarath commented 1 month ago

Hi Everyone,

@yosefe @tvegas1

I am currently examining how MPI_Send (blocking send) executes with UCX in an intra-node scenario. At present, the memory copy (ucs_memcpy_relaxed()) is executed in the receiver process (rank or processor), as depicted below.

reciver_process_ntbt

By executing the same copy in the sender process instead, as shown below, we could significantly reduce cache-to-cache data transfers and conserve memory bandwidth.

sender_process_ntbt

However, I am struggling to find a runtime configuration that would allow me to execute this transfer in the sender process with the hint UCS_ARCH_MEMCPY_NT_DEST and benchmark it. Could anyone provide some guidance or suggestions on this matter?

Thank you in advance for your assistance.

--Arun
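
For readers unfamiliar with the idea behind a non-temporal destination hint, here is a minimal sketch of a copy that writes the destination with cache-bypassing stores. It is not UCX's implementation of ucs_memcpy_relaxed(); the helper name copy_nt_dest is hypothetical, and the sketch assumes an x86 CPU with SSE2 (it falls back to a plain memcpy elsewhere).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper (not a UCX function): copy len bytes from src to dst,
 * writing dst with non-temporal stores so the copied data does not
 * displace the copying core's cache contents. */
static void copy_nt_dest(void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);
#if defined(__SSE2__)
    uint8_t *d = dst;
    const uint8_t *s = src;

    /* _mm_stream_si128 needs a 16-byte aligned destination, so copy the
     * unaligned head bytes with a regular memcpy first. */
    size_t head = (16 - ((uintptr_t)d & 15)) & 15;
    if (head > len) {
        head = len;
    }
    memcpy(d, s, head);
    d += head;
    s += head;
    len -= head;

    /* Main loop: unaligned load from src, non-temporal store to dst. */
    for (; len >= 16; d += 16, s += 16, len -= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_stream_si128((__m128i *)d, v);
    }

    /* Make the streamed stores globally visible before returning. */
    _mm_sfence();

    /* Copy the remaining tail (fewer than 16 bytes). */
    memcpy(d, s, len);
#else
    /* Assumption: without SSE2, fall back to an ordinary copy. */
    memcpy(dst, src, len);
#endif
}
```

When the sender performs the copy this way, the data is written toward the receiver's buffer without first being pulled into the sender's cache, which is the effect the question is trying to benchmark.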

yosefe commented 1 month ago

Currently, the rkey_ptr protocol always does the memcpy on the receiver. To do the memcpy on the sender, we would need to implement a new variant of this protocol (with an extra control message).
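
To make the "extra control message" concrete, here is a toy, single-process model of the two flows. It is only a sketch under stated assumptions: all types and function names are hypothetical and none of this is UCX code. The message labels (RTS, RTR, ATS, ATP) are borrowed loosely from rendezvous terminology, and the exact messages a real sender-side variant would use are an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Which side performs the data copy (hypothetical type). */
typedef enum { SIDE_SENDER, SIDE_RECEIVER } side_t;

/* Per-transfer bookkeeping for this toy model (hypothetical type). */
typedef struct {
    int    ctrl_msgs; /* number of control messages exchanged */
    side_t copier;    /* side that executed the memcpy */
} flow_stats_t;

/* Receiver-side copy, modelled on the behaviour described above. */
static flow_stats_t flow_receiver_copy(void *recv_buf, const void *send_buf,
                                       size_t len)
{
    flow_stats_t st = {0, SIDE_RECEIVER};

    st.ctrl_msgs++;                  /* RTS: sender announces its buffer */
    memcpy(recv_buf, send_buf, len); /* receiver reads the sender's memory */
    st.ctrl_msgs++;                  /* ATS: receiver reports completion */
    return st;
}

/* Sender-side copy: the receiver must first expose its buffer, which is
 * the extra control message (assumed here to be an RTR-style reply). */
static flow_stats_t flow_sender_copy(void *recv_buf, const void *send_buf,
                                     size_t len)
{
    flow_stats_t st = {0, SIDE_SENDER};

    st.ctrl_msgs++;                  /* RTS: sender announces the message */
    st.ctrl_msgs++;                  /* RTR: receiver exposes its buffer */
    memcpy(recv_buf, send_buf, len); /* sender writes the receiver's memory */
    st.ctrl_msgs++;                  /* ATP: sender reports completion */
    return st;
}
```

In this model the receiver-side flow needs two control messages and the sender-side flow needs three, so any bandwidth saved by copying on the sender has to outweigh one additional round of signalling.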

tvegas1 commented 1 month ago

@arun-chandran-edarath, in case you want more details: without much thinking, and unsure about the perf result, it might be possible to implement this as a PoC either at:

arun-chandran-edarath commented 1 month ago

@yosefe and @tvegas1,

Thank you for your responses. I would like to clarify if the two suggestions provided are identical:

a) Implementing a new variant of the rkey_ptr protocol (with an extra control message) to perform memcpy on the sender.

b) Using an rndv rtr flow with sm/mm put primitives in UCP.

Could you please provide more specific details or elaborate on these suggestions? Additionally, it would be helpful if you could point me towards the relevant source code files or any examples that I could refer to.

--Arun