[Open] mkre opened this issue 3 years ago
Anyone got an idea about this? Pinging @bureddy...
@mkre The first transfer with GPUDirect RDMA is expected to have high overhead because it involves CUDA memory registration with the IB HCA. Is it possible to reuse the buffer from the application?
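For illustration, the buffer-reuse pattern suggested here might look roughly like the standalone sketch below (plain CUDA-aware MPI, not AmgX code; the loop and message sizes are invented). The key point is that the device buffer is allocated once and its address never changes, so only the first transfer should pay the registration cost, assuming the registration cache on the UCX side is in effect.

```cpp
// Minimal sketch (not AmgX code) of reusing one long-lived device buffer:
// it is registered with the HCA on the first transfer, and later transfers
// of any size reuse that registration. Run with at least 2 MPI ranks.
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t max_elems = 1 << 20;                // largest message we expect
    double* d_buf = nullptr;
    cudaMalloc(&d_buf, max_elems * sizeof(double));  // allocated once, reused

    const int peer = rank ^ 1;                       // simple 2-rank exchange
    for (int iter = 0; iter < 50; ++iter) {
        int elems = 1024 << (iter % 10);             // varying message sizes
        if (rank == 0) {
            MPI_Send(d_buf, elems, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(d_buf, elems, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
        // Because d_buf's address never changes, only the very first transfer
        // should pay the GPUDirect RDMA memory-registration cost.
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```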
@bureddy I guess that question would be one for the AmgX devs to answer. Do you know of anyone seeing performance benefits from using GPUdirect with AmgX? Should I raise my question on the AmgX issue tracker, or do you know anyone working on AmgX you could ping here (might be a long shot, but given that you are now working for the same company...)?
Describe the bug
We are in the process of evaluating the performance of AmgX on our GPU cluster. AmgX has an optional setting to enable GPUDirect MPI communication. However, enabling it seems to cause a performance decline rather than an improvement compared to the vanilla implementation using host staging. I added simple timing instrumentation to these two AmgX functions (one of which is called depending on the AmgX GPUDirect setting):
Here are the timings of the first 50 invocations of both functions:
It becomes obvious that some invocations of this function are significantly more expensive when using GPUDirect. Specifically, it seems like the first invocation for a given buffer size is very expensive. On the other hand, the fastest invocations for a given buffer size are faster when using GPUDirect compared to vanilla (as expected).
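The timing instrumentation described above could look roughly like the following sketch (exchange_halo is a hypothetical stand-in, not the actual AmgX function name):

```cpp
// Sketch of per-invocation wall-clock timing around a communication routine,
// using MPI_Wtime(). "exchange_halo" is a hypothetical stand-in for the AmgX
// function being measured.
#include <mpi.h>
#include <cstdio>

void exchange_halo(size_t bytes) {
    // ... actual device-to-device or host-staged MPI exchange goes here ...
    (void)bytes;
}

void timed_exchange_halo(int invocation, size_t bytes) {
    double t0 = MPI_Wtime();
    exchange_halo(bytes);
    double t1 = MPI_Wtime();
    std::printf("invocation %d, %zu bytes: %.1f us\n",
                invocation, bytes, (t1 - t0) * 1e6);
}
```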
FWIW, I have checked the performance of our Open MPI + UCX stack using osu_bw and osu_latency, and it is looking alright. Is there any explanation or remedy for this behavior?
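(For completeness: with a CUDA-enabled build of the OSU micro-benchmarks, the device-to-device path itself can be exercised directly, e.g. something along the lines of `mpirun -np 2 ./osu_bw -d cuda D D` and the equivalent `osu_latency` run; the exact options depend on the OMB version and launcher, so treat that invocation as an assumption rather than the command actually used.)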
Steps to Reproduce
Setup and versions
2 similar nodes, each with the following setup:
Additional information (depending on the issue)
ucx_info -d: ucx_info_d.txt