**Is your feature request related to a problem? Please describe.**
This is a feature request from the AWS EFA team to improve cuda-to-cuda performance in SHM.
**Describe the solution you'd like**
Currently SHM implements cuda-to-cuda copies with `cudaMemcpy` whenever it's available, i.e. via ipc. Based on the EFA team's experiments, `cudaMemcpy` is slower than gdrcopy (when available) for message sizes up to 3KB (measured on the p4d.24xlarge platform). But due to the current implementation in SHM, we cannot register both the tx and rx MR using a gdrcopy memory handle, because gdrcopy does not support device-to-device copies.
Currently in EFA we work around this issue by gdrcopying application memory from cuda to host before dispatching it to SHM. But this has become problematic as we migrate to the Peer API, which is designed to allow EFA to call SHM `fi_send*` with the application MR directly.
Therefore we are planning to enhance SHM (and later SM2) around the handling of `FI_HMEM_CUDA`: the idea is to prefer gdrcopy over `cudaMemcpy` for smaller messages. This will roughly require changes in:
1. Additional logic to determine whether `struct ofi_mr->device` is a gdrcopy memory handle.
2. De-selecting ipc when a gdrcopy memory handle is present and the message is smaller than X bytes (going to start with 3KB, see table below), e.g.
```c
static ssize_t smr_generic_sendmsg(...)
{
	...
	/* Do not inline/inject if IPC is available so device to device
	 * transfer may occur if possible - unless gdrcopy should be used */
	if (iov_count == 1 && desc && desc[0]) {
		if (iface == FI_HMEM_CUDA && is_gdr_mem_handle(device) &&
		    total_len < some_size) {
			use_ipc = false;
		} else {
			smr_desc = (struct ofi_mr *) *desc;
			use_ipc = ofi_hmem_is_ipc_enabled(smr_desc->iface) &&
				  (smr_desc->flags & FI_HMEM_DEVICE_ONLY) &&
				  !(op_flags & FI_INJECT);
		}
	}

	proto = smr_select_proto(use_ipc, smr_cma_enabled(ep, peer_smr), op,
				 total_len, op_flags);
	...
}
```
3. Preventing the gdrcopy memory handle from being used in device-to-device copies, by overriding the `device` value in e.g. `ofi_copy_from_hmem_iov`.
**Describe alternatives you've considered**
I also considered encapsulating the logic in hmem_cuda, e.g. `cuda_copy_from_dev(device, host, dev, size)` could be made smart enough to choose the faster memcpy op, but the current API does not offer a fast way to detect whether the source and dest are both cuda buffers (which gdrcopy does not support). Or maybe there is one - please let me know your thoughts.
**Additional context**
Comparing cuda-to-cuda OSU latency, with and without gdrcopy.