ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/

Proposal: SHM cuda-to-cuda performance improvement #8745

Closed · wenduwan closed this 1 year ago

wenduwan commented 1 year ago

Is your feature request related to a problem? Please describe.
This is a feature request from the AWS EFA team to improve cuda-to-cuda performance in SHM.

Describe the solution you'd like
Currently SHM implements cuda-to-cuda copies with cudaMemcpy whenever it's available, i.e. on the IPC path:

```c
static struct smr_pend_entry *smr_progress_ipc(...)
{
	...
	/* Both branches funnel into ofi_copy_{from,to}_hmem_iov, which
	 * for FI_HMEM_CUDA buffers ends up in cudaMemcpy. */
	if (cmd->msg.hdr.op == ofi_op_read_req) {
		hmem_copy_ret = ofi_copy_from_hmem_iov(ptr, cmd->msg.hdr.size,
					cmd->msg.data.ipc_info.iface,
					cmd->msg.data.ipc_info.device, iov,
					iov_count, 0);
	} else {
		hmem_copy_ret = ofi_copy_to_hmem_iov(cmd->msg.data.ipc_info.iface,
					cmd->msg.data.ipc_info.device, iov,
					iov_count, 0, ptr, cmd->msg.hdr.size);
	}
	...
}
```

Based on the EFA team's experiments, cudaMemcpy is slower than gdrcopy (if available) for message sizes up to 3 KB (measured on the p4d.24xlarge platform). But due to the current implementation in SHM, we cannot register both the tx and rx MRs with gdrcopy memory handles, because gdrcopy does not support device-to-device copies.
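
To illustrate the idea, here is a minimal sketch of the kind of size-based dispatch this proposal implies. The threshold value, the host bounce buffer, and all function/variable names are assumptions for illustration, not existing libfabric code; only gdr_copy_to_mapping/gdr_copy_from_mapping and cudaMemcpy are real APIs. Since gdrcopy has no device-to-device op, the small-message path stages through host memory with two CPU copies:

```c
#include <cuda_runtime.h>
#include <gdrapi.h>

/* Hypothetical threshold near the crossover point in the table below */
#define SMR_GDRCOPY_D2D_THRESHOLD 3072

/* dst_mh/dst_map and src_mh/src_map are pre-established gdrcopy
 * pin+map handles for the two device buffers; bounce is a host
 * staging buffer of at least 'size' bytes. */
static int smr_cuda_d2d_copy(gdr_mh_t dst_mh, void *dst_map,
			     gdr_mh_t src_mh, const void *src_map,
			     void *dst_dev, const void *src_dev,
			     void *bounce, size_t size)
{
	if (size <= SMR_GDRCOPY_D2D_THRESHOLD) {
		/* gdrcopy cannot copy device-to-device: stage via host */
		gdr_copy_from_mapping(src_mh, bounce, src_map, size);
		gdr_copy_to_mapping(dst_mh, dst_map, bounce, size);
		return 0;
	}
	/* Past the crossover point, one DMA-driven cudaMemcpy wins */
	return cudaMemcpy(dst_dev, src_dev, size,
			  cudaMemcpyDeviceToDevice) == cudaSuccess ? 0 : -1;
}
```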

Currently in EFA we work around this issue by gdrcopying application memory from cuda to host before dispatching it to SHM. But this has become problematic since we are migrating to the Peer API, which is designed to let EFA call SHM's fi_send* with the application MR directly.
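
For context, a minimal sketch of that workaround, under stated assumptions: src_mh/src_map are an existing gdrcopy pin+map of the CUDA source buffer, and bounce is a host staging buffer already registered with the SHM endpoint. All names here are hypothetical:

```c
#include <rdma/fi_endpoint.h>
#include <gdrapi.h>

static ssize_t send_cuda_via_host_bounce(struct fid_ep *shm_ep,
					 gdr_mh_t src_mh, const void *src_map,
					 void *bounce, void *desc, size_t len,
					 fi_addr_t dest, void *ctx)
{
	/* CPU-driven copy out of device memory through the BAR mapping */
	gdr_copy_from_mapping(src_mh, bounce, src_map, len);
	/* SHM now sees plain host memory, avoiding cudaMemcpy entirely */
	return fi_send(shm_ep, bounce, len, desc, dest, ctx);
}
```

The Peer API removes the spot where this staging happens, which is why the copy selection needs to move into SHM itself.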

Therefore we are planning to enhance SHM (and later SM2) around the handling of FI_HMEM_CUDA - the idea is to take advantage of gdrcopy over cudaMemcpy for smaller messages. This will roughly incur changes in:

Describe alternatives you've considered

I also considered encapsulating the logic in hmem_cuda, e.g. cuda_copy_from_dev(device, host, dev, size) could be smart enough to choose the faster memcpy op. But the current API does not offer a fast way to detect whether the source and destination are both cuda buffers (which gdrcopy does not support) - or maybe there is; please let me know your thoughts.
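
For reference, one way such a check could look is the CUDA driver API's pointer-attribute query. This is a sketch of a possible detection helper, not existing hmem_cuda code, and the per-copy query overhead may be exactly what makes this approach too slow for the hot path:

```c
#include <stdint.h>
#include <cuda.h>

/* Returns nonzero if ptr refers to CUDA device memory. Assumes the
 * driver API has been initialized (cuInit) and a context is current. */
static int is_cuda_dev_ptr(const void *ptr)
{
	CUmemorytype type = 0;

	if (cuPointerGetAttribute(&type, CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
				  (CUdeviceptr)(uintptr_t)ptr) != CUDA_SUCCESS)
		return 0;
	return type == CU_MEMORYTYPE_DEVICE;
}
```

A copy wrapper could then fall back to cudaMemcpy whenever both the source and destination test as device buffers, since gdrcopy cannot serve that case.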

Additional context
Comparing cuda-to-cuda osu_latency, with and without gdrcopy (all latencies in us):

| Size (bytes) | EFA+SHM w/ gdrcopy | EFA+SHM w/o gdrcopy |
| ---: | ---: | ---: |
| 0 | 1.45 | 1.43 |
| 1 | 3.22 | 15.95 |
| 2 | 4.70 | 15.78 |
| 4 | 4.70 | 15.74 |
| 8 | 4.67 | 15.78 |
| 16 | 3.17 | 15.73 |
| 32 | 3.16 | 15.67 |
| 64 | 3.35 | 15.63 |
| 128 | 4.57 | 15.70 |
| 256 | 5.37 | 15.52 |
| 512 | 5.26 | 15.55 |
| 1024 | 7.93 | 15.68 |
| 2048 | 11.70 | 15.87 |
| 2560 | 13.38 | 15.86 |
| 3072 | 16.54 | 15.95 |
| 3584 | 15.92 | 15.93 |
| 4096 | 15.96 | 16.05 |
wckzhang commented 1 year ago

ACK

wenduwan commented 1 year ago

Merged PR series:

Pending bugfix