ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/

Proposal: SHM cuda-to-cuda performance improvement #8745

Closed · wenduwan closed this 1 year ago

wenduwan commented 1 year ago

Is your feature request related to a problem? Please describe.
This is a feature request from the AWS EFA team to improve cuda-to-cuda performance in SHM.

Describe the solution you'd like
Currently SHM implements cuda-to-cuda copies with cudaMemcpy whenever it's available, i.e. on the IPC path:

```c
static struct smr_pend_entry *smr_progress_ipc(...)
{
	...
	/* Both branches funnel into ofi_copy_{from,to}_hmem_iov, which
	 * for FI_HMEM_CUDA buffers ends up in cudaMemcpy. */
	if (cmd->msg.hdr.op == ofi_op_read_req) {
		hmem_copy_ret = ofi_copy_from_hmem_iov(ptr, cmd->msg.hdr.size,
					cmd->msg.data.ipc_info.iface,
					cmd->msg.data.ipc_info.device, iov,
					iov_count, 0);
	} else {
		hmem_copy_ret = ofi_copy_to_hmem_iov(cmd->msg.data.ipc_info.iface,
					cmd->msg.data.ipc_info.device, iov,
					iov_count, 0, ptr, cmd->msg.hdr.size);
	}
	...
}
```

Based on the EFA team's experiments, cudaMemcpy is slower than gdrcopy (if available) for message sizes up to 3 KB (measured on the p4d.24xlarge platform). But due to the current implementation in SHM, we cannot register both the tx and rx MRs with gdrcopy memory handles, because gdrcopy does not support device-to-device copies.
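
To illustrate the idea, here is a minimal sketch of the kind of size-based dispatch this proposal implies. The threshold value, the host bounce buffer, and all function/variable names are assumptions for illustration, not existing libfabric code; only gdr_copy_to_mapping/gdr_copy_from_mapping and cudaMemcpy are real APIs. Since gdrcopy has no device-to-device op, the small-message path stages through host memory with two CPU copies:

```c
#include <cuda_runtime.h>
#include <gdrapi.h>

/* Hypothetical threshold near the crossover point in the table below */
#define SMR_GDRCOPY_D2D_THRESHOLD 3072

/* dst_mh/dst_map and src_mh/src_map are pre-established gdrcopy
 * pin+map handles for the two device buffers; bounce is a host
 * staging buffer of at least 'size' bytes. */
static int smr_cuda_d2d_copy(gdr_mh_t dst_mh, void *dst_map,
			     gdr_mh_t src_mh, const void *src_map,
			     void *dst_dev, const void *src_dev,
			     void *bounce, size_t size)
{
	if (size <= SMR_GDRCOPY_D2D_THRESHOLD) {
		/* gdrcopy cannot copy device-to-device: stage via host */
		gdr_copy_from_mapping(src_mh, bounce, src_map, size);
		gdr_copy_to_mapping(dst_mh, dst_map, bounce, size);
		return 0;
	}
	/* Past the crossover point, one DMA-driven cudaMemcpy wins */
	return cudaMemcpy(dst_dev, src_dev, size,
			  cudaMemcpyDeviceToDevice) == cudaSuccess ? 0 : -1;
}
```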

Currently in EFA we work around this issue by gdrcopying application memory from cuda to host before dispatching it to SHM. But this has become problematic since we are migrating to the Peer API, which is designed to let EFA call SHM's fi_send* with the application MR directly.
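
For context, a minimal sketch of that workaround, under stated assumptions: src_mh/src_map are an existing gdrcopy pin+map of the CUDA source buffer, and bounce is a host staging buffer already registered with the SHM endpoint. All names here are hypothetical:

```c
#include <rdma/fi_endpoint.h>
#include <gdrapi.h>

static ssize_t send_cuda_via_host_bounce(struct fid_ep *shm_ep,
					 gdr_mh_t src_mh, const void *src_map,
					 void *bounce, void *desc, size_t len,
					 fi_addr_t dest, void *ctx)
{
	/* CPU-driven copy out of device memory through the BAR mapping */
	gdr_copy_from_mapping(src_mh, bounce, src_map, len);
	/* SHM now sees plain host memory, avoiding cudaMemcpy entirely */
	return fi_send(shm_ep, bounce, len, desc, dest, ctx);
}
```

The Peer API removes the spot where this staging happens, which is why the copy selection needs to move into SHM itself.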

Therefore we are planning to enhance SHM (and later SM2) around the handling of FI_HMEM_CUDA - the idea is to take advantage of gdrcopy over cudaMemcpy for smaller messages. This will roughly incur changes in:

Describe alternatives you've considered

I also considered encapsulating the logic in hmem_cuda, e.g. cuda_copy_from_dev(device, host, dev, size) could be smart enough to choose the faster memcpy op. But the current API does not offer a fast way to detect whether the source and destination are both cuda buffers (which gdrcopy does not support) - or maybe there is; please let me know your thoughts.
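
For reference, one way such a check could look is the CUDA driver API's pointer-attribute query. This is a sketch of a possible detection helper, not existing hmem_cuda code, and the per-copy query overhead may be exactly what makes this approach too slow for the hot path:

```c
#include <stdint.h>
#include <cuda.h>

/* Returns nonzero if ptr refers to CUDA device memory. Assumes the
 * driver API has been initialized (cuInit) and a context is current. */
static int is_cuda_dev_ptr(const void *ptr)
{
	CUmemorytype type = 0;

	if (cuPointerGetAttribute(&type, CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
				  (CUdeviceptr)(uintptr_t)ptr) != CUDA_SUCCESS)
		return 0;
	return type == CU_MEMORYTYPE_DEVICE;
}
```

A copy wrapper could then fall back to cudaMemcpy whenever both the source and destination test as device buffers, since gdrcopy cannot serve that case.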

Additional context
Comparing cuda-to-cuda osu_latency, with and without gdrcopy (all latencies in us):

| Size (bytes) | EFA+SHM w/ gdrcopy | EFA+SHM w/o gdrcopy |
| ---: | ---: | ---: |
| 0 | 1.45 | 1.43 |
| 1 | 3.22 | 15.95 |
| 2 | 4.70 | 15.78 |
| 4 | 4.70 | 15.74 |
| 8 | 4.67 | 15.78 |
| 16 | 3.17 | 15.73 |
| 32 | 3.16 | 15.67 |
| 64 | 3.35 | 15.63 |
| 128 | 4.57 | 15.70 |
| 256 | 5.37 | 15.52 |
| 512 | 5.26 | 15.55 |
| 1024 | 7.93 | 15.68 |
| 2048 | 11.70 | 15.87 |
| 2560 | 13.38 | 15.86 |
| 3072 | 16.54 | 15.95 |
| 3584 | 15.92 | 15.93 |
| 4096 | 15.96 | 16.05 |
wckzhang commented 1 year ago

ACK

wenduwan commented 1 year ago

Merged PR series:

Pending bugfix