Closed by wzamazon 1 year ago
Peer to peer is meant to describe PCI peer to peer transfers, or device to device transfers that do not require bouncing data through host buffers. This could also apply to other device buses, not just PCI.
I see.
I think that for the case of NCCL, FI_HMEM_P2P_REQUIRED is too strong. Basically, NCCL needs a way to know whether the provider is capable of P2P, not necessarily that all transfers must go through peer-to-peer.
I am reading the man page for FI_HMEM_P2P_ENABLED. It does not specify what a provider should do if it does not support peer-to-peer.
Would it be reasonable for a provider to return -FI_EOPNOTSUPP if the user sets FI_HMEM_P2P_ENABLED and the provider is incapable of peer-to-peer support?
Maybe the question is whether HMEM_P2P_REQUIRED is useful? Or is it only useful if it also allows gdrcopy?
Does gdrcopy behave the same as if p2p were used?
> Maybe the question is whether HMEM_P2P_REQUIRED is useful? Or is it only useful if it also allows gdrcopy?
I think P2P_REQUIRED is still useful, if we define P2P support as the NIC accessing HMEM memory directly.
I can think of at least one case where NCCL wants libfabric to use only the NIC to access HMEM memory (i.e., NOT use gdrcopy): when NCCL uses its LL128 protocol.
> Does gdrcopy behave the same as if p2p were used?
I do not think so. gdrcopy basically maps GPU memory into the host's memory address space and then does a memcpy, so the transfer is driven by the CPU.
So, it sounds like we need some other option that can be used to query/restrict the type of operations that a provider can undertake. Maybe this is a new HMEM option, or some sort of XPU option. Right now there's no way to convey that P2P is okay, but if you can't use P2P, then only this 'other' mechanism is usable.
That's hard to define generically, however. Maybe it's something like P2P_OR_CPU_ONLY?
From the ofiwg call:
- Keep the current FI_HMEM_P2P options restrictive in their definition.
- May need a CUDA-specific option: NCCL forbids any CUDA call from any lower layer.
- Proposal: FI_CUDA_API_ENABLED/ALLOWED/DISABLED/PERMITTED? A boolean option is sufficient.
https://github.com/ofiwg/libfabric/pull/8624 introduced FI_CUDA_API_PERMITTED
Has this issue been resolved with the introduction of FI_CUDA_API_PERMITTED?
Yes
This came from the discussion in https://github.com/ofiwg/libfabric/pull/8529
The background is that applications like NCCL need a way to specify that a libfabric endpoint cannot make calls to the CUDA API to support CUDA memory.
@shefty suggested using FI_OPT_HMEM_P2P_REQUIRED, which is currently documented as follows:
From https://ofiwg.github.io/libfabric/main/man/fi_endpoint.3.html
However, to use this option for the purpose I described, we need a definition of "peer to peer support", which is lacking in the fi_endpoint document. So I opened this issue to ask whether the libfabric community can agree on a definition of "peer to peer" support.
One thing I want to mention is that NCCL does allow libfabric to use GDRcopy; see this comment from @jdinan. The EFA provider does use GDRcopy when used by NCCL, and we found it to be efficient for small messages.
I understand that other providers, like RxM, also want to use GDRcopy to support NCCL.
Therefore, I think it would be ideal if we could define "peer to peer support" in a way that mechanisms like GDRcopy count as "peer to peer" support.