Enhanced commitment for implementing FI_OPT_CUDA_API_PERMITTED

wzamazon commented 1 year ago

Is your feature request related to a problem? Please describe.

libfabric users need a stronger commitment from libfabric providers to implement FI_OPT_CUDA_API_PERMITTED.

Describe the solution you'd like

The FI_OPT_CUDA_API_PERMITTED option was introduced so that middleware like NCCL can prohibit an endpoint from making CUDA API calls.

This issue proposes that libfabric make a stronger commitment to this option.

Specifically, I propose adding

"Any libfabric provider that claims support for FI_HMEM is guaranteed to implement this option"

to the documentation of this option.

This is because the information from this option is critical for NCCL, and NCCL absolutely needs it. If a provider supports FI_HMEM but does not implement this option, NCCL does not know how to proceed.
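To illustrate why an unimplemented option leaves middleware stuck, here is a minimal sketch in C of how a caller might interpret the result of setting the option to false. The enum and function names are hypothetical and do not come from libfabric or NCCL. The mapping of error codes, in particular -ENOPROTOOPT for a provider that does not recognize the option, is an assumption of this sketch rather than documented behavior. Standard errno values stand in for libfabric's FI_E* codes so that the sketch compiles without libfabric installed.

```c
#include <errno.h>

/* Hypothetical outcomes for a middleware such as NCCL. */
enum cuda_policy {
    CUDA_CALLS_DISABLED,  /* provider accepted the request */
    PROVIDER_NEEDS_CUDA,  /* provider cannot operate without CUDA API calls */
    OPTION_UNKNOWN,       /* provider does not implement the option */
    SETOPT_FAILED         /* any other error */
};

/*
 * Classify the return value of a call such as
 *   fi_setopt(&ep->fid, FI_OPT_ENDPOINT, FI_OPT_CUDA_API_PERMITTED,
 *             &permitted, sizeof(permitted));
 * made with permitted == false. libfabric reports errors as negated
 * errno-style codes; which code each provider returns in each case is
 * assumed here for illustration.
 */
enum cuda_policy classify_setopt_result(int rc)
{
    switch (rc) {
    case 0:
        return CUDA_CALLS_DISABLED;
    case -EOPNOTSUPP:
        return PROVIDER_NEEDS_CUDA;
    case -ENOPROTOOPT:
        return OPTION_UNKNOWN;
    default:
        return SETOPT_FAILED;
    }
}
```

The OPTION_UNKNOWN branch is the case this proposal aims to eliminate: the caller cannot tell whether the provider will call the CUDA API or not.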

Additional context

I looked into the code. It seems that there are 4 providers that support FI_HMEM: shm, efa, verbs and rxm.

I have a PR to implement this option for EFA.

SHM should be straightforward. I can put up a PR for that too.

I can look into RxM and Verbs too (though help is appreciated).

wzamazon commented 1 year ago

I opened https://github.com/ofiwg/libfabric/pull/8633 to implement this option for all 4 providers.

The PR also updates the documentation to strengthen the commitment.

wzamazon commented 1 year ago

The PR was merged.

ofiwg/libfabric #8639