oneapi-src / oneCCL

oneAPI Collective Communications Library (oneCCL)
https://oneapi-src.github.io/oneCCL

Fixes for using OFI shared memory provider #79

Closed vsanjeep closed 1 year ago

vsanjeep commented 2 years ago

oneCCL can use two endpoints: a shared memory provider endpoint for intra-node communication and a network provider endpoint for inter-node communication.

The shared memory provider is enabled through the environment variable CCL_ATL_SHM. However, two bugs in the code base prevent the OFI shm provider from being used.

  1. The CCL_ATL_SHM setting is not used during OFI transport initialization, so the shm provider for intra-node communication remains disabled (see atl_ofi::init).
  2. The OFI shm provider initiates a connection request on the first message/RMA operation and returns a retry status. It expects progress to be made before the retry so that the connection request can complete. The oneCCL retry routine polls before each retry to make progress. However, under certain conditions the progress type is set to ATL_PROGRESS_CHECK (see atl_ofi::init), and the poll method does not progress when the type is CHECK. Hence the shm provider keeps returning the retry status.
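
To illustrate item 2, here is a minimal, self-contained sketch of the interaction. It is a toy model and not oneCCL or libfabric code: `mock_shm_endpoint`, `status`, `progress_mode`, `poll`, and `send_with_retry` are all hypothetical names, and the lazy-connect behavior is simplified to a pair of flags.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical status codes; not oneCCL or libfabric identifiers.
enum class status { success, again };

// Hypothetical stand-ins for the two progress types discussed above.
enum class progress_mode { poll, check };

// Toy model of a provider that connects lazily: the first send queues a
// connection request and reports `again` until progress() has been called.
struct mock_shm_endpoint {
    bool connect_requested = false;
    bool connected = false;

    status send() {
        if (connected) return status::success;
        connect_requested = true;  // connection request is only queued here
        return status::again;      // caller is expected to retry
    }

    void progress() {
        if (connect_requested) connected = true;  // completes the connection
    }
};

// Poll step run before each retry. In `check` mode it does not drive the
// provider, which is the situation described in item 2.
void poll(mock_shm_endpoint& ep, progress_mode mode) {
    if (mode == progress_mode::poll) ep.progress();
}

// Retries send() up to max_attempts times. Returns the number of attempts
// used on success, or 0 if the send never succeeded.
std::size_t send_with_retry(mock_shm_endpoint& ep, progress_mode mode,
                            std::size_t max_attempts) {
    assert(max_attempts > 0);
    for (std::size_t attempt = 1; attempt <= max_attempts; ++attempt) {
        if (ep.send() == status::success) return attempt;
        poll(ep, mode);  // only makes progress when mode == poll
    }
    return 0;
}
```

In this model, `send_with_retry` with `progress_mode::poll` succeeds on the second attempt, while with `progress_mode::check` it exhausts every attempt because the queued connection request is never progressed.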

This PR contains fixes for both bugs.

nusislam commented 2 years ago

@vsanjeep - Please create a PR against the internal oneCCL master so we can run it through our CI.