oneapi-src / oneCCL

oneAPI Collective Communications Library (oneCCL)
https://oneapi-src.github.io/oneCCL

Issue about using shared memory provider #106

Closed: bigPYJ1151 closed this issue 9 months ago

bigPYJ1151 commented 9 months ago

Hi, I am trying to use the shared memory provider of oneCCL, but it fails when I enable it for the benchmark example.

When I run

CCL_LOG_LEVEL=info CCL_ATL_TRANSPORT=ofi CCL_ATL_SHM=1 FI_PROVIDER=shm mpirun -n 2 _install/examples/benchmark/benchmark -i 36 -j off -l allreduce -d bfloat16 -y 1048576,8388608,4096000,160000

I get the following error:

Abort(2138767) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Unknown error class, error stack:
MPIR_Init_thread(189)........: 
MPID_Init(1561)..............: 
MPIDI_OFI_mpi_init_hook(1584): 
open_fabric(2663)............: 
find_provider(2819)..........: OFI fi_getinfo() failed (ofi_init.c:2819:find_provider:No data available)

Is any extra configuration or library needed to support the SHM provider?

zhouyuan commented 9 months ago

Gentle ping @tarudoodi: is this a known issue for oneCCL?

thanks, -yuan

tarudoodi commented 9 months ago

@zhouyuan Our tests pass with the shm provider. This error usually indicates that the provider failed to initialize. Can you try a simple MPI hello-world on your system with the shm provider to make sure the environment is set up correctly? The fi_info output should also list the shm provider.
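
The fi_info check suggested above can be scripted. The following is a minimal sketch, not part of oneCCL or libfabric: it assumes the `fi_info` utility from libfabric is on PATH, that `fi_info -p shm` restricts the query to the shm provider, and that each reported entry starts with a line of the form `provider: <name>`. The helper names `providers_in` and `shm_provider_available` are hypothetical.

```python
import shutil
import subprocess


def providers_in(fi_info_output: str) -> set:
    """Collect provider names from lines of the form 'provider: <name>'.

    The line format is an assumption about fi_info's default output.
    """
    names = set()
    for line in fi_info_output.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key.strip() == "provider":
            names.add(value.strip())
    return names


def shm_provider_available(binary: str = "fi_info") -> bool:
    """Return True if `<binary> -p shm` succeeds and reports the shm provider."""
    if shutil.which(binary) is None:
        return False  # libfabric utilities are not installed or not on PATH
    try:
        result = subprocess.run(
            [binary, "-p", "shm"],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "shm" in providers_in(result.stdout)


if __name__ == "__main__":
    print("shm provider available:", shm_provider_available())
```

If this reports False while fi_info is installed, the provider itself is probably failing to initialize, which would match the fi_getinfo() failure in the error stack.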

bigPYJ1151 commented 9 months ago

@tarudoodi Thanks for your response! After investigating, we found that the root cause was insufficient shared memory. Our Docker container has only 64 MB of shared memory, but oneCCL sets FI_SHM_RX_SIZE and FI_SHM_TX_SIZE to 8196 by default, much larger than libfabric's default of 1024. After reducing these sizes and increasing the container's shared memory to 4 GB, the shm provider worked.
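
The fix described above can be checked before launching the benchmark. The sketch below rests on several assumptions: shared memory is mounted at /dev/shm, the 4 GB that worked in this thread is used as a threshold (it is not a documented minimum), a queue size of 1024 (the libfabric default cited above) is an illustrative choice since the thread does not state the reduced values, and oneCCL honors FI_SHM_RX_SIZE and FI_SHM_TX_SIZE when they are already set in the environment. All function names are hypothetical.

```python
import os
import shutil

GIB = 1024 ** 3


def shm_total_bytes(path: str = "/dev/shm") -> int:
    """Total capacity of the filesystem mounted at `path`."""
    return shutil.disk_usage(path).total


def shm_is_large_enough(total_bytes: int, required_bytes: int = 4 * GIB) -> bool:
    """Compare capacity against a threshold (4 GiB is an assumption from the thread)."""
    return total_bytes >= required_bytes


def shm_launch_env(queue_size: int = 1024, base=None) -> dict:
    """Build an environment for an shm-provider run with reduced queue sizes.

    queue_size=1024 is illustrative; the thread does not say which values were used.
    """
    env = dict(os.environ if base is None else base)
    env["CCL_ATL_TRANSPORT"] = "ofi"
    env["CCL_ATL_SHM"] = "1"
    env["FI_PROVIDER"] = "shm"
    env["FI_SHM_RX_SIZE"] = str(queue_size)
    env["FI_SHM_TX_SIZE"] = str(queue_size)
    return env


if __name__ == "__main__":
    if os.path.isdir("/dev/shm"):
        total = shm_total_bytes()
        print(f"/dev/shm capacity: {total / GIB:.2f} GiB")
        if not shm_is_large_enough(total):
            # For Docker, the capacity is set at container start,
            # e.g. `docker run --shm-size=4g ...`.
            print("Shared memory may be too small for the shm provider.")
    else:
        print("/dev/shm not found on this system.")
```

The dictionary returned by `shm_launch_env()` can be passed as the `env` argument of `subprocess.run` when invoking mpirun. In a Docker container with the default 64 MB of shared memory, the capacity check would be expected to fail until the container is restarted with a larger `--shm-size`.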