Closed bigPYJ1151 closed 9 months ago
Gentle ping @tarudoodi, is this a known issue for oneCCL?
Thanks, -yuan
@zhouyuan Our tests pass with the shm provider. This error usually indicates that the provider failed to initialize. Can you try a simple MPI hello-world on your system with the shm provider and make sure that the environment is set up correctly? The fi_info output should also list the shm provider.
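For anyone following along, the fi_info check suggested above could be scripted roughly as follows. This is a minimal sketch: the function name check_shm_provider is made up for illustration, and it assumes that fi_info exits with a non-zero status when the provider requested with -p is not available, which is worth confirming on your libfabric build.

```shell
#!/bin/sh
# Hypothetical helper: report whether libfabric's fi_info lists the shm provider.
check_shm_provider() {
    # fi_info ships with libfabric; bail out early if it is not on PATH.
    if ! command -v fi_info >/dev/null 2>&1; then
        echo "fi_info not found on PATH"
        return 2
    fi
    # -p restricts the query to a single provider.
    # Assumption: a non-zero exit status means the provider was not found.
    if fi_info -p shm >/dev/null 2>&1; then
        echo "shm provider is listed by fi_info"
    else
        echo "shm provider is NOT listed by fi_info"
        return 1
    fi
}

# Run the check without aborting the surrounding script on failure.
check_shm_provider || true
```

If the provider is not listed, the problem is likely in the libfabric installation or environment rather than in oneCCL itself.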
@tarudoodi Thanks for your response! After investigating, we found that the root cause was that oneCCL needs more shared memory. We use a Docker container that has 64 MB of shared memory, but oneCCL sets FI_SHM_RX_SIZE and FI_SHM_TX_SIZE to 8196 by default, which is much larger than libfabric's default of 1024. After reducing these sizes and increasing the shared memory size to 4 GB, the shm provider worked.
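A rough sketch of the workaround described above, in case it helps others hitting the same error. The helper name shm_size_mb and the 4096 MiB threshold are illustrative choices based on the numbers in this comment, not values documented by oneCCL, and the thread does not say whether both changes are needed or only one of them.

```shell
#!/bin/sh
# Hypothetical helper: total size in MiB of the filesystem backing a path
# (defaults to /dev/shm). Uses only POSIX df options.
shm_size_mb() {
    df -Pk "${1:-/dev/shm}" | awk 'NR==2 { print int($2 / 1024) }'
}

# Illustrative threshold: the 4 GB that worked in this thread.
required_mb=4096
current_mb=$(shm_size_mb)

if [ "${current_mb:-0}" -lt "$required_mb" ]; then
    echo "/dev/shm is ${current_mb:-unknown} MiB (wanted ${required_mb} MiB)."
    # docker run accepts --shm-size; <image> is a placeholder for your own image.
    echo "Consider restarting the container with: docker run --shm-size=4g <image>"
fi

# Lower the shm provider queue sizes to the libfabric default mentioned above.
export FI_SHM_RX_SIZE=1024
export FI_SHM_TX_SIZE=1024
```

After this, the benchmark command from the original post can be run from the same shell so that it inherits the exported variables.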
Hi, I am trying to use the shared memory provider of oneCCL. However, there are some problems when enabling it on the benchmark example.
When I run
CCL_LOG_LEVEL=info CCL_ATL_TRANSPORT=ofi CCL_ATL_SHM=1 FI_PROVIDER=shm mpirun -n 2 _install/examples/benchmark/benchmark -i 36 -j off -l allreduce -d bfloat16 -y 1048576,8388608,4096000,160000
I got an error message. Is any extra configuration or library needed to support the SHM provider?