oneAPI Collective Communications Library (oneCCL)

oneAPI Collective Communications Library (oneCCL) provides an efficient implementation of communication patterns used in deep learning.

oneCCL is integrated into:

Horovod* (distributed training framework). Refer to Horovod with oneCCL for details.
PyTorch* (machine learning framework). Refer to PyTorch bindings for oneCCL for details.

oneCCL is governed by the UXL Foundation and is an implementation of the oneAPI specification.

Prerequisites
Installation
Usage
Additional Resources
- Blog Posts
- Workshop Materials

Prerequisites

Ubuntu* 18
GNU*: C, C++ 4.8.5 or higher.

Refer to System Requirements for more details.

SYCL support

Intel(R) oneAPI DPC++/C++ Compiler with Level Zero v1.0 support.

To install Level Zero, refer to the instructions in Intel(R) Graphics Compute Runtime repository or to the installation guide for oneAPI users.

BF16 support

AVX512F-based implementation requires GCC 4.9 or higher.
AVX512_BF16-based implementation requires GCC 10.0 or higher and GNU binutils 2.33 or higher.

Installation

General installation scenario:

cd oneccl
mkdir build
cd build
cmake ..
make -j install

If you need a clean build, create a new build directory and invoke cmake within it.

You can also do the following during installation:

Usage

Launching Example Application

Use the command:

$ source <install_dir>/env/setvars.sh
$ mpirun -n 2 <install_dir>/examples/benchmark/benchmark

Using external mpi

The ccl-bundled-mpi flag in vars.sh can take values "yes" or "no" to control if bundled Intel MPI should be used or not. Current default is "yes", which means that oneCCL temporarily overrides the mpi implementation in use.

In order to suppress the behavior and use user-supplied or system-default mpi use the following command instead of sourcing setvars.sh:

$ source <install_dir>/env/vars.sh --ccl-bundled-mpi=no

The mpi implementation will not be overridden. Please note that, in this case, user needs to assure the system finds all required mpi-related binaries.

Setting workers affinity

There are two ways to set worker threads (workers) affinity: automatically and explicitly.

Automatic setup

Set the CCL_WORKER_COUNT environment variable with the desired number of workers per process.
Set the CCL_WORKER_AFFINITY environment variable with the value auto.

Example:

export CCL_WORKER_COUNT=4
export CCL_WORKER_AFFINITY=auto

With the variables above, oneCCL will create four workers per process and the pinning will depend from process launcher.

If an application has been launched using mpirun that is provided by oneCCL distribution package then workers will be automatically pinned to the last four cores available for the launched process. The exact IDs of CPU cores can be controlled by mpirun parameters.

Otherwise, workers will be automatically pinned to the last four cores available on the node.

Explicit setup

Set the CCL_WORKER_COUNT environment variable with the desired number of workers per process.
Set the CCL_WORKER_AFFINITY environment variable with the IDs of cores to pin local workers.

Example:

export CCL_WORKER_COUNT=4
export CCL_WORKER_AFFINITY=3,4,5,6

With the variables above, oneCCL will create four workers per process and pin them to the cores with the IDs of 3, 4, 5, and 6 respectively.

Using oneCCL package from CMake

oneCCLConfig.cmake and oneCCLConfigVersion.cmake are included into oneCCL distribution.

With these files, you can integrate oneCCL into a user project with the find_package command. Successful invocation of find_package(oneCCL <options>) creates imported target oneCCL that can be passed to the target_link_libraries command.

For example:

project(Foo)
add_executable(foo foo.cpp)

# Search for oneCCL
find_package(oneCCL REQUIRED)

# Connect oneCCL to foo
target_link_libraries(foo oneCCL)

oneCCLConfig files generation

To generate oneCCLConfig files for oneCCL package, use the provided cmake/scripts/config_generation.cmake file:

cmake [-DOUTPUT_DIR=<output_dir>] -P cmake/script/config_generation.cmake

OS File Descriptors

oneCCL uses Level Zero IPC handles so that a process can access a memory allocation done by a different process. However, these IPC handles consume OS File Descriptors (FDs). As a result, to avoid running out of OS FDs, we recommend to increase the default limit of FDs in the system for applications running with oneCCL and GPU buffers.

The number of FDs required is application-dependent, but the recommended limit is 1048575. This value can be modified with the ulimit command.

Governance

The oneCCL project is governed by the UXL Foundation and you can get involved in this project in multiple ways. It is possible to join the Special Interest Groups (SIG) meetings where the group discuss and demonstrates work using the foundation projects. Members can also join the Open Source and Specification Working Group meetings.

You can also join the mailing lists for the UXL Foundation to be informed of when meetings are happening and receive the latest information and discussions.

Additional Resources

Blog Posts

Workshop Materials

oneAPI, oneCCL and OFI: Path to Heterogeneous Architecure Programming with Scalable Collective Communications: recording and slides

Contribute

See CONTRIBUTING for more information.

License

Distributed under the Apache License 2.0 license. See LICENSE for more information.

Security Policy

See SECURITY for more information.

oneapi-src / oneCCL

readme