openucx / xccl

Adding the XCCL DPU team, and DPU daemon #106

Open janjust opened 3 years ago

janjust commented 3 years ago

This PR adds the new DPU team as well as a contrib directory with the accompanying DPU daemon app.

This is a first but comprehensive attempt that successfully runs the PyTorch param-comms benchmark. It was tested on 32 BlueField-enabled nodes.

There are several configuration options to keep in mind when running.

New configure option: --with-dpu=yes

client/host side:
two new flags, plus dpu as an additional value for the TLS parameter:
-x TORCH_UCC_TLS=dpu
-x XCCL_TEAM_DPU_ENABLE=1
-x XCCL_TEAM_DPU_HOST_DPU_LIST=

The host_dpu_list file is a one-to-one host-to-DPU mapping file that the DPU team uses to identify the IP address of each host's DPU.
eg:
host1 dpu1
host2 dpu2
etc.
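
To sanity-check a mapping file before launching, a lookup along these lines may help. It is only a sketch: it assumes the file holds one whitespace-separated "host dpu" pair per line, as in the example above, and lookup_dpu is a hypothetical helper, not something provided by this PR or by XCCL.

```shell
# Hypothetical helper: print the DPU mapped to a host in a "host dpu" file.
# Exits non-zero when the host has no entry.
lookup_dpu() {
    # $1 = host name, $2 = path to the mapping file
    awk -v host="$1" '
        $1 == host { print $2; found = 1; exit }
        END        { exit !found }
    ' "$2"
}

# Example mapping file, using the placeholder names from the description.
map_file=$(mktemp)
printf 'host1 dpu1\nhost2 dpu2\n' > "$map_file"

lookup_dpu host2 "$map_file"   # prints dpu2
```

Running the lookup for every host in the job's hostfile is a cheap way to catch a missing or misspelled entry before the DPU team tries to connect.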

dpu side:
-x DPU_DATA_BUFFER_SIZE=$((16 * 1024 * 1024))
An environment variable that sets the buffer size available on the DPU.
If not provided, the default is 16MB.
./dpu_server <threads (int)> takes the number of threads as an optional argument; by default it uses a single thread.

eg.
mpirun -np 4 --map-by ppr:1:node -x UCX_NET_DEVICES=mlx5_0:1 -x XCCL_TEST_TLS=ucx --bind-to none --report-bindings --tag-output -hostfile file.dpus -x LD_LIBRARY_PATH  ./dpu_server 4
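
To use a buffer other than the 16MB default, the value can be computed the same way before launching. A small sketch, assuming the variable is expressed in bytes (which the $((16 * 1024 * 1024)) form above suggests); buffer_mb is a name introduced here purely for illustration.

```shell
# Request a 32 MB DPU buffer instead of the documented 16 MB default.
# Assumption: DPU_DATA_BUFFER_SIZE is a byte count.
buffer_mb=32
DPU_DATA_BUFFER_SIZE=$((buffer_mb * 1024 * 1024))
export DPU_DATA_BUFFER_SIZE

echo "$DPU_DATA_BUFFER_SIZE"   # prints 33554432
```

With the variable exported, passing -x DPU_DATA_BUFFER_SIZE on the mpirun command line forwards it to the dpu_server processes, in the same way LD_LIBRARY_PATH is forwarded in the example above.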

Signed-off-by: Tomislavj Janjusic tomislavj@nvidia.com

Co-authored-by: Artem Polyakov artpol84@gmail.com Sergey Lebedev sergeyle@nvidia.com

janjust commented 3 years ago

@manjugv @Sergei-Lebedev @vspetrov Hey guys - this is the PR that adds the DPU team, developed during the hackathon by @artpol84, @Sergei-Lebedev and me.

It's the first attempt that runs successfully, but it obviously needs strong vetting. We did preliminary data checks with the xccl allreduce tests, which seem to pass, and it successfully runs the pytorch param/comms bench.

artpol84 commented 3 years ago

@janjust please change the commit message as follows:

Co-authored-by: Artem Polyakov <artpol84@gmail.com>
Co-authored-by: Sergey Lebedev <sergeyle@nvidia.com>

Per https://docs.github.com/en/free-pro-team@latest/github/committing-changes-to-your-project/creating-a-commit-with-multiple-authors
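
For reference, a message in that shape can be assembled mechanically. This is a minimal sketch: commit_msg is a hypothetical helper, and the subject line is simply the PR title; the one firm requirement it illustrates is the blank line before the trailers, with one Co-authored-by line per person.

```shell
# Hypothetical helper: build a commit message with one
# "Co-authored-by: Name <email>" trailer per co-author.
commit_msg() {
    printf '%s\n\n' "$1"      # subject line, then the blank line before trailers
    shift
    for author in "$@"; do
        printf 'Co-authored-by: %s\n' "$author"
    done
}

msg=$(commit_msg "Adding the XCCL DPU team, and DPU daemon" \
    "Artem Polyakov <artpol84@gmail.com>" \
    "Sergey Lebedev <sergeyle@nvidia.com>")
printf '%s\n' "$msg"

# To apply it to the most recent commit (this rewrites the commit, so the
# branch needs a force-push afterwards):
#   git commit --amend -s -m "$msg"
```

The -s flag in the commented command re-adds the Signed-off-by line for the committer.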

artpol84 commented 3 years ago

I tried it out of curiosity and it works as expected: https://github.com/artpol84/xccl/commit/91a6466ba984109480c51fc7125559fdcc0b97d6

janjust commented 3 years ago

> @janjust please change the commit message as follows:
>
> Co-authored-by: Artem Polyakov <artpol84@gmail.com>
> Co-authored-by: Sergey Lebedev <sergeyle@nvidia.com>
>
> Per https://docs.github.com/en/free-pro-team@latest/github/committing-changes-to-your-project/creating-a-commit-with-multiple-authors

done