openucx / ucc

Unified Collective Communication Library
https://openucx.github.io/ucc/
BSD 3-Clause "New" or "Revised" License

TL/UCP: Add Sliding Window allreduce implementation #958

Closed: nsarka closed this pull request 2 weeks ago

nsarka commented 2 months ago

This PR is the second of two that add Sliding Window allreduce to UCC. It implements the function stubs left by the first PR (https://github.com/openucx/ucc/pull/902).
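
For readers new to the approach, the rough idea of a sliding window allreduce is to split the reduction buffer into fragments and pipeline a get / local reduce / put sequence over them, keeping at most a fixed window of fragments in flight at once. Below is a minimal, self-contained C sketch of that control flow only; it is not the code from this PR, the helper names (fetch_fragment, reduce_fragment, push_fragment) and the sizes are made up for illustration, and the real implementation issues UCP one-sided operations rather than these local stand-ins.

/* sliding_window_sketch.c -- illustrative only, not UCC/UCP code */
#include <stdio.h>

#define NUM_FRAGS   16   /* total fragments the buffer is split into (assumed) */
#define WINDOW_SIZE 4    /* max fragments in flight at once (assumed) */

/* Local stand-ins for the real one-sided get / local reduce / put steps. */
static void fetch_fragment(int frag)  { printf("get    fragment %d\n", frag); }
static void reduce_fragment(int frag) { printf("reduce fragment %d\n", frag); }
static void push_fragment(int frag)   { printf("put    fragment %d\n", frag); }

int main(void)
{
    int issued = 0, completed = 0;

    while (completed < NUM_FRAGS) {
        /* Fill the window: start new fragments until WINDOW_SIZE are in flight. */
        while (issued < NUM_FRAGS && issued - completed < WINDOW_SIZE) {
            fetch_fragment(issued);
            issued++;
        }
        /* Retire the oldest in-flight fragment: reduce it locally, then push
         * the partial result back out.  A real implementation would poll for
         * completion instead of assuming fragments finish in order. */
        reduce_fragment(completed);
        push_fragment(completed);
        completed++;
    }
    return 0;
}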

swx-jenkins3 commented 2 months ago

Can one of the admins verify this patch?

Sergei-Lebedev commented 2 months ago

ok to test

nsarka commented 1 month ago

@Sergei-Lebedev It seems the same ucc test is failing here as in the active_set PR. However, this one fails because of a wrong CUDA version: nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.1, please update your driver to a newer version, or use an earlier cuda container: unknown

janjust commented 1 month ago

@artemry-nv Seems like we have CI issues on this PR that are unrelated to the changes themselves.

artemry-nv commented 1 month ago

@B-a-S please take a look

B-a-S commented 1 month ago

> @B-a-S please take a look

I've rerun the job. Take a look: http://hpc-master.lab.mtl.com:8080/job/ucc/3282/

nsarka commented 1 month ago

The ucc gtest failed on:

[ RUN      ] test_tl_mlx5_dm.MemcpyToDeviceMemory/4
[2024-05-16T18:13:52.173Z] [swx-clx01:196  :0:196] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))

nsarka commented 1 month ago

On this PR, I logged into the CI node and opened a shell inside the container that was running this gtest. When I ran it myself, it passed:

swx-jenkins@swx-clx01:/opt/nvidia/src/ucc/build/test/gtest$ stdbuf -e0 -o0 /opt/nvidia/src/ucc/build/test/gtest/gtest --gtest_filter=*test_tl_mlx5_dm.MemcpyToDeviceMemory*
Note: Google Test filter = *test_tl_mlx5_dm.MemcpyToDeviceMemory*
[==========] Running 8 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 8 tests from test_tl_mlx5_dm
[ RUN      ] test_tl_mlx5_dm.MemcpyToDeviceMemory/0
[  SKIPPED ] test_tl_mlx5_dm.MemcpyToDeviceMemory/0 (63 ms)
[ RUN      ] test_tl_mlx5_dm.MemcpyToDeviceMemory/1
[       OK ] test_tl_mlx5_dm.MemcpyToDeviceMemory/1 (36 ms)
[ RUN      ] test_tl_mlx5_dm.MemcpyToDeviceMemory/2
[  SKIPPED ] test_tl_mlx5_dm.MemcpyToDeviceMemory/2 (35 ms)
[ RUN      ] test_tl_mlx5_dm.MemcpyToDeviceMemory/3
[       OK ] test_tl_mlx5_dm.MemcpyToDeviceMemory/3 (35 ms)
[ RUN      ] test_tl_mlx5_dm.MemcpyToDeviceMemory/4
[       OK ] test_tl_mlx5_dm.MemcpyToDeviceMemory/4 (38 ms)
[ RUN      ] test_tl_mlx5_dm.MemcpyToDeviceMemory/5
[  SKIPPED ] test_tl_mlx5_dm.MemcpyToDeviceMemory/5 (37 ms)
[ RUN      ] test_tl_mlx5_dm.MemcpyToDeviceMemory/6
[       OK ] test_tl_mlx5_dm.MemcpyToDeviceMemory/6 (48 ms)
[ RUN      ] test_tl_mlx5_dm.MemcpyToDeviceMemory/7
[       OK ] test_tl_mlx5_dm.MemcpyToDeviceMemory/7 (59 ms)
[----------] 8 tests from test_tl_mlx5_dm (351 ms total)

[----------] Global test environment tear-down
[==========] 8 tests from 1 test suite ran. (351 ms total)
[  PASSED  ] 5 tests.
[  SKIPPED ] 3 tests.

nsarka commented 1 month ago

I updated the PR. The get/reduce/put phase and the barrier part of the algorithm are now run via a schedule. I left the allgather phase inside the get/reduce/put phase for now, because once Ferrol's PR goes in I will remove that allgather and use his work for the key exchange instead.
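
For context, "run via a schedule" here means the phases are expressed as tasks whose start order is driven by dependencies (the barrier only starts once the get/reduce/put task completes). The sketch below models that idea with plain function pointers executed in dependency order; the type and helper names are hypothetical and this is not the UCC schedule API.

/* schedule_sketch.c -- hypothetical illustration of task ordering,
 * not the UCC schedule implementation */
#include <stddef.h>
#include <stdio.h>

typedef void (*task_fn)(void);

typedef struct {
    const char *name;
    task_fn     run;
} task_t;

static void run_get_reduce_put(void) { printf("get/reduce/put phase\n"); }
static void run_barrier(void)        { printf("barrier phase\n"); }

int main(void)
{
    /* Tasks listed in dependency order: each entry may only start
     * once the previous one has completed. */
    task_t schedule[] = {
        { "get_reduce_put", run_get_reduce_put },
        { "barrier",        run_barrier        },
    };

    for (size_t i = 0; i < sizeof(schedule) / sizeof(schedule[0]); i++) {
        printf("starting task: %s\n", schedule[i].name);
        schedule[i].run();
    }
    return 0;
}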

Also, I moved some code into two new files, tl_ucp_dpu_offload.c and tl_ucp_dpu_offload.h. These contain common code that will be used by the rest of the collectives we're planning to implement.

nsarka commented 1 month ago

Please wait to review, there are some failures I should fix first.

artemry-nv commented 1 month ago

bot:retest

nsarka commented 1 month ago

@samnordmann The PR is ready for review, thank you. Please note that the allgather task is still part of the algorithm; once Ferrol's PR goes in, I will convert the code to use it for the import/allgather phase. I also left the reduction inside the algorithm, since splitting it out would mean splitting the gets and puts into tasks of their own as well, and there would be thousands of those tasks for large message sizes.
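
To give a sense of the scale being avoided (the numbers are assumptions for illustration, not figures from the PR): with a 1 GiB message split into 64 KiB pipeline fragments there are 16384 fragments, so giving each fragment's get and put its own task would already mean over 32000 task objects per rank. A back-of-the-envelope check:

/* task_count_sketch.c -- illustrative arithmetic only; sizes are assumed */
#include <stdio.h>

int main(void)
{
    const unsigned long long msg_bytes  = 1ULL << 30;  /* 1 GiB message (assumed) */
    const unsigned long long frag_bytes = 64ULL << 10; /* 64 KiB fragment (assumed) */
    const unsigned long long frags      = msg_bytes / frag_bytes;
    printf("fragments: %llu, separate get+put tasks: %llu\n", frags, 2 * frags);
    return 0;
}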

janjust commented 2 weeks ago

@Sergei-Lebedev When you say follow-up changes, do you mean a separate PR after this one is merged? If so, I'll open an issue and tag this PR so we don't forget.

nsarka commented 2 weeks ago

@Sergei-Lebedev @samnordmann The PR is ready to be merged