openucx / ucc

Unified Collective Communication Library
https://openucx.github.io/ucc/
BSD 3-Clause "New" or "Revised" License

TL/UCP: Add support for active_set broadcast with knomial and dbt for size greater than 2 #926

Closed: nsarka closed this 4 months ago

nsarka commented 7 months ago

Active set is a sort of lightweight subcommunicator for TL/UCP and TL/NCCL. It was originally added as a hack to enable point-to-point communication. This PR:

swx-jenkins3 commented 7 months ago

Can one of the admins verify this patch?

Sergei-Lebedev commented 7 months ago

ok to test

manjugv commented 6 months ago

@nsarka What is the use case for enabling this? If you're removing the restriction, why limit it only to broadcast?

nsarka commented 6 months ago

> @nsarka What is the use case for enabling this? If you're removing the restriction, why limit it only to broadcast?

This was requested by Almog from the cuBLAS team. I'm not sure of the details beyond that. @Sergei-Lebedev, do you know what Almog wanted with this?

almogsegal commented 6 months ago

The use case is to be able to perform a "broadcast" on row/col comms without having to create those comms. For linear algebra functionality with a 2DBC data layout, using row/col communicators simplifies the code and improves performance. The problem is that NCCL comm creation is very expensive. This way, we can use the active_set broadcast and the stride to perform a "row/col bcast" without paying the comm-split price, while also enjoying the ncclGroupStart/End functionality that is not exposed through UCC.
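For concreteness, here is a minimal sketch of what such a row broadcast could look like through the public UCC API, assuming the active_set descriptor (start/stride/size) and the UCC_COLL_ARGS_FIELD_ACTIVE_SET mask bit in ucc_coll_args_t; the helper name row_bcast, the grid parameters, and the float64/CUDA buffer settings are illustrative only, not part of this PR.

```c
#include <stddef.h>
#include <ucc/api/ucc.h>

/* Illustrative sketch: broadcast one row of a process grid laid out
 * row-major, selecting the row with an active set instead of creating a
 * row communicator. Assumes ranks row_start .. row_start+grid_cols-1
 * form the row and row_root is given as a rank of the full team. */
static ucc_status_t row_bcast(ucc_context_h ctx, ucc_team_h team,
                              void *buf, size_t count,
                              uint64_t row_start, uint64_t grid_cols,
                              uint64_t row_root)
{
    ucc_coll_args_t args = {0};
    ucc_coll_req_h  req;
    ucc_status_t    st;

    args.mask              = UCC_COLL_ARGS_FIELD_ACTIVE_SET;
    args.coll_type         = UCC_COLL_TYPE_BCAST;
    args.root              = row_root;
    args.src.info.buffer   = buf;
    args.src.info.count    = count;
    args.src.info.datatype = UCC_DT_FLOAT64;
    args.src.info.mem_type = UCC_MEMORY_TYPE_CUDA;
    /* Active set: every rank of this row, no team split required. */
    args.active_set.start  = row_start;
    args.active_set.stride = 1;
    args.active_set.size   = grid_cols;

    st = ucc_collective_init(&args, &req, team);
    if (st != UCC_OK) {
        return st;
    }
    st = ucc_collective_post(req);
    if (st != UCC_OK) {
        ucc_collective_finalize(req);
        return st;
    }
    do {
        ucc_context_progress(ctx);
        st = ucc_collective_test(req);
    } while (st == UCC_INPROGRESS);
    ucc_collective_finalize(req);
    return st;
}
```

On a P x Q grid with a row-major rank layout, a row is the contiguous range covered by stride = 1; a column broadcast would instead use the column index as start and Q as the stride.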

janjust commented 5 months ago

@manjugv @Sergei-Lebedev @samnordmann Tested on @almogsegal's container; it works as intended.

janjust commented 5 months ago

bot:retest

Sergei-Lebedev commented 4 months ago

bot:retest

Sergei-Lebedev commented 4 months ago

@nsarka can you please check why gtest failed in CI?

nsarka commented 4 months ago

> @nsarka can you please check why gtest failed in CI?

@Sergei-Lebedev it seems the gtest passed, but the UCC test failed with "Cancelling nested steps due to timeout". It looks like it ran for 2 hours, passed the build and codestyle stages, and then hung for 4 hours in the UCC / Torch-UCC test until the 6-hour timeout.

janjust commented 4 months ago

@Sergei-Lebedev, it seems we're facing the same issue (something with the containers) on several PRs.

Sergei-Lebedev commented 4 months ago

@janjust I think we're seeing two different problems. Here it's:

[2024-05-07T10:35:28.799Z] [----------] 3 tests from test_context
[2024-05-07T10:35:28.799Z] [ RUN      ] test_context.create_destroy
[2024-05-07T10:35:28.799Z] [       OK ] test_context.create_destroy (236 ms)
[2024-05-07T10:35:28.799Z] [ RUN      ] test_context.init_multiple
[2024-05-07T10:35:29.727Z] [       OK ] test_context.init_multiple (983 ms)
[2024-05-07T10:35:29.727Z] [ RUN      ] test_context.global
[2024-05-07T11:13:33.528Z]  For more details, please look at: 
[2024-05-07T11:13:33.528Z]     /scrap/jenkins/workspace/ucc/.ci/scripts/cov-build/build-log.txt
[2024-05-07T11:13:33.528Z] Running Coverity analysis...
[2024-05-07T11:13:33.528Z] Coverity Static Analysis version 2021.12.1 on Linux 5.4.0-107-generic x86_64
[2024-05-07T11:13:33.528Z] Internal version numbers: 5269ff0e19 p-2021.12-push-625
[2024-05-07T11:13:33.528Z] 

Note that the test_context.global test hangs. In another PR the failure is different (probably a CI issue):

nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.1, please update your driver to a newer version, or use an earlier cuda container: unknown

nsarka commented 4 months ago

> @janjust I think we're seeing two different problems. Here it's:
>
> [2024-05-07T10:35:28.799Z] [----------] 3 tests from test_context
> [2024-05-07T10:35:28.799Z] [ RUN      ] test_context.create_destroy
> [2024-05-07T10:35:28.799Z] [       OK ] test_context.create_destroy (236 ms)
> [2024-05-07T10:35:28.799Z] [ RUN      ] test_context.init_multiple
> [2024-05-07T10:35:29.727Z] [       OK ] test_context.init_multiple (983 ms)
> [2024-05-07T10:35:29.727Z] [ RUN      ] test_context.global
> [2024-05-07T11:13:33.528Z]  For more details, please look at: 
> [2024-05-07T11:13:33.528Z]     /scrap/jenkins/workspace/ucc/.ci/scripts/cov-build/build-log.txt
> [2024-05-07T11:13:33.528Z] Running Coverity analysis...
> [2024-05-07T11:13:33.528Z] Coverity Static Analysis version 2021.12.1 on Linux 5.4.0-107-generic x86_64
> [2024-05-07T11:13:33.528Z] Internal version numbers: 5269ff0e19 p-2021.12-push-625
> [2024-05-07T11:13:33.528Z] 
>
> Note that the test_context.global test hangs. In another PR the failure is different (probably a CI issue):
>
> nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.1, please update your driver to a newer version, or use an earlier cuda container: unknown

Hi Sergey, in the other PR (https://github.com/openucx/ucc/pull/958), @B-a-S fixed and reran the failing pipeline. He posted a link to the new log (http://hpc-master.lab.mtl.com:8080/job/ucc/3282/), which hangs in the same test_context.global test as this PR.

nsarka commented 4 months ago

@Sergei-Lebedev I ran the hanging gtest manually with my nsarka/active-set-broadcast branch and it passed, which suggests this is a CI issue:

[ RUN      ] test_context.global
[       OK ] test_context.global (2061 ms)

nsarka commented 4 months ago

@Sergei-Lebedev Hey Sergey, I updated the PR so that the root is interpreted in terms of the ucc_team only if the active_set flag was set on coll_args. This fixed the hang in tl/mlx5's ucc_service_coll_bcast, which is used for broadcasting protection domain details and hardcodes the root to 0 in terms of the node-level subset rather than that rank's index in the ucc team. tl/sharp and, I think, some others do this too.
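For readers following the root-semantics point above, here is a small hypothetical helper (not code from this PR; the name subset_root_to_team_rank is illustrative) showing the mapping it implies: a root given relative to a strided subset {start, stride} corresponds to team rank start + root * stride, and with this change that mapping is only needed when the active_set flag is not set, since a flagged active_set broadcast already passes the root as a ucc_team rank.

```c
#include <stdint.h>

/* Hypothetical sketch, not code from this PR: map a root given relative
 * to a strided subset {start, stride} to the corresponding rank in the
 * full UCC team. Per the comment above, this mapping is only needed for
 * subset-relative roots (e.g. tl/mlx5's service bcast that hardcodes
 * root = 0 on the node-level subset); when the active_set flag is set,
 * the caller's root is already a ucc_team rank. */
static inline uint64_t subset_root_to_team_rank(uint64_t subset_root,
                                                uint64_t start,
                                                int64_t  stride)
{
    return (uint64_t)((int64_t)start + (int64_t)subset_root * stride);
}
```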