With https://github.com/openxla/xla/pull/17749, the latency hiding scheduler (LHS) can schedule for multiple collective resources. However, some pairs of collectives cannot be overlapped. When two collectives on different streams share at least two ranks, they can form a cyclic dependency, because the execution order of the NCCL kernels can differ from rank to rank. This PR refactors LHS to expose the comparator to the backend, and enforces the above constraint for the GPU backend.
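The constraint above can be illustrated with a minimal sketch. This is not the code in this PR: `may_overlap` and the list-of-lists replica group representation are hypothetical names chosen for illustration, and the sketch assumes the only condition being checked is whether some pair of groups shares at least two ranks.

```python
from itertools import product


def may_overlap(groups_a, groups_b):
    """Hypothetical check, not the XLA implementation.

    groups_a / groups_b are the replica groups of two collectives that
    would run on different streams, each given as a list of rank lists.
    Returns False if any group of one collective shares two or more ranks
    with a group of the other, since those ranks could launch the two
    NCCL kernels in opposite orders and wait on each other.
    """
    for group_a, group_b in product(groups_a, groups_b):
        if len(set(group_a) & set(group_b)) >= 2:
            return False
    return True


# Each pair of groups shares at most one rank: overlap is allowed.
row_groups = [[0, 1], [2, 3]]
col_groups = [[0, 2], [1, 3]]
print(may_overlap(row_groups, col_groups))

# Ranks 0 and 1 take part in both collectives: overlap is rejected.
all_ranks = [[0, 1, 2, 3]]
print(may_overlap(all_ranks, row_groups))
```

With a single shared rank there is no second participant that could pick the opposite launch order, which is why the threshold in this sketch is two.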
Copybara import of the project:
--
14362ea3ef78d810a5e34c03f4a0e4c44915470c by Terry Sun tesun@nvidia.com:
LHS deadlock avoidance
--
3937dc9277d73a5b2c5e167da4b95072904df3e3 by Terry Sun tesun@nvidia.com:
minor fixes
--
30db21f9e2e810527bd1a5ad55aab5362e12a161 by Terry Sun tesun@nvidia.com:
minor fix
Merging this change closes #19026
FUTURE_COPYBARA_INTEGRATE_REVIEW=https://github.com/openxla/xla/pull/19026 from terryysun:terryysun/overlapping_collectives 30db21f9e2e810527bd1a5ad55aab5362e12a161
PR #19026: [NVIDIA GPU] LHS enhancement for multiple collective resources
Imported from GitHub PR https://github.com/openxla/xla/pull/19026