Sequence parallel = Tensor parallel + dividing the sequence by `tp_degree` at layer boundaries.

In addition to the op parallelism achieved by TP, SP reduces the activation size transferred between pipeline stages by a factor of `tp_degree`.
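For intuition on that saving, here is a toy size calculation (a minimal sketch; the batch/sequence/hidden sizes and bf16 dtype are assumptions for illustration, not values from this PR):

```python
# Illustrative activation volume at a pipeline-stage boundary.
# All shapes and the dtype are assumptions for this example.
batch, seq_len, hidden, tp_degree = 1, 4096, 4096, 4
bytes_per_elem = 2  # bf16

tp_only = batch * seq_len * hidden * bytes_per_elem                # full activation
tp_sp = batch * (seq_len // tp_degree) * hidden * bytes_per_elem   # sequence-sharded

print(f"TP only: {tp_only / 2**20:.0f} MiB per stage-to-stage send")  # 32 MiB
print(f"TP + SP: {tp_sp / 2**20:.0f} MiB per stage-to-stage send")    # 8 MiB
```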
The PR includes 4 parts:

* `parallelize_module` calls (this part mainly mimics what's used in torchtitan; a plan sketch follows this list).
* … `tp_degree`).
* `input_pos` -- we still need to use the full seq length instead of `x.shape[1]` to create it, because `x` could have been cut along the sequence dimension in SP cases (see the second sketch below).
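On the first part, a minimal sketch of what a torchtitan-style TP + SP plan looks like (the toy block, module names, and dimensions are assumptions for illustration; the PR's actual plan lives in its `parallelize_module` calls):

```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    SequenceParallel,
    parallelize_module,
)


class ToyBlock(torch.nn.Module):
    """Stand-in for one transformer block (hypothetical names/shapes)."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.ffn_norm = torch.nn.LayerNorm(dim)
        self.w1 = torch.nn.Linear(dim, 4 * dim, bias=False)
        self.w2 = torch.nn.Linear(4 * dim, dim, bias=False)


dist.init_process_group("gloo")  # CPU backend so the sketch runs anywhere
mesh = init_device_mesh("cpu", (dist.get_world_size(),))

block = parallelize_module(
    ToyBlock(),
    mesh,
    {
        # SP: the norm consumes a sequence-sharded activation; DTensor
        # inserts the resharding at the layer boundary.
        "ffn_norm": SequenceParallel(),
        # TP: the usual colwise -> rowwise sharding of the MLP weights.
        "w1": ColwiseParallel(),
        "w2": RowwiseParallel(),
    },
)
print(block.w1.weight.placements)  # e.g. (Shard(dim=0),)
dist.destroy_process_group()
```

Launched with, e.g., `torchrun --nproc-per-node 4 sp_plan_sketch.py` (the filename is made up).

And on the `input_pos` caveat, a toy illustration of why the shard's `x.shape[1]` understates the true sequence length (shapes are made up, not from `dist_run.py`):

```python
import torch

seq_len, tp_degree, hidden = 16, 4, 8

# Under SP, the activation entering a stage is already cut along the
# sequence dimension, so its shape[1] is seq_len // tp_degree.
x = torch.randn(1, seq_len // tp_degree, hidden)

wrong = torch.arange(x.shape[1])  # only positions 0..3: misses 4..15
right = torch.arange(seq_len)     # full sequence length, tracked out-of-band

print(wrong.shape, right.shape)   # torch.Size([4]) torch.Size([16])
```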
Test:

`torchrun --nproc-per-node 8 dist_run.py` (PP=2, TP=4, so 2 × 4 = 8 ranks)