pytorch / torchchat

Run PyTorch LLMs locally on servers, desktop and mobile

[Distributed] Enable sequence parallel #1099

Closed kwen2501 closed 1 week ago

kwen2501 commented 2 weeks ago

Sequence parallel = Tensor parallel + dividing the sequence by tp_degree at layer boundaries

In addition to the op parallelism achieved by TP, SP reduces the activation size transferred between pipeline stages by a factor of tp_degree.

The PR includes 4 parts:

  1. Pass SP styles to the parallelize_module calls (this part mainly mimics the plan used in torchtitan); a minimal sketch follows this list.
  2. Before the attention layer, gather the sequence back.
  3. Adjust the example inputs to pipeline stages (sequence length divided by tp_degree).
  4. Adjust the code that creates input_pos -- we still need to use the full sequence length instead of x.shape[1] to create it, because x may have been cut along the sequence dimension in SP cases (see the second sketch below).
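
For parts 1 and 2, a minimal sketch of an SP-aware TP plan in the torchtitan style, assuming Llama-like submodule names (`attention.wq`, `feed_forward.w1`, ...); the actual names and plan in this PR may differ:

```python
from torch.distributed._tensor import Replicate, Shard
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    PrepareModuleInput,
    RowwiseParallel,
    SequenceParallel,
    parallelize_module,
)

def apply_tp_sp(transformer_block, tp_mesh):
    """Apply an SP-style TP plan to one decoder layer (names are illustrative)."""
    layer_plan = {
        # Norms run on sequence-sharded activations (Shard(1)).
        "attention_norm": SequenceParallel(),
        # Gather the full sequence right before attention ...
        "attention": PrepareModuleInput(
            input_layouts=(Shard(1), None),
            desired_input_layouts=(Replicate(), None),
        ),
        "attention.wq": ColwiseParallel(),
        "attention.wk": ColwiseParallel(),
        "attention.wv": ColwiseParallel(),
        # ... and re-shard its output on the sequence dim.
        "attention.wo": RowwiseParallel(output_layouts=Shard(1)),
        "ffn_norm": SequenceParallel(),
        "feed_forward": PrepareModuleInput(
            input_layouts=(Shard(1),),
            desired_input_layouts=(Replicate(),),
        ),
        "feed_forward.w1": ColwiseParallel(),
        "feed_forward.w2": RowwiseParallel(output_layouts=Shard(1)),
        "feed_forward.w3": ColwiseParallel(),
    }
    parallelize_module(transformer_block, tp_mesh, layer_plan)
```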

Test: torchrun --nproc-per-node 8 dist_run.py (PP=2, TP=4)
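
For parts 3 and 4, a minimal sketch of the sequence-length bookkeeping, with hypothetical names and shapes (not the actual dist_run.py code):

```python
import torch

# Hypothetical shapes only.
batch_size, seqlen, dim, tp_degree = 4, 2048, 4096, 4

# Part 3: under SP, activations crossing pipeline-stage boundaries are
# sharded on the sequence dim, so the example tensor used to build a
# non-first stage carries seqlen // tp_degree tokens, not the full sequence.
example_stage_input = torch.rand(batch_size, seqlen // tp_degree, dim)

# Part 4: input_pos must still cover the full sequence; x.shape[1] would be
# seqlen // tp_degree on an SP-sharded stage and give wrong positions.
input_pos = torch.arange(seqlen)
```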

pytorch-bot[bot] commented 2 weeks ago

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1099

Note: Links to docs will display an error until the docs builds have been completed.

:white_check_mark: You can merge normally! (1 Unrelated Failure)

As of commit 51b02217ea0c28ae422a62e9e768cb590746ee22 with merge base 8ccf162453bdcda9d7b6c24e65b101764ab4fadf:

BROKEN TRUNK - The following job failed but was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

* [pull / test-gguf-util (macos-14)](https://hud.pytorch.org/pr/pytorch/torchchat/1099#29625455931) ([gh](https://github.com/pytorch/torchchat/actions/runs/10687563902/job/29625455931)) ([trunk failure](https://hud.pytorch.org/pytorch/torchchat/commit/8ccf162453bdcda9d7b6c24e65b101764ab4fadf#29597090542))

This comment was automatically generated by Dr. CI and updates every 15 minutes.