Sequence parallel = Tensor parallel + dividing the sequence by `tp_degree` at layer boundaries.

In addition to the op parallelism achieved by TP, SP reduces the activation size transferred between pipeline stages by a factor of `tp_degree`.
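For intuition on that saving, here is a toy size calculation (a minimal sketch; the batch/sequence/hidden sizes and bf16 dtype are assumptions for illustration, not values from this PR):

```python
# Illustrative activation volume at a pipeline-stage boundary.
# All shapes and the dtype are assumptions for this example.
batch, seq_len, hidden, tp_degree = 1, 4096, 4096, 4
bytes_per_elem = 2  # bf16

tp_only = batch * seq_len * hidden * bytes_per_elem                # full activation
tp_sp = batch * (seq_len // tp_degree) * hidden * bytes_per_elem   # sequence-sharded

print(f"TP only: {tp_only / 2**20:.0f} MiB per stage-to-stage send")  # 32 MiB
print(f"TP + SP: {tp_sp / 2**20:.0f} MiB per stage-to-stage send")    # 8 MiB
```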
The PR includes 4 parts:

* `parallelize_module` calls (this part mainly mimics what's used in torchtitan; a plan sketch follows this list).
* … `tp_degree`).
* `input_pos` -- we still need to use the full seq length instead of `x.shape[1]` to create it, because `x` could have been cut along the sequence dimension in SP cases (see the second sketch below).
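On the first part, a minimal sketch of what a torchtitan-style TP + SP plan looks like (the toy block, module names, and dimensions are assumptions for illustration; the PR's actual plan lives in its `parallelize_module` calls):

```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    SequenceParallel,
    parallelize_module,
)


class ToyBlock(torch.nn.Module):
    """Stand-in for one transformer block (hypothetical names/shapes)."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.ffn_norm = torch.nn.LayerNorm(dim)
        self.w1 = torch.nn.Linear(dim, 4 * dim, bias=False)
        self.w2 = torch.nn.Linear(4 * dim, dim, bias=False)


dist.init_process_group("gloo")  # CPU backend so the sketch runs anywhere
mesh = init_device_mesh("cpu", (dist.get_world_size(),))

block = parallelize_module(
    ToyBlock(),
    mesh,
    {
        # SP: the norm consumes a sequence-sharded activation; DTensor
        # inserts the resharding at the layer boundary.
        "ffn_norm": SequenceParallel(),
        # TP: the usual colwise -> rowwise sharding of the MLP weights.
        "w1": ColwiseParallel(),
        "w2": RowwiseParallel(),
    },
)
print(block.w1.weight.placements)  # e.g. (Shard(dim=0),)
dist.destroy_process_group()
```

Launched with, e.g., `torchrun --nproc-per-node 4 sp_plan_sketch.py` (the filename is made up).

And on the `input_pos` caveat, a toy illustration of why the shard's `x.shape[1]` understates the true sequence length (shapes are made up, not from `dist_run.py`):

```python
import torch

seq_len, tp_degree, hidden = 16, 4, 8

# Under SP, the activation entering a stage is already cut along the
# sequence dimension, so its shape[1] is seq_len // tp_degree.
x = torch.randn(1, seq_len // tp_degree, hidden)

wrong = torch.arange(x.shape[1])  # only positions 0..3: misses 4..15
right = torch.arange(seq_len)     # full sequence length, tracked out-of-band

print(wrong.shape, right.shape)   # torch.Size([4]) torch.Size([16])
```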
Test:

`torchrun --nproc-per-node 8 dist_run.py` (PP=2, TP=4, so 2 × 4 = 8 ranks)