huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

[Question] Sequence Parallel #2161

Closed: yuanenming closed this issue 8 months ago

yuanenming commented 9 months ago

Thanks for sharing the awesome repo.

I've been using Accelerate to train LLMs. My current setup uses DeepSpeed ZeRO-3 to train a 70B-parameter LLaMA-2 model with a sequence length of 4,000 on 8 nodes, each equipped with 8 A100 GPUs. This configuration works smoothly. However, when I try to increase the sequence length, I run into out-of-memory (OOM) errors.
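For concreteness, here is a minimal sketch of the kind of ZeRO-3 setup described above, using Accelerate's DeepSpeedPlugin. The checkpoint name, hyperparameters, dummy data, and flash-attention flag are placeholders, and the exact flags depend on the transformers/accelerate versions:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator, DeepSpeedPlugin
from transformers import AutoModelForCausalLM

# ZeRO-3: shard parameters, gradients, and optimizer states across all 64 ranks.
deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=3,
    gradient_accumulation_steps=1,
    zero3_init_flag=True,  # instantiate the 70B model directly in sharded form
)
accelerator = Accelerator(mixed_precision="bf16", deepspeed_plugin=deepspeed_plugin)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",              # placeholder checkpoint name
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # flag name varies with transformers version
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # placeholder hyperparameters

# Stand-in for a real tokenized dataset of 4,000-token sequences; DeepSpeed needs a
# dataloader in prepare() so it can fill in the micro-batch size (1 per GPU here).
seq_len = 4_000
dummy_ids = torch.randint(0, 32_000, (8, seq_len))
train_dataloader = DataLoader(TensorDataset(dummy_ids), batch_size=1)

model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)
# ... usual training loop, launched across the nodes with `accelerate launch`.
```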

To my understanding, ZeRO-3 shards parameters, gradients, and optimizer states across the data-parallel ranks, but each GPU still holds the full activations for its own micro-batch, so the activation footprint per GPU is substantial even when the micro-batch size is set to 1.
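To make that scaling concrete, here is a rough back-of-envelope estimate, assuming the ~34·s·b·h bytes-per-layer figure from Korthikanti et al. for 16-bit activations with flash attention and no activation checkpointing, plus LLaMA-2 70B dimensions; treat the numbers as order-of-magnitude only:

```python
# Rough activation-memory estimate per GPU. ZeRO-3 shards parameters, gradients,
# and optimizer states, but NOT activations, so this term is paid in full on every rank.
# The 34*s*b*h bytes-per-layer constant is for a standard GPT block; SwiGLU MLPs and
# implementation details shift it somewhat.
def activation_gib(seq_len, hidden=8192, layers=80, micro_batch=1):
    bytes_per_layer = 34 * seq_len * micro_batch * hidden
    return layers * bytes_per_layer / 2**30

for s in (4_000, 8_000, 16_000):
    print(f"seq_len={s:>6}: ~{activation_gib(s):.0f} GiB of activations per GPU")
```

Activation checkpointing lowers the constant substantially, but the growth is still linear in sequence length, which is why longer sequences eventually OOM even at micro-batch size 1 and why sharding the sequence dimension itself is attractive.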

From what I know, ZeRO-3 doesn't support Tensor Parallelism (TP) or Pipeline Parallelism (PP). However, I noticed that the DeepSpeed team has introduced a sequence parallel implementation, as detailed here: DeepSpeed Ulysses. This leads me to two questions:

Thank you for your time and assistance. I look forward to any insights or suggestions you might offer.

yuanenming commented 9 months ago

Oh, I'm already using flash-attn-2.
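For reference, DeepSpeed Ulysses works by wrapping exactly such a local attention kernel (e.g. FlashAttention-2) so that each rank only holds a slice of the sequence. Below is a minimal sketch, assuming DeepSpeed's deepspeed.sequence.layer.DistributedAttention and an already-initialized torch.distributed process group (e.g. under the `deepspeed` or `accelerate launch` launcher); the local attention callable, tensor layout, and group construction are illustrative placeholders, not Accelerate API:

```python
import torch
import torch.distributed as dist
from deepspeed.sequence.layer import DistributedAttention


def local_attention(q, k, v, *args):
    # Placeholder per-rank attention kernel; in practice this would be the model's
    # FlashAttention-2 call. The expected tensor layout depends on DistributedAttention's
    # scatter_idx/gather_idx defaults, so shapes need to be checked against your model.
    return torch.nn.functional.scaled_dot_product_attention(q, k, v)


# Assumption: every rank belongs to one sequence-parallel group (SP size == world size).
sp_group = dist.new_group(ranks=list(range(dist.get_world_size())))

dist_attn = DistributedAttention(local_attention, sp_group)

# Each rank holds only its local slice of the sequence; DistributedAttention performs
# all-to-all exchanges over the heads so attention is still computed over the full
# sequence, which is what shrinks the per-GPU activation footprint.
# q_local, k_local, v_local = ...   # local sequence shards produced by the model
# out = dist_attn(q_local, k_local, v_local)
```

How (or whether) this hooks into Accelerate's DeepSpeed integration is precisely the open question in this issue; the sketch only shows the DeepSpeed-side primitive.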

github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.