huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

[Question] Sequence Parallel #2161

Closed: yuanenming closed this issue 8 months ago

yuanenming commented 9 months ago

Thanks for sharing the awesome repo.

I've been using Accelerate to train LLMs. My current setup uses DeepSpeed ZeRO-3 to train a 70B-parameter LLaMA-2 model with a sequence length of 4,000 on 8 nodes, each equipped with 8 A100 GPUs. This configuration works smoothly. However, when I try to increase the sequence length, I run into out-of-memory (OOM) errors.
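For concreteness, here is a minimal sketch of the kind of ZeRO-3 setup described above, using Accelerate's DeepSpeedPlugin. The checkpoint name, hyperparameters, dummy data, and flash-attention flag are placeholders, and the exact flags depend on the transformers/accelerate versions:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator, DeepSpeedPlugin
from transformers import AutoModelForCausalLM

# ZeRO-3: shard parameters, gradients, and optimizer states across all 64 ranks.
deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=3,
    gradient_accumulation_steps=1,
    zero3_init_flag=True,  # instantiate the 70B model directly in sharded form
)
accelerator = Accelerator(mixed_precision="bf16", deepspeed_plugin=deepspeed_plugin)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",              # placeholder checkpoint name
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # flag name varies with transformers version
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # placeholder hyperparameters

# Stand-in for a real tokenized dataset of 4,000-token sequences; DeepSpeed needs a
# dataloader in prepare() so it can fill in the micro-batch size (1 per GPU here).
seq_len = 4_000
dummy_ids = torch.randint(0, 32_000, (8, seq_len))
train_dataloader = DataLoader(TensorDataset(dummy_ids), batch_size=1)

model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)
# ... usual training loop, launched across the nodes with `accelerate launch`.
```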

To my understanding, ZeRO-3 shards parameters, gradients, and optimizer states across the data-parallel ranks, but each GPU still holds the full activations for its own micro-batch, so the activation footprint per GPU is substantial even when the micro-batch size is set to 1.
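To make that scaling concrete, here is a rough back-of-envelope estimate, assuming the ~34·s·b·h bytes-per-layer figure from Korthikanti et al. for 16-bit activations with flash attention and no activation checkpointing, plus LLaMA-2 70B dimensions; treat the numbers as order-of-magnitude only:

```python
# Rough activation-memory estimate per GPU. ZeRO-3 shards parameters, gradients,
# and optimizer states, but NOT activations, so this term is paid in full on every rank.
# The 34*s*b*h bytes-per-layer constant is for a standard GPT block; SwiGLU MLPs and
# implementation details shift it somewhat.
def activation_gib(seq_len, hidden=8192, layers=80, micro_batch=1):
    bytes_per_layer = 34 * seq_len * micro_batch * hidden
    return layers * bytes_per_layer / 2**30

for s in (4_000, 8_000, 16_000):
    print(f"seq_len={s:>6}: ~{activation_gib(s):.0f} GiB of activations per GPU")
```

Activation checkpointing lowers the constant substantially, but the growth is still linear in sequence length, which is why longer sequences eventually OOM even at micro-batch size 1 and why sharding the sequence dimension itself is attractive.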

From what I know, ZeRO-3 doesn't support Tensor Parallelism (TP) or Pipeline Parallelism (PP). However, I noticed that the DeepSpeed team has introduced a sequence parallel implementation, as detailed here: DeepSpeed Ulysses. This leads me to two questions:

Thank you for your time and assistance. I look forward to any insights or suggestions you might offer.

yuanenming commented 9 months ago

Oh, I'm already using flash-attn-2.
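For reference, DeepSpeed Ulysses works by wrapping exactly such a local attention kernel (e.g. FlashAttention-2) so that each rank only holds a slice of the sequence. Below is a minimal sketch, assuming DeepSpeed's deepspeed.sequence.layer.DistributedAttention and an already-initialized torch.distributed process group (e.g. under the `deepspeed` or `accelerate launch` launcher); the local attention callable, tensor layout, and group construction are illustrative placeholders, not Accelerate API:

```python
import torch
import torch.distributed as dist
from deepspeed.sequence.layer import DistributedAttention


def local_attention(q, k, v, *args):
    # Placeholder per-rank attention kernel; in practice this would be the model's
    # FlashAttention-2 call. The expected tensor layout depends on DistributedAttention's
    # scatter_idx/gather_idx defaults, so shapes need to be checked against your model.
    return torch.nn.functional.scaled_dot_product_attention(q, k, v)


# Assumption: every rank belongs to one sequence-parallel group (SP size == world size).
sp_group = dist.new_group(ranks=list(range(dist.get_world_size())))

dist_attn = DistributedAttention(local_attention, sp_group)

# Each rank holds only its local slice of the sequence; DistributedAttention performs
# all-to-all exchanges over the heads so attention is still computed over the full
# sequence, which is what shrinks the per-GPU activation footprint.
# q_local, k_local, v_local = ...   # local sequence shards produced by the model
# out = dist_attn(q_local, k_local, v_local)
```

How (or whether) this hooks into Accelerate's DeepSpeed integration is precisely the open question in this issue; the sketch only shows the DeepSpeed-side primitive.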

github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.