huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0

[feature] Use Unified Sequence Parallel (USP) instead of Ring attention #226

Open feifeibear opened 3 weeks ago

feifeibear commented 3 weeks ago

In your roadmap you mention plans for sequence parallelism, specifically an intention to implement ring attention. I suggest you consider implementing Unified Sequence Parallel (USP) instead, which combines Ulysses and Ring into a 2D sequence-parallelism approach. USP offers better performance than using Ring or Ulysses alone.
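To make the 2D layout concrete, here is a minimal sketch of how ranks could be partitioned into Ulysses (head all-to-all) and Ring (KV passing) groups. The function name, group shapes, and degree parameters are illustrative assumptions, not the actual API of nanotron or long-context-attention:

```python
# Hedged sketch of a 2D sequence-parallel rank layout (illustrative only).
# In USP, each rank belongs to one Ulysses group (all-to-all over attention
# heads) and one Ring group (KV blocks circulated peer-to-peer in a ring).

def build_2d_sp_groups(world_size: int, ulysses_degree: int):
    """Partition ranks into Ulysses and Ring groups.

    Adjacent ranks form a Ulysses group (cheap all-to-all, e.g. within a
    node); strided ranks form a Ring group (P2P, e.g. across nodes).
    """
    assert world_size % ulysses_degree == 0
    ring_degree = world_size // ulysses_degree
    ulysses_groups = [
        list(range(i * ulysses_degree, (i + 1) * ulysses_degree))
        for i in range(ring_degree)
    ]
    ring_groups = [
        list(range(j, world_size, ulysses_degree))
        for j in range(ulysses_degree)
    ]
    return ulysses_groups, ring_groups

# Example: 8 GPUs with ulysses_degree=2 gives 4 Ulysses groups of 2 ranks
# and 2 Ring groups of 4 ranks.
ulysses, ring = build_2d_sp_groups(8, 2)
```

In a real implementation each rank list would typically be turned into a communicator (e.g. a `torch.distributed` process group), with the Ulysses degree chosen to fit within a node where all-to-all is cheapest.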

The code we developed has been widely applied to long-sequence training and inference for large language models (LLMs) and DiT models. You can find it at the following link:

https://github.com/feifeibear/long-context-attention

For a detailed technical report, please refer to:

https://arxiv.org/abs/2405.07719

I hope this information is helpful, and I look forward to your team considering this suggestion.