huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Some adjustment for supporting Deepspeed-Ulysses #2877

Open zeyugao opened 3 months ago

zeyugao commented 3 months ago

What does this PR do?

In this PR, I made some necessary modifications to accelerate to achieve sequence parallelism with DeepSpeed-Ulysses, integrated with transformers.

More concretely, it adds the following helpers:

get_sequence_parallel_world_size_or_one  # Return the sequence parallel world size if initialized, otherwise 1
sequence_parallel_is_enabled  # Check whether model parallelism is initialized and the sequence parallel world size is larger than 1
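
For illustration, here is a minimal sketch of what these helpers could look like (the module-level group variable and its initialization are assumptions for the sketch, not necessarily the PR's actual code):

```python
# Minimal sketch of the two helpers; _SEQUENCE_PARALLEL_GROUP is assumed to be
# set when the DeepSpeed-Ulysses sequence parallel group is created.
import torch.distributed as dist

_SEQUENCE_PARALLEL_GROUP = None  # populated during sequence parallel initialization


def get_sequence_parallel_world_size_or_one() -> int:
    """Return the sequence parallel world size if initialized, otherwise 1."""
    if not dist.is_available() or not dist.is_initialized():
        return 1
    if _SEQUENCE_PARALLEL_GROUP is None:
        return 1
    return dist.get_world_size(group=_SEQUENCE_PARALLEL_GROUP)


def sequence_parallel_is_enabled() -> bool:
    """True when parallel state is initialized and the sequence parallel world size exceeds 1."""
    return get_sequence_parallel_world_size_or_one() > 1
```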


Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@muellerzr @BenjaminBossan @SunMarc

SunMarc commented 3 months ago

Thanks for the PR @zeyugao. We will review this PR asap, but just so you know, we have quite a backlog since @pacman100 left, so it might take some time before we get to it. Hope you understand! If anyone is interested in getting this merged asap, please add a thumbs-up!

HuggingFaceDocBuilderDev commented 3 months ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

stas00 commented 2 months ago

BTW, an alternative solution has just been posted here: https://github.com/huggingface/transformers/pull/32305

so perhaps wait until the maintainers decide which version is the best fit before investing more time into improving this PR, @zeyugao (I'm not a maintainer).

zeyugao commented 2 months ago

Thank you for commenting and mentioning that alternative PR.

I acknowledge that parallel_state.py may be too large and redundant, and I can adjust this PR accordingly if needed. If the alternative PR is chosen, I can help identify any behaviors it might mishandle, as my PR is currently being used internally for training models.

Here are some differences between the two PRs for reference (in my understanding); rough code sketches of points 2-4 follow the list:

  1. Utilization of _SeqAllToAll from DeepSpeed:

  2. Sequence Parallel Initialization:

    • Additional preparation is needed to enable sequence parallelism, including initializing the sequence parallel group in DeepSpeed. The alternative PR omits these steps and leaves that customization to the user (unless I missed it).
    • In the alternative PR, groups._get_sequence_parallel_world_size is used, which internally relies on mpu.get_sequence_parallel_world_size. That approach might be more flexible, since the user may not want to use the mpu provided by accelerate.
  3. Batch Size Calculation and Data Preparation:

    • Batch size calculation and other dataset/dataloader preparation should be adjusted accordingly. With sequence parallelism, the data parallel world size is no longer what transformers previously assumed (i.e., data parallel world size == GPU count).
  4. Hijacking the Sequence Parallel Location:

    • When I was developing my PR, _flash_attention_forward lived in each model's xxFlashAttention class in its modeling file, so I had to hijack each attention forward method individually.
    • Now that transformers has switched to a globally defined _flash_attention_forward, it is much better to move the hijacking into that function so that all models benefit from sequence parallelism. However, the length-related variables need to be handled correctly, as there are assumptions about them that must be addressed accordingly.
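
For point 2, here is a minimal sketch of the kind of group initialization that has to happen somewhere. Plain torch.distributed.new_group calls and contiguous group placement are assumptions for illustration; the actual PR wires this through a parallel_state/mpu module instead.

```python
# Sketch of point 2: forming sequence parallel groups (and the matching data
# parallel groups) for a given Ulysses degree. Contiguous placement is assumed.
import torch.distributed as dist


def initialize_sequence_parallel(sp_size: int):
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    assert world_size % sp_size == 0, "world size must be divisible by the Ulysses degree"

    sp_group = dp_group = None
    # Ranks [0..sp_size-1], [sp_size..2*sp_size-1], ... each form a sequence parallel group.
    for start in range(0, world_size, sp_size):
        ranks = list(range(start, start + sp_size))
        group = dist.new_group(ranks)  # every rank must create every group, in the same order
        if rank in ranks:
            sp_group = group
    # The matching data parallel groups stride across the sequence parallel groups.
    for offset in range(sp_size):
        ranks = list(range(offset, world_size, sp_size))
        group = dist.new_group(ranks)
        if rank in ranks:
            dp_group = group
    return sp_group, dp_group
```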
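
For point 3, a toy illustration of why the batch-size math changes, using the helper sketched earlier and hypothetical sizes (8 GPUs, 4-way Ulysses):

```python
# Ranks in the same sequence parallel group consume the same samples, so the
# effective data parallel world size shrinks from the GPU count to
# world_size // sp_size. All numbers here are hypothetical.
import torch.distributed as dist

world_size = dist.get_world_size()                    # e.g. 8 GPUs
sp_size = get_sequence_parallel_world_size_or_one()   # e.g. 4-way Ulysses
dp_size = world_size // sp_size                       # 2, not 8

per_device_batch_size = 2
gradient_accumulation_steps = 4
global_batch_size = per_device_batch_size * gradient_accumulation_steps * dp_size  # 16, not 64

# Dataloader sharding should likewise use dp_size / dp_rank instead of
# world_size / rank, e.g. DistributedSampler(dataset, num_replicas=dp_size, rank=dp_rank),
# with dp_rank = dist.get_rank() // sp_size under the contiguous placement above.
```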
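
For point 4, a rough sketch of hijacking the globally defined _flash_attention_forward with DeepSpeed's _SeqAllToAll. The transformers module path, the (batch, seq, heads, head_dim) layout, and the get_sequence_parallel_group() helper are assumptions, and the length-related arguments mentioned above are deliberately left untouched here:

```python
# Wrap the global _flash_attention_forward so every model gets Ulysses-style
# all-to-all exchanges. This is a sketch, not the PR's exact code.
import transformers.modeling_flash_attention_utils as fa_utils
from deepspeed.sequence.layer import _SeqAllToAll

_orig_flash_attention_forward = fa_utils._flash_attention_forward


def _ulysses_flash_attention_forward(query_states, key_states, value_states, *args, **kwargs):
    if not sequence_parallel_is_enabled():
        return _orig_flash_attention_forward(query_states, key_states, value_states, *args, **kwargs)
    group = get_sequence_parallel_group()  # hypothetical helper returning the Ulysses process group
    # q/k/v arrive as (batch, seq/P, heads, head_dim): gather the sequence dim (1)
    # and scatter the heads dim (2), as DeepSpeed's DistributedAttention does.
    query_states = _SeqAllToAll.apply(group, query_states, 2, 1)
    key_states = _SeqAllToAll.apply(group, key_states, 2, 1)
    value_states = _SeqAllToAll.apply(group, value_states, 2, 1)
    # NOTE: length-related arguments (e.g. query_length, position_ids) would need the
    # adjustments discussed above; they are passed through unchanged in this sketch.
    attn_output = _orig_flash_attention_forward(query_states, key_states, value_states, *args, **kwargs)
    # Reverse the exchange on the output: scatter the sequence dim, gather the heads dim.
    return _SeqAllToAll.apply(group, attn_output, 1, 2)


# Patching must happen before the modeling modules bind the name, since most
# models import _flash_attention_forward directly at import time.
fa_utils._flash_attention_forward = _ulysses_flash_attention_forward
```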
stas00 commented 2 months ago

That's a wonderfully written follow-up, Elsa.

@samadejacobs, could you please have a look at Elsa's comments above and see if anything still needs to be added to your PR - and if so hopefully you can share the credits with Elsa.

One thing I for sure agree with Elsa on is that this is Accelerate and not HF transformers, so we need to support things like the correct world size.

I'm pretty sure the final result should be an amalgamation of these 2 PRs. Since these are 2 different repos - possibly merging Sam's PR first as it's more upstream to Accelerate, and then having Elsa adapt her PR to add the missing parts?

zeyugao commented 2 months ago

One more comment: both accelerate and transformers need to be modified to provide more out-of-the-box usage of sequence parallelism: https://github.com/huggingface/transformers/pull/31525 and this one.

samadejacobs commented 2 months ago

@zeyugao, thanks for your contribution. A few things to note/recommend:

  1. PR32305 is compatible with the recent (latest) flash_attn refactor in transformers models
  2. The PR32305 implementation leverages torch.distributed.device_mesh in lieu of an MPU/parallel state, thereby avoiding code duplication (see the sketch after this list)
  3. A Llama fine-tuning example script is available here (you need both the HF PR and the DS PR)
  4. I recommend that Ulysses-specific bug fixes/enhancements be made as PR(s) to the DeepSpeed repo
  5. I agree with @stas00's suggestion that PR32305 go first and that yours can follow with accelerate support and other nice enhancements.
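
Since PR32305 relies on torch.distributed.device_mesh rather than an MPU, here is a minimal sketch of expressing the data parallel + sequence parallel layout that way (sizes and dimension names are illustrative, not taken from either PR):

```python
# Express data parallel and sequence parallel groups with a device mesh instead
# of an MPU/parallel_state module. Sizes and dim names here are illustrative.
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

world_size = dist.get_world_size()  # assumes torch.distributed is already initialized
sp_size = 4                         # Ulysses degree, illustrative
mesh = init_device_mesh("cuda", (world_size // sp_size, sp_size), mesh_dim_names=("dp", "sp"))

dp_group = mesh.get_group("dp")  # use for gradient all-reduce / dataloader sharding
sp_group = mesh.get_group("sp")  # pass to _SeqAllToAll / DistributedAttention
```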
stas00 commented 2 months ago

@zeyugao, so I suppose https://github.com/huggingface/transformers/pull/31525 will have to be revisited as well once https://github.com/huggingface/transformers/pull/32305 has been merged.