huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Some adjustment for supporting Deepspeed-Ulysses #2877

Open zeyugao opened 3 months ago

zeyugao commented 3 months ago

What does this PR do?

In this PR, I made some necessary modifications to accelerate to achieve sequence parallelism with DeepSpeed-Ulysses, integrated with transformers.

More concretely, it adds the following helpers:

get_sequence_parallel_world_size_or_one  # Return the sequence parallel world size if initialized, otherwise 1
sequence_parallel_is_enabled  # Check whether model parallelism is initialized and the sequence parallel world size is larger than 1
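
For illustration, here is a minimal sketch of what these helpers could look like (the module-level group variable and its initialization are assumptions for the sketch, not necessarily the PR's actual code):

```python
# Minimal sketch of the two helpers; _SEQUENCE_PARALLEL_GROUP is assumed to be
# set when the DeepSpeed-Ulysses sequence parallel group is created.
import torch.distributed as dist

_SEQUENCE_PARALLEL_GROUP = None  # populated during sequence parallel initialization


def get_sequence_parallel_world_size_or_one() -> int:
    """Return the sequence parallel world size if initialized, otherwise 1."""
    if not dist.is_available() or not dist.is_initialized():
        return 1
    if _SEQUENCE_PARALLEL_GROUP is None:
        return 1
    return dist.get_world_size(group=_SEQUENCE_PARALLEL_GROUP)


def sequence_parallel_is_enabled() -> bool:
    """True when parallel state is initialized and the sequence parallel world size exceeds 1."""
    return get_sequence_parallel_world_size_or_one() > 1
```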


Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@muellerzr @BenjaminBossan @SunMarc

SunMarc commented 3 months ago

Thanks for the PR @zeyugao. We will review this PR asap, but just so you know, we have quite a backlog since @pacman100 left, so it might take some time before we get to it. Hope you understand! If anyone is interested in getting this merged asap, please add a thumbs-up!

HuggingFaceDocBuilderDev commented 3 months ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

stas00 commented 2 months ago

BTW, an alternative solution has just been posted here: https://github.com/huggingface/transformers/pull/32305

so perhaps wait until the maintainers decide which version is the best fit before investing more time into improving this PR, @zeyugao (I'm not a maintainer).

zeyugao commented 2 months ago

Thank you for commenting and mentioning that alternative PR.

I acknowledge that parallel_state.py may be too large and redundant, and I can adjust this PR accordingly if needed. If the alternative PR is chosen, I can help identify any behaviors it might mishandle, as my PR is currently being used internally for training models.

Here are some differences between the two PRs for reference (in my understanding); rough code sketches of points 2-4 follow the list:

  1. Utilization of _SeqAllToAll from DeepSpeed:

  2. Sequence Parallel Initialization:

    • Additional preparation is needed to enable sequence parallelism, including initializing the sequence parallel group in DeepSpeed. The alternative PR omits these steps and leaves that customization to the user (unless I missed it).
    • In the alternative PR, groups._get_sequence_parallel_world_size is used, which internally relies on mpu.get_sequence_parallel_world_size. That approach might be more flexible, since the user may not want to use the mpu provided by accelerate.
  3. Batch Size Calculation and Data Preparation:

    • Batch size calculation and other dataset/dataloader preparation should be adjusted accordingly. With sequence parallelism, the data parallel world size is no longer what transformers previously assumed (i.e., data parallel world size == GPU count).
  4. Hijacking the Sequence Parallel Location:

    • When I was developing my PR, _flash_attention_forward lived in each model's xxFlashAttention class in its modeling file, so I had to hijack each attention forward method individually.
    • Now that transformers has switched to a globally defined _flash_attention_forward, it is much better to move the hijacking into that function so that all models benefit from sequence parallelism. However, the length-related variables need to be handled correctly, as there are assumptions about them that must be addressed accordingly.
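
For point 2, here is a minimal sketch of the kind of group initialization that has to happen somewhere. Plain torch.distributed.new_group calls and contiguous group placement are assumptions for illustration; the actual PR wires this through a parallel_state/mpu module instead.

```python
# Sketch of point 2: forming sequence parallel groups (and the matching data
# parallel groups) for a given Ulysses degree. Contiguous placement is assumed.
import torch.distributed as dist


def initialize_sequence_parallel(sp_size: int):
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    assert world_size % sp_size == 0, "world size must be divisible by the Ulysses degree"

    sp_group = dp_group = None
    # Ranks [0..sp_size-1], [sp_size..2*sp_size-1], ... each form a sequence parallel group.
    for start in range(0, world_size, sp_size):
        ranks = list(range(start, start + sp_size))
        group = dist.new_group(ranks)  # every rank must create every group, in the same order
        if rank in ranks:
            sp_group = group
    # The matching data parallel groups stride across the sequence parallel groups.
    for offset in range(sp_size):
        ranks = list(range(offset, world_size, sp_size))
        group = dist.new_group(ranks)
        if rank in ranks:
            dp_group = group
    return sp_group, dp_group
```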
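
For point 3, a toy illustration of why the batch-size math changes, using the helper sketched earlier and hypothetical sizes (8 GPUs, 4-way Ulysses):

```python
# Ranks in the same sequence parallel group consume the same samples, so the
# effective data parallel world size shrinks from the GPU count to
# world_size // sp_size. All numbers here are hypothetical.
import torch.distributed as dist

world_size = dist.get_world_size()                    # e.g. 8 GPUs
sp_size = get_sequence_parallel_world_size_or_one()   # e.g. 4-way Ulysses
dp_size = world_size // sp_size                       # 2, not 8

per_device_batch_size = 2
gradient_accumulation_steps = 4
global_batch_size = per_device_batch_size * gradient_accumulation_steps * dp_size  # 16, not 64

# Dataloader sharding should likewise use dp_size / dp_rank instead of
# world_size / rank, e.g. DistributedSampler(dataset, num_replicas=dp_size, rank=dp_rank),
# with dp_rank = dist.get_rank() // sp_size under the contiguous placement above.
```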
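
For point 4, a rough sketch of hijacking the globally defined _flash_attention_forward with DeepSpeed's _SeqAllToAll. The transformers module path, the (batch, seq, heads, head_dim) layout, and the get_sequence_parallel_group() helper are assumptions, and the length-related arguments mentioned above are deliberately left untouched here:

```python
# Wrap the global _flash_attention_forward so every model gets Ulysses-style
# all-to-all exchanges. This is a sketch, not the PR's exact code.
import transformers.modeling_flash_attention_utils as fa_utils
from deepspeed.sequence.layer import _SeqAllToAll

_orig_flash_attention_forward = fa_utils._flash_attention_forward


def _ulysses_flash_attention_forward(query_states, key_states, value_states, *args, **kwargs):
    if not sequence_parallel_is_enabled():
        return _orig_flash_attention_forward(query_states, key_states, value_states, *args, **kwargs)
    group = get_sequence_parallel_group()  # hypothetical helper returning the Ulysses process group
    # q/k/v arrive as (batch, seq/P, heads, head_dim): gather the sequence dim (1)
    # and scatter the heads dim (2), as DeepSpeed's DistributedAttention does.
    query_states = _SeqAllToAll.apply(group, query_states, 2, 1)
    key_states = _SeqAllToAll.apply(group, key_states, 2, 1)
    value_states = _SeqAllToAll.apply(group, value_states, 2, 1)
    # NOTE: length-related arguments (e.g. query_length, position_ids) would need the
    # adjustments discussed above; they are passed through unchanged in this sketch.
    attn_output = _orig_flash_attention_forward(query_states, key_states, value_states, *args, **kwargs)
    # Reverse the exchange on the output: scatter the sequence dim, gather the heads dim.
    return _SeqAllToAll.apply(group, attn_output, 1, 2)


# Patching must happen before the modeling modules bind the name, since most
# models import _flash_attention_forward directly at import time.
fa_utils._flash_attention_forward = _ulysses_flash_attention_forward
```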
stas00 commented 2 months ago

That's a wonderfully written follow-up, Elsa.

@samadejacobs, could you please have a look at Elsa's comments above and see if anything still needs to be added to your PR - and if so hopefully you can share the credits with Elsa.

One thing I for sure agree with Elsa on is that this is Accelerate and not HF transformers, so we need to support things like the correct world size.

I'm pretty sure the final result should be an amalgamation of these 2 PRs. Since these are 2 different repos - possibly merging Sam's PR first as it's more upstream to Accelerate, and then having Elsa adapt her PR to add the missing parts?

zeyugao commented 2 months ago

One more comment: both accelerate and transformers need to be modified to provide more out-of-the-box usage of sequence parallelism: https://github.com/huggingface/transformers/pull/31525 and this one.

samadejacobs commented 2 months ago

@zeyugao, thanks for your contribution. A few things to note/recommend:

  1. PR32305 is compatible with the recent (latest) flash_attn refactor in transformers models
  2. The PR32305 implementation leverages torch.distributed.device_mesh in lieu of an MPU/parallel state, thereby avoiding code duplication (see the sketch after this list)
  3. A Llama fine-tuning example script is available here (you need both the HF PR and the DS PR)
  4. I recommend that Ulysses-specific bug fixes/enhancements be made as PR(s) to the DeepSpeed repo
  5. I agree with @stas00's suggestion that PR32305 go first and that yours can follow with accelerate support and other nice enhancements.
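
Since PR32305 relies on torch.distributed.device_mesh rather than an MPU, here is a minimal sketch of expressing the data parallel + sequence parallel layout that way (sizes and dimension names are illustrative, not taken from either PR):

```python
# Express data parallel and sequence parallel groups with a device mesh instead
# of an MPU/parallel_state module. Sizes and dim names here are illustrative.
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

world_size = dist.get_world_size()  # assumes torch.distributed is already initialized
sp_size = 4                         # Ulysses degree, illustrative
mesh = init_device_mesh("cuda", (world_size // sp_size, sp_size), mesh_dim_names=("dp", "sp"))

dp_group = mesh.get_group("dp")  # use for gradient all-reduce / dataloader sharding
sp_group = mesh.get_group("sp")  # pass to _SeqAllToAll / DistributedAttention
```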
stas00 commented 2 months ago

@zeyugao, so I suppose https://github.com/huggingface/transformers/pull/31525 will have to be revisited as well once https://github.com/huggingface/transformers/pull/32305 has been merged.