Open kmehant opened 1 month ago
@kmehant if you rebase from main, this should fix the failures (tl;dr: Python 3.8 reached EOL).
@muellerzr I appreciate your response. I would like to bring the two points below to your notice.
For point (1), I can keep this PR simple by allowing only paradigm 1 and address paradigm 2 in another PR. For point (2), I can remove the part that applies TP from this PR, keeping it simple and independent. The removed part can be added in a separate PR once point (2)(i) is completed.
WDYT?
@muellerzr can I work on this https://github.com/huggingface/accelerate/pull/3173#pullrequestreview-2401793359 in a separate PR?
I have fetched and rebased my PR and addressed all the review comments. Thank you.
This feature is really useful, thank you @kmehant. I wonder if it is possible to combine tensor parallel with data parallel after this PR, say, TP for same-node parallelism and DP for multi-node parallelism.
Hi @HoangCongDuc, support for that is on my TODO list but is not covered in this PR. It should come soon, after I discuss it with HF. Thank you.
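For illustration only, the layout asked about above (TP within a node, DP across nodes) can be pictured as a 2D grid of ranks. The sketch below is plain Python and is not part of this PR or of the Accelerate API; the function name `build_2d_layout` and the node-major rank numbering are assumptions made for the example.

```python
from typing import List, Tuple


def build_2d_layout(num_nodes: int, gpus_per_node: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Return (tp_groups, dp_groups) of global ranks for a hypothetical 2D layout.

    Assumes node-major numbering: rank = node * gpus_per_node + local_rank.
    """
    # TP groups: all GPUs on the same node shard one model replica.
    tp_groups = [
        [node * gpus_per_node + local for local in range(gpus_per_node)]
        for node in range(num_nodes)
    ]
    # DP groups: GPUs with the same local rank on different nodes hold
    # the same model shard and see different data.
    dp_groups = [
        [node * gpus_per_node + local for node in range(num_nodes)]
        for local in range(gpus_per_node)
    ]
    return tp_groups, dp_groups


tp_groups, dp_groups = build_2d_layout(num_nodes=2, gpus_per_node=4)
print(tp_groups)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(dp_groups)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

Keeping TP groups inside a node confines the heavier tensor-parallel communication to the intra-node interconnect, which is the motivation behind the question.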
What does this PR do?
Adds TorchTensorParallelPlugin to support TP with PyTorch 2.0. Please review this work in conjunction with the PR https://github.com/huggingface/transformers/pull/34194.
Results
TP shows a significant improvement in both memory and throughput compared with single-GPU training and with FSDP, across different settings (checkpointing on/off) and context lengths.
Done on two models
The tables below show the max CUDA memory and throughput for various configurations, showing the potential of the TP support contributed in this PR. There are gains in both memory and throughput.
Note: Please be aware that the effective TPS for FSDP is the reported TPS multiplied by the parallel factor (the number of GPUs/devices engaged in distributed training), whereas that is not the case with TP. Therefore, when effective throughput is considered, FSDP comes out ahead of TP. However, that may be compensated for by using TP's memory gains to increase the batch size.
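The arithmetic behind this note can be sketched as below. The helper `effective_tps` is hypothetical and the throughput numbers are placeholders chosen for the example, not measurements from this PR.

```python
def effective_tps(reported_tps: float, num_devices: int, strategy: str) -> float:
    """Scale a reported tokens-per-second figure to an effective one.

    Assumption for this sketch: under FSDP every device processes its own
    batch, so the reported figure is multiplied by the device count; under
    TP all devices cooperate on the same batch, so it is not.
    """
    if strategy == "fsdp":
        return reported_tps * num_devices
    if strategy == "tp":
        return reported_tps
    raise ValueError(f"unknown strategy: {strategy!r}")


# Placeholder numbers, for illustration only.
fsdp = effective_tps(reported_tps=1000.0, num_devices=4, strategy="fsdp")
tp = effective_tps(reported_tps=1500.0, num_devices=4, strategy="tp")
print(fsdp, tp)  # 4000.0 1500.0
```

In this made-up case TP has the higher reported figure but the lower effective one, which is the situation the note describes.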
Fixes https://github.com/huggingface/transformers/issues/32470
Before submitting
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
I have cycles to follow up on this PR with further improvements that bring PyTorch TP support to HF. Looking forward to your review. Thank you.