Open taozhiwei opened 3 weeks ago
@siddharth9820, can you help review?
At a quick glance, this looks good to me. We did use the same "optimized execution order" in Deepspeed-TED (https://dl.acm.org/doi/pdf/10.1145/3577193.3593704), but I think that update got lost in an unmerged branch. Thank you @taozhiwei for implementing this!
Thank you for your review. I just updated to the mainline code; could you run the tests again? Thank you.
@siddharth9820 @tjruwase Please help review again when you have free time. Thank you very much.
@taozhiwei still lgtm. Do you have some convergence curves for your changes?
I ran a test locally and it still converged
Can you please post the loss curves for a model before and after your changes? If those are identical then this PR should be good to go.
@siddharth9820 https://github.com/microsoft/DeepSpeed/actions/runs/9605259289/job/26492504473?pr=5626 This test failed due to network issues and needs to be triggered CI test again. Thank you
> Can you please post the loss curves for a model before and after your changes? If those are identical then this PR should be good to go.

How do I insert the images? I tried a few times but was unsuccessful.
@taozhiwei you can take a screenshot and paste it here. Or maybe you can upload it to a shared gdrive location and share that with us?
Here is a comparison of the loss curves before and after the modification; they match.
https://imgur.com/Nhj7c1m
Please help review again @siddharth9820 @tjruwase
Thanks for doing this. LGTM. @tjruwase do we need any other tests?
@taozhiwei, thanks for the PR. This is really a great contribution.
@siddharth9820, thanks for helping to review.
Approved.
The first failed test was due to an HTTP 429 error: https://github.com/microsoft/DeepSpeed/actions/runs/9698089296/job/26763816372?pr=5626. The second failed test passed when I ran it locally: https://imgur.com/v2eMEox. In any case, my PR should not affect that failing test. Can you help trigger the CI again? Thank you! @siddharth9820
@tjruwase or someone else working at Deepspeed might be able to help you with CI
@taozhiwei, apologies for the delay due to our CI issues. We will merge this asap.
Example (E + M + D parallelism):
world_size = 8, model_degree = 2, expert_degree = 4
mp_group = [0, 1], [2, 3], [4, 5], [6, 7]
expert_parallel_group = [0, 2, 4, 6], [1, 3, 5, 7]
In the original execution order, there was no drop operation before the expert computation: the two expert-parallel groups each performed a full all-to-all, and both ended up with the complete data. Ranks 0 and 1 obtained exactly the same data, and likewise ranks 2 and 3, and so on. Therefore, we can drop the duplicated data before the all-to-all, and then run an all-gather after the all-to-all to recover the complete data.
After the expert computation, the data on ranks 0 and 1 is again identical, so we can drop it, run the all-to-all, and then run an all-gather to recover the complete data.
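The data movement above can be sketched with a small plain-Python simulation. This is a hypothetical illustration, not DeepSpeed code: the group layout follows the example (world_size = 8, model_degree = 2, expert_degree = 4), and all function names (`ep_all_to_all`, `drop`, `mp_all_gather`) are made up for the sketch. It checks that the optimized order (drop → all-to-all → all-gather) delivers every rank the same tokens as the original full all-to-all.

```python
# Hypothetical simulation of the optimized execution order (not DeepSpeed code).
from itertools import chain

WORLD = 8
MODEL_DEGREE = 2    # size of each model (tensor) parallel group
EXPERT_DEGREE = 4   # size of each expert parallel group
MP_GROUPS = [[0, 1], [2, 3], [4, 5], [6, 7]]
EP_GROUPS = [[0, 2, 4, 6], [1, 3, 5, 7]]

def split(xs, n):
    """Split a list into n equal contiguous chunks."""
    k = len(xs) // n
    return [xs[i * k:(i + 1) * k] for i in range(n)]

def ep_all_to_all(group, data):
    """All-to-all within one expert-parallel group: the rank at position p
    receives chunk p from every rank in the group."""
    chunks = {r: split(data[r], EXPERT_DEGREE) for r in group}
    return {r: list(chain.from_iterable(chunks[s][p] for s in group))
            for p, r in enumerate(group)}

def mp_index(rank):
    """Position of a rank inside its model-parallel group."""
    return next(g.index(rank) for g in MP_GROUPS if rank in g)

def drop(xs, m):
    """Drop duplicated data: keep only the m-th slice of each expert chunk."""
    return list(chain.from_iterable(
        split(chunk, MODEL_DEGREE)[m] for chunk in split(xs, EXPERT_DEGREE)))

def mp_all_gather(shards):
    """All-gather across each model-parallel group to restore complete data."""
    out = {}
    for g in MP_GROUPS:
        merged = list(chain.from_iterable(shards[r] for r in g))
        for r in g:
            out[r] = merged
    return out

# Ranks in the same mp group start with identical tokens (they are replicas).
data = {r: [f"g{gid}t{i}" for i in range(EXPERT_DEGREE * MODEL_DEGREE)]
        for gid, grp in enumerate(MP_GROUPS) for r in grp}

# Original order: each ep group runs a full all-to-all, moving duplicates.
base = {}
for grp in EP_GROUPS:
    base.update(ep_all_to_all(grp, data))

# Optimized order: drop duplicates, all-to-all on the shard, then all-gather.
dropped = {r: drop(data[r], mp_index(r)) for r in range(WORLD)}
shards = {}
for grp in EP_GROUPS:
    shards.update(ep_all_to_all(grp, dropped))
opt = mp_all_gather(shards)

# Every rank ends up with the same set of tokens as the baseline (token order
# within an expert's batch may differ, which does not change the expert's
# per-token computation), and mp replicas stay identical.
for r in range(WORLD):
    assert sorted(opt[r]) == sorted(base[r])
assert opt[0] == opt[1]
print("optimized order matches baseline")
```

Note that the optimized path moves only `1 / model_degree` of the data through the all-to-all, which is where the bandwidth saving comes from; the all-gather runs over the (typically fast, intra-node) model-parallel group.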