Open taozhiwei opened 3 weeks ago
@siddharth9820, can you help review?
At a quick glance, this looks good to me. We did use the same "optimized execution order" in Deepspeed-TED (https://dl.acm.org/doi/pdf/10.1145/3577193.3593704), but I think that update got lost in an unmerged branch. Thank you @taozhiwei for implementing this!
Thank you for your review. I just updated to the mainline code; could you run the tests again? Thank you.
@siddharth9820 @tjruwase Please help review again when you have free time. Thank you very much.
@taozhiwei still lgtm. Do you have some convergence curves for your changes?
I ran a test locally and it still converged
Can you please post the loss curves for a model before and after your changes? If those are identical then this PR should be good to go.
@siddharth9820 https://github.com/microsoft/DeepSpeed/actions/runs/9605259289/job/26492504473?pr=5626 This test failed due to network issues and needs to be triggered CI test again. Thank you
> Can you please post the loss curves for a model before and after your changes? If those are identical then this PR should be good to go.

How do I insert the images? I tried a few times but was unsuccessful.
@taozhiwei you can take a screenshot and paste it here. Or maybe you can upload it to a shared gdrive location and share that with us?
Here is a comparison of the loss curves before and after the modification; they match.
https://imgur.com/Nhj7c1m
Please help review again @siddharth9820 @tjruwase
Thanks for doing this. LGTM. @tjruwase do we need any other tests?
@taozhiwei, thanks for the PR. This is really a great contribution.
@siddharth9820, thanks for helping to review.
Approved.
The first failed test was due to an HTTP 429 error: https://github.com/microsoft/DeepSpeed/actions/runs/9698089296/job/26763816372?pr=5626. The second failed test passed when I ran it locally: https://imgur.com/v2eMEox. In any case, my PR should not affect that failing test. Can you help trigger the CI again? Thank you! @siddharth9820
@tjruwase or someone else working at Deepspeed might be able to help you with CI
@taozhiwei, apologies for the delay due to our CI issues. We will merge this asap.
Example (E + M + D parallelism):
world_size = 8, model_degree = 2, expert_degree = 4
mp_group = [0, 1], [2, 3], [4, 5], [6, 7]
expert_parallel_group = [0, 2, 4, 6], [1, 3, 5, 7]
In the original execution order, there was no drop operation before the expert computation: the two expert-parallel groups each performed a full all-to-all, and both ended up with the complete data. Ranks 0 and 1 obtained exactly the same data, and likewise ranks 2 and 3, and so on. Therefore, we can drop the duplicated data before the all-to-all, and then run an all-gather after the all-to-all to recover the complete data.
After the expert computation, the data on ranks 0 and 1 is again identical, so we can drop it, run the all-to-all, and then run an all-gather to recover the complete data.
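The data movement above can be sketched with a small plain-Python simulation. This is a hypothetical illustration, not DeepSpeed code: the group layout follows the example (world_size = 8, model_degree = 2, expert_degree = 4), and all function names (`ep_all_to_all`, `drop`, `mp_all_gather`) are made up for the sketch. It checks that the optimized order (drop → all-to-all → all-gather) delivers every rank the same tokens as the original full all-to-all.

```python
# Hypothetical simulation of the optimized execution order (not DeepSpeed code).
from itertools import chain

WORLD = 8
MODEL_DEGREE = 2    # size of each model (tensor) parallel group
EXPERT_DEGREE = 4   # size of each expert parallel group
MP_GROUPS = [[0, 1], [2, 3], [4, 5], [6, 7]]
EP_GROUPS = [[0, 2, 4, 6], [1, 3, 5, 7]]

def split(xs, n):
    """Split a list into n equal contiguous chunks."""
    k = len(xs) // n
    return [xs[i * k:(i + 1) * k] for i in range(n)]

def ep_all_to_all(group, data):
    """All-to-all within one expert-parallel group: the rank at position p
    receives chunk p from every rank in the group."""
    chunks = {r: split(data[r], EXPERT_DEGREE) for r in group}
    return {r: list(chain.from_iterable(chunks[s][p] for s in group))
            for p, r in enumerate(group)}

def mp_index(rank):
    """Position of a rank inside its model-parallel group."""
    return next(g.index(rank) for g in MP_GROUPS if rank in g)

def drop(xs, m):
    """Drop duplicated data: keep only the m-th slice of each expert chunk."""
    return list(chain.from_iterable(
        split(chunk, MODEL_DEGREE)[m] for chunk in split(xs, EXPERT_DEGREE)))

def mp_all_gather(shards):
    """All-gather across each model-parallel group to restore complete data."""
    out = {}
    for g in MP_GROUPS:
        merged = list(chain.from_iterable(shards[r] for r in g))
        for r in g:
            out[r] = merged
    return out

# Ranks in the same mp group start with identical tokens (they are replicas).
data = {r: [f"g{gid}t{i}" for i in range(EXPERT_DEGREE * MODEL_DEGREE)]
        for gid, grp in enumerate(MP_GROUPS) for r in grp}

# Original order: each ep group runs a full all-to-all, moving duplicates.
base = {}
for grp in EP_GROUPS:
    base.update(ep_all_to_all(grp, data))

# Optimized order: drop duplicates, all-to-all on the shard, then all-gather.
dropped = {r: drop(data[r], mp_index(r)) for r in range(WORLD)}
shards = {}
for grp in EP_GROUPS:
    shards.update(ep_all_to_all(grp, dropped))
opt = mp_all_gather(shards)

# Every rank ends up with the same set of tokens as the baseline (token order
# within an expert's batch may differ, which does not change the expert's
# per-token computation), and mp replicas stay identical.
for r in range(WORLD):
    assert sorted(opt[r]) == sorted(base[r])
assert opt[0] == opt[1]
print("optimized order matches baseline")
```

Note that the optimized path moves only `1 / model_degree` of the data through the all-to-all, which is where the bandwidth saving comes from; the all-gather runs over the (typically fast, intra-node) model-parallel group.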