I think we ignored the `T_ij` term in the actual computation. So we are using `B_k = B_ws[i + 1]` as the approximate maximum amount of work for the workers that are not shadowed. We assume that the shadowed experts' tokens are quite evenly distributed across all workers, so the cost is negligible.

// @TiagoMAntunes please correct me if I understand your code incorrectly.
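For reference, the selection loop can be sketched like this (a minimal Python sketch, not the actual FasterMoE code; `B_ws`, `comp_time`, `send_feature_time` and `shadow_time` follow the names used in this thread, and `send_model_time` is assumed to be the cost of broadcasting one expert's parameters):

```python
def pick_num_shadowed(B_ws, comp_time, send_feature_time, send_model_time):
    """Sketch of the shadow-selection loop discussed in this thread.

    B_ws holds the per-expert token counts sorted in descending order,
    so after shadowing the i + 1 busiest experts, B_ws[i + 1] is the
    largest remaining (non-shadowed) workload. The T_ij term (each
    worker's share of the shadowed experts' tokens) is ignored here,
    matching the approximation described above.
    """
    # Baseline: shadow nothing; the busiest expert dominates the latency.
    best_lat = 3 * comp_time * B_ws[0] + 4 * send_feature_time * B_ws[0]
    best_k = 0
    shadow_time = 0.0
    for i in range(len(B_ws) - 1):
        shadow_time += send_model_time  # one more expert model to broadcast
        B_k = B_ws[i + 1]               # max load among non-shadowed experts
        lat_new = 3 * comp_time * B_k + 4 * send_feature_time * B_k + shadow_time
        if lat_new < best_lat:
            best_lat, best_k = lat_new, i + 1
    return best_k  # how many of the busiest experts to shadow
```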
OK. But I think ignoring `T_ij` will affect the accuracy, even if the shadowed experts' tokens are quite evenly distributed across all workers. For example: expert0 needs to be shadowed, and its tokens are of course quite evenly distributed across all workers (6 tokens on every worker), so the max computation for every worker is the expert0 GeMM. If we ignore `T_ij` and just use `B_k = B_ws[i + 1]`, then in `lat_new = 3 * comp_time * B_k + 4 * send_feature_time * B_k + shadow_time` the max comp time accounted for is the expert1 GeMM. Won't that harm the accuracy?
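Concretely, with some illustrative numbers (not from the paper):

```python
# Illustrative numbers only: 4 workers, expert0 shadowed with 6 tokens
# on every worker, expert1 with 4 tokens in total.
world_size = 4
local_tokens_expert0 = 6            # per worker, after shadowing expert0
B_ws = [6 * world_size, 4]          # sorted batch sizes: expert0, expert1

B_k = B_ws[0 + 1]                   # = 4, i.e. the expert1 GeMM
# The real per-worker computation is dominated by expert0's 6 local
# tokens, but lat_new only charges comp_time for B_k = 4 expert1
# tokens, so ignoring T_ij under-estimates the computation term.
```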
Additionally, does `send_feature_time` ignore that the tokens selecting a local expert don't need to be sent?
Final question: the shadow expert is sent to all other workers, so why doesn't `send_model_time` need to be multiplied by `num_workers - 1`?

Thanks :) !!!
> Expert0 needs to be shadowed, and its tokens are of course quite evenly distributed across all workers (6 tokens on every worker), so the max computation for every worker is the expert0 GeMM. If we ignore `T_ij` and just use `B_k = B_ws[i + 1]`, then in `lat_new = 3 * comp_time * B_k + 4 * send_feature_time * B_k + shadow_time` the max comp time accounted for is the expert1 GeMM. Won't that harm the accuracy?
I get your point. We should add the computation time of the shadowed experts to `shadow_time` here.
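In terms of the sketch earlier in this thread, the fix would look something like this (still hypothetical names; `shadow_share(i)` is a made-up helper standing for the per-worker token share of the i-th shadowed expert, i.e. the `T_ij` term):

```python
# Inside the loop of the earlier sketch: besides the broadcast cost,
# also charge the GeMM time the shadowed expert still costs on every
# worker. shadow_share(i) is a hypothetical helper returning the
# per-worker token share of the i-th shadowed expert (the T_ij term).
shadow_time += send_model_time + 3 * comp_time * shadow_share(i)
```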
> Additionally, does `send_feature_time` ignore that the tokens selecting a local expert don't need to be sent?
Yes, that is ignored. We assume the resulting error is only about `1 / world_size`.
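The approximation can be written out like this (a sketch with illustrative values; with tokens spread evenly, roughly `1 / world_size` of them pick a local expert and never touch the network):

```python
world_size = 8
B_k, send_feature_time = 4096, 1e-7     # illustrative values only

comm_model = 4 * send_feature_time * B_k                 # charges all tokens
comm_real = comm_model * (world_size - 1) / world_size   # local tokens stay put
overestimate = (comm_model - comm_real) / comm_model     # == 1 / world_size
```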
> Final question: the shadow expert is sent to all other workers, so why doesn't `send_model_time` need to be multiplied by `num_workers - 1`?
No. A proper broadcast algorithm should deliver the message to any number of receivers in identical latency.
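As a rough cost model (a sketch using standard alpha-beta estimates with illustrative constants, not FasterMoE code):

```python
import math

ALPHA = 5e-6        # per-message latency in seconds (illustrative)
BW = 50e9 / 8       # link bandwidth in bytes/s (illustrative)

def p2p_bcast_time(msg_bytes, world_size):
    # Naive: the root sends the model to each receiver one by one,
    # which is where the intuition of multiplying by num_workers - 1
    # comes from.
    return (world_size - 1) * (ALPHA + msg_bytes / BW)

def tree_bcast_time(msg_bytes, world_size):
    # A binomial-tree broadcast needs only ceil(log2(p)) rounds, and a
    # pipelined (ring) broadcast of a large model approaches a single
    # ALPHA + msg_bytes / BW regardless of the receiver count, which is
    # why send_model_time is not multiplied by num_workers - 1.
    return math.ceil(math.log2(world_size)) * (ALPHA + msg_bytes / BW)
```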
Got it! Thanks! By the way, what do you think of Tutel (or Megatron-DeepSpeed, which uses dp+tp+ep in its MoE layers)? In my opinion, Tutel is better at scalability, as it uses a fixed but searchable parallel solution, while FasterMoE is more elegant and fine-grained, but perhaps not as good at scalability (I am not sure). Have you done any experiments to compare them? Please correct me if I misunderstand something! :)
> Got it! Thanks! By the way, what do you think of Tutel (or Megatron-DeepSpeed, which uses dp+tp+ep in its MoE layers)?
The FasterMoE paper (PPoPP'22) focuses on optimizing EP. Meanwhile, FastMoE, as a constantly developed system, aims at supporting EP with any hybrid parallel strategy (dp / tp / ep / pp / sp / etc.). See the documentation for details.

An ATC'23 paper, SmartMoE, presents a hybrid parallel system based on FastMoE. It outperforms Tutel / DS MoE.

In terms of scalability, if you are talking about using thousands of GPUs (or even more) to train an MoE model, it is true that simply using EP is not efficient, because all-to-all (a2a) is not an efficient collective communication pattern. Still, you can find a work named BaGuaLu that uses more than 100,000 processes to train an MoE model using FastMoE and a hybrid parallel strategy.
Got it! Thanks!
Hello! I have read the FasterMoE paper and source code, but I am confused about where this is implemented in the shadow policy algorithm:

Thanks!