laekov / fastmoe

A fast MoE impl for PyTorch
https://fastmoe.ai
Apache License 2.0

FasterMoE Shadow Policy: Detailed Inquiry #204

Closed: Guodanding closed this issue 7 months ago

Guodanding commented 7 months ago

Hello! I have read the FasterMoE paper and the source code, but I am confused about where the formula shown in the image is implemented in the shadow policy algorithm:

def global_policy(local_expert_count, _gec, num_expert, world_size):
    r"""
    This is the policy for two-layer MLPs, using the formula in the PPoPP paper.
    A few parameters are used in this policy.
    * `d_model`: feature length of the MLP input and output.
    * `alpha`: the ratio of the MLP's hidden size to `d_model`.
    * `bw_net`: bandwidth of the network (GBps)
    * `bw_mm`: computation throughput of performing GeMM (FLOPs)
    """
    bw_net = float_from_env('FMOE_FASTER_GLBPLC_NETBW', 50 * 1e9 / 8)
    bw_mm = float_from_env('FMOE_FASTER_GLBPLC_GPUTP', 11.5e12)
    alpha = float_from_env('FMOE_FASTER_GLBPLC_ALPHA', 2)
    d_model = float_from_env('FMOE_FASTER_GLBPLC_DMODEL', 2048)

    moe_group = get_moe_group()
    local_expert_count = local_expert_count.cuda()
    agecs = [torch.empty_like(local_expert_count) for _ in range(world_size)]
    dist.all_gather(agecs, local_expert_count, group=moe_group)
    all_global_expert_count = torch.stack(agecs)

    # TODO: data type other than float
    data_size = 4 

    fwd_expert_counts = all_global_expert_count.sum(1).cpu()
    B_ws, indices = fwd_expert_counts.flatten().sort(0, descending=True)

    alphaH2 = alpha * (d_model ** 2)
    B_w = B_ws[0]

    comm = float('+inf')
    send_feature_time = d_model * data_size / bw_net
    send_model_time = 2 * alphaH2 * data_size / bw_net
    comp_time = 4 * alphaH2 / bw_mm
    lat_base = 3 * comp_time * B_w + 4 * send_feature_time * B_w

    res = torch.zeros(world_size * num_expert, dtype=torch.bool)
    shadow_time = 0

    for i, index in enumerate(indices):
        if i + 1 == indices.numel():
            break
        B_k = B_ws[i + 1]
        shadow_time += send_model_time
        lat_new = 3 * comp_time * B_k + 4 * send_feature_time * B_k + shadow_time

        if lat_new < lat_base:
            lat_base = lat_new
            res[index] = True
        else:
            break
    return res

Thanks!

laekov commented 7 months ago

I think we ignored the T_ij term in the actual computation. So we are using B_k = B_ws[i + 1] as the approximate maximum amount of work for the workers that are not shadowed.

We assume that the shadowed experts' tokens are quite evenly distributed across all workers, so that cost is negligible.
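Spelling the comparison out with the variable names from the code above (this is just a restatement of what the loop computes, not notation from the paper): with B_ws being the per-expert token counts sorted in descending order, after taking the i + 1 most loaded experts as shadow candidates the loop checks

lat_new = 3 * comp_time * B_ws[i + 1] + 4 * send_feature_time * B_ws[i + 1] + (i + 1) * send_model_time

against the current lat_base and keeps shadowing while lat_new is smaller. The T_ij term of the paper, i.e. the tokens of the already-shadowed experts that each worker now computes locally, does not appear in lat_new.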

// @TiagoMAntunes please correct me if I understand your code incorrectly.

Guodanding commented 7 months ago

OK. But I think ignoring T_ij will affect the accuracy, even if the shadowed experts' tokens are quite evenly distributed across all workers. For example,

[image: token distribution example, where expert0 receives 6 tokens from every worker]

Expert0 needs to be shadowed, and its tokens are of course quite evenly distributed across all workers (6 tokens from every worker), so the largest computation on every worker is expert0's GeMM. If we ignore T_ij and just use B_k = B_ws[i + 1] in lat_new = 3 * comp_time * B_k + 4 * send_feature_time * B_k + shadow_time, the computation term only reflects expert1's GeMM. Won't that hurt the estimate?
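To make the concern concrete, here is a rough sketch. Only the 6 tokens per worker for expert0 come from the figure; world_size and expert1's load are made-up numbers for illustration:

# Hypothetical numbers: 4 workers, one expert per worker.
# Expert0 receives 6 tokens from every worker (as in the figure above);
# expert1, the next most loaded expert, receives 8 tokens in total.
world_size = 4
expert0_tokens_per_worker = 6
B_k = 8  # B_ws[i + 1], expert1's total load

# What the policy's computation term counts after shadowing expert0:
estimated_max_tokens = B_k  # = 8

# What the worker owning expert1 actually computes: its local share of the
# shadowed expert0 plus all of expert1's tokens.
actual_max_tokens = expert0_tokens_per_worker + B_k  # = 14

print(estimated_max_tokens, actual_max_tokens)

So the worker that owns expert1 does almost twice the GeMM work that lat_new accounts for, which is exactly the ignored T_ij.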

Additionally, does send_feature_time ignore the fact that tokens selecting the local expert don't need to be sent?

Final question: the shadowed expert is sent to all other workers, so why doesn't send_model_time need to be multiplied by num_workers - 1?

Thanks :) !!!

laekov commented 7 months ago

Expert0 needs to be shadowed, and its tokens are of course quite evenly distributed across all workers (6 tokens from every worker), so the largest computation on every worker is expert0's GeMM. If we ignore T_ij and just use B_k = B_ws[i + 1] in lat_new = 3 * comp_time * B_k + 4 * send_feature_time * B_k + shadow_time, the computation term only reflects expert1's GeMM. Won't that hurt the estimate?

I get your point. We should add the computation time of the shadowed experts to shadow_time here.
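As a sketch of one possible adjustment (not a committed patch, and the exact form of the correction is up for discussion): since the shadowed expert's tokens are assumed to be roughly evenly distributed, each worker would additionally compute about B_ws[i] / world_size of them locally, so the loop in global_policy could accumulate that into shadow_time as well:

for i, index in enumerate(indices):
    if i + 1 == indices.numel():
        break
    B_k = B_ws[i + 1]
    shadow_time += send_model_time
    # Hypothetical extra term: the shadowed expert's GeMMs now run locally on
    # every worker, roughly B_ws[i] / world_size tokens each under the
    # even-distribution assumption.
    shadow_time += 3 * comp_time * float(B_ws[i]) / world_size
    lat_new = 3 * comp_time * B_k + 4 * send_feature_time * B_k + shadow_time

    if lat_new < lat_base:
        lat_base = lat_new
        res[index] = True
    else:
        break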

Additionally, does send_feature_time ignore the fact that tokens selecting the local expert don't need to be sent?

Yes, that is ignored. We assume the error is only about 1/world_size (e.g. roughly 12.5% of the tokens when world_size is 8).

Final question: the shadowed expert is sent to all other workers, so why doesn't send_model_time need to be multiplied by num_workers - 1?

No. A proper broadcast algorithm should complete the broadcast to any number of receivers in essentially identical latency.
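For some intuition, under the usual alpha-beta communication cost model (an assumption about the cost model, not about the exact algorithm the communication library chooses): a good broadcast implementation, e.g. scatter-then-allgather or a pipelined tree, moves only on the order of m bytes over the bottleneck link regardless of the number of receivers, so its cost is roughly

T_bcast ≈ (latency terms) + c * m / bw_net, with a small constant c that does not depend on P,

whereas sending the expert weights point-to-point from the owner to every other worker would cost about (P - 1) * m / bw_net. That is why send_model_time is not multiplied by num_workers - 1.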

Guodanding commented 7 months ago

Got it! Thanks! By the way, what do you think of Tutel (or Megatron-DeepSpeed, which uses dp + tp + ep in its MoE layers)? In my opinion, Tutel is better at scalability, as it uses a fixed but searchable parallel plan, while FasterMoE is more elegant and fine-grained but perhaps not as good at scalability (I am not sure). Have you done any experiments comparing them? Please correct me if I misunderstand something! :)

laekov commented 7 months ago

Got it! Thanks! By the way, what do you think of Tutel (or Megatron-DeepSpeed, which uses dp + tp + ep in its MoE layers)?

The FasterMoE paper (PPoPP'22) focuses on optimizing EP. Meanwhile, FastMoE, as a constantly developed system, aims at supporting EP combined with any hybrid parallel strategy (dp / tp / ep / pp / sp / etc.). See the documentation for details.

An ATC'23 paper, SmartMoE, presents a hybrid parallel system based on FastMoE. It outperforms Tutel / DeepSpeed MoE.

laekov commented 7 months ago

In terms of scalability, if you are talking about using thousands of GPUs (or even more) to train an MoE model, it is true that simply using EP is not efficient, because all-to-all (a2a) is not an efficient form of collective communication at that scale. Still, you can find a work named Bagualu that uses more than 100,000 processes to train an MoE model using FastMoE and a hybrid parallel strategy.

Guodanding commented 7 months ago

Got it! Thanks!