laekov / fastmoe

A fast MoE impl for PyTorch
https://fastmoe.ai

MOELinear always returns a zero tensor for bf16 input #170

Closed. xptree closed this issue 1 year ago

xptree commented 1 year ago

fastmoe version: 1.0.2

In the following example, we have 32768 input tokens and 8 experts; the load is perfectly balanced, so each expert handles 4096 tokens.

The observation is that out_bf16 is always zero, while the fp16 and fp32 outputs below are nonzero.

import fmoe
import torch

inp = torch.zeros([32768, 128]).cuda().bfloat16().normal_(mean=0.0, std=1e-2)        # [tokens, in_feat]
fwd_expert_count = torch.LongTensor([4096] * 8).cpu()                                # 4096 tokens per expert
weight = torch.zeros([8, 4096, 128]).cuda().bfloat16().normal_(mean=0.0, std=1e-2)   # [experts, out_feat, in_feat]

out_bf16 = fmoe.linear.MOELinear.apply(inp, fwd_expert_count, weight, None)
print(out_bf16)

out_fp16 = fmoe.linear.MOELinear.apply(inp.half(), fwd_expert_count, weight.half(), None)
print(out_fp16)

out_fp32 = fmoe.linear.MOELinear.apply(inp.float(), fwd_expert_count, weight.float(), None)
print(out_fp32)

Outputs look like:

out_bf16:
tensor([[0., 0., 0.,  ..., 0., 0., -0.],
        [-0., 0., 0.,  ..., 0., -0., -0.],
        [-0., 0., 0.,  ..., 0., -0., 0.],
        ...,
        [0., 0., -0.,  ..., 0., -0., 0.],
        [-0., 0., -0.,  ..., 0., 0., 0.],
        [0., 0., -0.,  ..., -0., 0., 0.]], device='cuda:0',
       dtype=torch.bfloat16)

out_fp16:
tensor([[-1.1816e-03, -1.0592e-04,  2.1780e-04,  ...,  1.7776e-03,
          1.7338e-03,  4.0793e-04],
        [ 2.2566e-04,  1.4105e-03,  5.0735e-04,  ...,  7.4446e-05,
         -1.6594e-03, -1.0805e-03],
        [ 4.9591e-05,  2.5082e-03, -3.7479e-04,  ..., -7.3338e-04,
         -1.0643e-03, -2.0254e-04],
        ...,
        [ 7.0953e-04,  1.1854e-03, -1.1234e-03,  ...,  1.7862e-03,
         -1.4248e-03, -8.0442e-04],
        [-9.5320e-04, -5.2452e-04, -1.5764e-03,  ...,  2.9354e-03,
          8.9931e-04,  5.0020e-04],
        [ 2.1124e-04,  8.0729e-04,  6.5708e-04,  ..., -2.7752e-04,
          1.9302e-03,  8.2016e-04]], device='cuda:0', dtype=torch.float16)

out_fp32:
tensor([[-1.1816e-03, -1.0590e-04,  2.1770e-04,  ...,  1.7774e-03,
          1.7350e-03,  4.0823e-04],
        [ 2.2582e-04,  1.4095e-03,  5.0732e-04,  ...,  7.4334e-05,
         -1.6600e-03, -1.0810e-03],
        [ 5.0056e-05,  2.5109e-03, -3.7395e-04,  ..., -7.3320e-04,
         -1.0637e-03, -2.0104e-04],
        ...,
        [ 7.0883e-04,  1.1854e-03, -1.1226e-03,  ...,  1.7845e-03,
         -1.4251e-03, -8.0487e-04],
        [-9.5288e-04, -5.2501e-04, -1.5780e-03,  ...,  2.9358e-03,
          8.9895e-04,  5.0004e-04],
        [ 2.1135e-04,  8.0700e-04,  6.5658e-04,  ..., -2.7742e-04,
          1.9310e-03,  8.2020e-04]], device='cuda:0')
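
For comparison, here is a minimal sanity check (my own snippet, not part of fastmoe), assuming MOELinear computes x @ W^T per expert like torch.nn.Linear. Computing the per-expert matmul directly in PyTorch gives nonzero bf16 results, so bf16 GEMM itself works and the zeros come from the MOELinear path:

ref_bf16 = torch.cat([
    chunk @ w.t()  # [4096, 128] @ [128, 4096] -> [4096, 4096] for each expert
    for chunk, w in zip(inp.split(4096, dim=0), weight)
])
print(ref_bf16)  # nonzero, roughly matching out_fp32 up to bf16 rounding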
xptree commented 1 year ago

This PR (https://github.com/laekov/fastmoe/pull/171) seems to fix the above bug. Please review it, @laekov.

The bug comes from the cublasXgemm wrapper for the bf16 data type in cuda/utils/cublas_wrapper.h.

https://github.com/laekov/fastmoe/blob/master/cuda/utils/cublas_wrapper.h#L125-L130
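
Until the PR is merged, one possible stopgap (a sketch of my own, not part of fastmoe or of the PR; the helper name is hypothetical) is to run MOELinear in fp32 and cast the result back to the caller's dtype:

def moe_linear_fp32_fallback(inp, fwd_expert_count, weight, bias=None):
    # Hypothetical workaround: avoid the broken bf16 cuBLAS path by computing
    # the expert GEMMs in fp32, then cast back to the input dtype.
    out = fmoe.linear.MOELinear.apply(
        inp.float(), fwd_expert_count, weight.float(),
        None if bias is None else bias.float())
    return out.to(inp.dtype)

This trades some memory and throughput for correct results and is only meant as a temporary measure until the wrapper is fixed.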