This PR (https://github.com/laekov/fastmoe/pull/171) seems to fix the above bug. Please review it, @laekov.
The bug comes from the function `cublasXgemm` for the bf16 data type in `cuda/utils/cublas_wrapper.h`:
https://github.com/laekov/fastmoe/blob/master/cuda/utils/cublas_wrapper.h#L125-L130
A `const c10::BFloat16 *` cannot be directly converted to a `const float *`, and the correct CUDA data type is `CUDA_R_16BF`, not `CUDA_R_16F`.
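For reference, a minimal sketch of what a corrected bf16 overload could look like. This is not the actual fastmoe wrapper signature; it assumes fp32 alpha/beta scalars and fp32 accumulation via `cublasGemmEx`. The two key points are that the A/B/C pointers keep their bf16 type (no cast to `const float *`) and that the element type passed to cuBLAS is `CUDA_R_16BF`:

```cpp
#include <cublas_v2.h>
#include <c10/util/BFloat16.h>

// Hypothetical bf16 overload, shown only to illustrate the fix.
inline cublasStatus_t cublasXgemm(cublasHandle_t handle,
        cublasOperation_t transa, cublasOperation_t transb,
        int m, int n, int k,
        const float* alpha,                  // assumed fp32 scalars
        const c10::BFloat16* A, int lda,
        const c10::BFloat16* B, int ldb,
        const float* beta,
        c10::BFloat16* C, int ldc) {
    // Pass the bf16 pointers through as-is (implicit conversion to const void*),
    // tag them as CUDA_R_16BF (not CUDA_R_16F), and accumulate in fp32.
    return cublasGemmEx(handle, transa, transb, m, n, k,
            alpha,
            A, CUDA_R_16BF, lda,
            B, CUDA_R_16BF, ldb,
            beta,
            C, CUDA_R_16BF, ldc,
            CUBLAS_COMPUTE_32F,
            CUBLAS_GEMM_DEFAULT);
}
```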
fastmoe version: 1.0.2
In the following examples, we have 32768 input tokens and 8 experts; the load is well balanced, so each expert handles 4096 tokens.
The observation is that `out_bf16` is always zero. The outputs look like: