intel / xFasterTransformer

Apache License 2.0
Can we summarize the meanings of data types like bf16_fp16? For example, what are the activation and output data types, and what is the computing instruction? #414

Open heagoo opened 2 months ago

Duyi-Wang commented 2 months ago

Sorry, our docs are still a work in progress due to ongoing code refactoring.

Mixed data types such as bf16_fp16 and bf16_int8 mean that BF16 is used for the first token, while FP16 or INT8 is used for the next tokens. The first-token phase is compute-intensive and highly sensitive to precision, so we use half precision (BF16) together with AMX to accelerate computation. Next-token generation, however, is memory-bound, so a lower-precision type is used to speed it up. We introduced bf16_fp16 because FP16 performed better than BF16 in some cases in older versions, but now, after optimization, we recommend using bf16 instead of bf16_fp16.
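
To make the naming convention concrete, here is a minimal sketch that splits a data type name into the type used for the first token and the type used for the next tokens. The helper `split_mixed_dtype` and the `PhaseDtypes` tuple are hypothetical names made up for illustration; they are not part of the xFasterTransformer API, and the sketch only covers the simple `first_next` pattern described above.

```python
from typing import NamedTuple


class PhaseDtypes(NamedTuple):
    """Hypothetical container: data type per generation phase."""
    first_token: str  # compute-bound phase (prompt processing)
    next_token: str   # memory-bound phase (token-by-token decoding)


def split_mixed_dtype(name: str) -> PhaseDtypes:
    """Split a name like 'bf16_fp16' into (first-token, next-token) types.

    Assumption: a name without an underscore (e.g. 'bf16') means the same
    type is used in both phases. This is an illustrative helper only.
    """
    parts = name.lower().split("_")
    if len(parts) == 1:
        return PhaseDtypes(parts[0], parts[0])
    if len(parts) == 2:
        return PhaseDtypes(parts[0], parts[1])
    raise ValueError(f"unexpected data type name: {name!r}")


if __name__ == "__main__":
    for dtype in ("bf16", "bf16_fp16", "bf16_int8"):
        phases = split_mixed_dtype(dtype)
        print(f"{dtype}: first token = {phases.first_token}, "
              f"next token = {phases.next_token}")
```

Under this reading, `bf16_int8` runs the first token in BF16 and the next tokens in INT8, while plain `bf16` uses BF16 throughout.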