Closed by LeiZhou-97 2 months ago
Sorry, the docs are still a WIP due to ongoing code refactoring.
A mixed data type such as `bf16_int4` means the BF16 format is used for the first token (prefill), while INT4 is used for the subsequent tokens (decode). The first token is compute-intensive and highly sensitive to precision, so we use half precision together with AMX to accelerate the computation; generating the next tokens is memory-bound, so a lower precision is employed to speed that up.
`W8A8` is also an INT8 format, but one that is used during computation: both the weights and the activations are INT8 in the matmul. The plain `int8` type merely indicates that the weights in memory are INT8, while activations stay in higher precision. Therefore `W8A8` runs faster than `int8`, but with lower precision.
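The distinction above can be sketched in a few lines of NumPy (again illustrative, not the project's kernels): W8A8 multiplies int8 operands and accumulates in int32, rescaling once at the end, whereas weight-only INT8 dequantizes the weights back to float and does a full-precision matmul.

```python
import numpy as np

def quant8(t):
    """Symmetric per-tensor quantization to INT8."""
    scale = np.abs(t).max() / 127.0
    return np.clip(np.round(t / scale), -127, 127).astype(np.int8), scale

rng = np.random.default_rng(1)
w = rng.standard_normal((8, 16)).astype(np.float32)
x = rng.standard_normal(16).astype(np.float32)

qw, sw = quant8(w)
qx, sx = quant8(x)

# W8A8: int8 x int8 matmul with int32 accumulation, one rescale at the end.
y_w8a8 = (qw.astype(np.int32) @ qx.astype(np.int32)) * (sw * sx)

# Weight-only int8: dequantize weights, then a float matmul on float activations.
y_wonly = (qw.astype(np.float32) * sw) @ x

y_ref = w @ x
print("W8A8 max err:", np.abs(y_w8a8 - y_ref).max())
print("weight-only max err:", np.abs(y_wonly - y_ref).max())
```

The speed advantage of W8A8 comes from the int8 inner loop mapping onto wide integer instructions (e.g. VNNI on recent Intel CPUs), while weight-only INT8 only saves memory bandwidth and still pays for a float matmul.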
So do you recommend using `bf16_w8a8` or `bf16_int4` for CPU inference?
I tried inference with llama2-7b in the `bf16` data type, and it easily reaches the CPU's max TDP and causes downclocking.
It depends on your workload. You can choose whichever data type meets your accuracy requirements. Any of these data types can hit the TDP and trigger frequency reduction, since LLM inference is a fairly heavy workload.
Thank you for your detailed explanation!😊
In addition to `bf16_int4`, I also saw some other data types, such as `bf16_w8a8`. What do these mean? I can't find any docs explaining them.