intel / xFasterTransformer

Apache License 2.0

What's the meaning of bf16_int4 datatype? #395

Closed LeiZhou-97 closed 2 months ago

LeiZhou-97 commented 2 months ago

In addition to bf16_int4, I also saw some other data types, such as BF16_W8A8. What do these mean? I don't see any docs explaining this.

Duyi-Wang commented 2 months ago

Sorry, the docs are still WIP due to ongoing code refactoring.

A mixed data type such as bf16_int4 means the BF16 format is used for the 1st token (prefill), while int4 is used for the following tokens (decode). The 1st token is compute-intensive and highly sensitive to precision, so we use half precision together with AMX to accelerate the computation. The next tokens are memory-bound, so a lower precision is employed to speed them up.
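As a rough illustration (this is a plain-Python sketch of the idea, not xFasterTransformer's actual kernels, and it uses regular floats in place of BF16), the same layer can keep a full-precision weight copy for the compute-bound 1st token and a smaller int4-quantized copy for the memory-bound decode steps:

```python
# Sketch of the bf16_int4 idea: full-precision weights for prefill,
# symmetrically int4-quantized weights (range [-7, 7]) for decode.
# NOT xFasterTransformer code; plain floats stand in for BF16.

def quantize_int4(weights):
    """Symmetric per-tensor int4 quantization."""
    scale = max(abs(w) for w in weights) / 7.0
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int4(q, scale):
    return [v * scale for v in q]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

weights = [0.8, -1.4, 0.05, 0.6]
x = [1.0, 0.5, -2.0, 0.25]

# 1st token: compute in full precision (BF16 + AMX in xFT)
y_prefill = dot(weights, x)

# next tokens: int4 weights, dequantized on the fly; less memory
# traffic per step, but small weights can be rounded away
q, scale = quantize_int4(weights)
y_decode = dot(dequantize_int4(q, scale), x)

print(y_prefill, y_decode)
```

The decode result differs slightly from the prefill result because of int4 rounding (here the 0.05 weight quantizes to zero), which is the accuracy/speed trade-off being described.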

W8A8 is also a form of Int8, but both the weights and the activations are Int8 during computation, while the plain "int8" type merely means the weights in memory are Int8 (activations remain in higher precision). Therefore, W8A8 runs faster than Int8, but with lower precision.
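A minimal sketch of the contrast (again illustrative Python, not xFT's actual kernels, and with per-tensor symmetric scales chosen here for simplicity): weight-only int8 dequantizes the weights and computes in float, while W8A8 quantizes the activations too so the dot product can accumulate in integers, which is what lets fast int8 hardware paths be used:

```python
# Weight-only "int8" vs "W8A8", sketched with per-tensor symmetric
# quantization. NOT xFasterTransformer code.

def quantize_int8(values):
    scale = max(abs(v) for v in values) / 127.0
    return [round(v / scale) for v in values], scale

weights = [0.5, -1.2, 0.9, 0.3]
acts = [1.1, 0.4, -0.7, 2.0]

qw, w_scale = quantize_int8(weights)

# "int8" (weight-only): dequantize weights, compute in float, so only
# the weight quantization error is incurred
y_int8 = sum((q * w_scale) * a for q, a in zip(qw, acts))

# W8A8: quantize activations as well, accumulate as pure integers,
# then rescale once; adds activation quantization error but the inner
# loop is integer math (the fast path on int8 hardware)
qa, a_scale = quantize_int8(acts)
acc = sum(q * p for q, p in zip(qw, qa))  # integer dot product
y_w8a8 = acc * w_scale * a_scale

y_ref = sum(w * a for w, a in zip(weights, acts))
print(y_ref, y_int8, y_w8a8)
```

Running this shows both approximations land near the float reference, with W8A8 a bit further off, matching the "faster but lower precision" description above.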

LeiZhou-97 commented 2 months ago

So do you recommend using bf16_w8a8/bf16_int4 for CPU inference?

I tried inference with llama2-7b in the bf16 data type; it seems to easily reach the max CPU TDP and cause downclocking.

Duyi-Wang commented 2 months ago

It depends on your workload. You can choose an appropriate data type as long as it meets your accuracy requirements. Any of the data types can reach the TDP and trigger frequency reduction, since LLM inference is a fairly heavy workload.

LeiZhou-97 commented 2 months ago

Thank you for your detailed explanation!😊