quic / aimet

AIMET is a library that provides advanced quantization and compression techniques for trained neural network models.
https://quic.github.io/aimet-pages/index.html

Support LLM large models #2678

Open shifeiwen opened 7 months ago

shifeiwen commented 7 months ago

Motivation: LLMs are currently revolutionizing people's lives, and Qualcomm's mobile devices play an important role in them. Qualcomm has promoted online that it can achieve a decoding speed of 20 tokens/s with a 7B LLM, and I think AIMET must be involved in this. However, current mobile LLM quantization techniques are largely based on W4A16 group quantization. I would like to know when AIMET can provide an example of W4A16 group quantization for an open-source LLM model, so that I can try to reach a higher level with QNN inference.

Request: an example of W4A16 group quantization for LLM models.

Current attempt: I tried adding the dequantization from MLC-LLM as an operator in a third-party OpPackage; I can verify the CPU version, but HTP has caused me great difficulty. Between the incompleteness of the documentation and the errors reported during compilation, I am quite confused. I am still trying, and I hope AIMET can provide a similar example. Thanks.
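
For clarity, here is a minimal NumPy sketch of what I mean by W4A16 group quantization of a weight matrix. This is purely my own illustration, not AIMET or QNN code; the group size of 128 and all tensor shapes are just example values:

```python
import numpy as np

def quantize_w4_groupwise(w_fp16, group_size=128):
    """Quantize each row of w_fp16 in contiguous groups of `group_size`
    to unsigned 4-bit integers, with one scale/zero-point per group."""
    rows, cols = w_fp16.shape
    assert cols % group_size == 0
    w = w_fp16.astype(np.float32).reshape(rows, cols // group_size, group_size)
    w_min = w.min(axis=-1, keepdims=True)
    w_max = w.max(axis=-1, keepdims=True)
    scale = (w_max - w_min) / 15.0            # 4 bits -> 16 levels (0..15)
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    zero_point = np.round(-w_min / scale)
    q = np.clip(np.round(w / scale + zero_point), 0, 15).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point, orig_shape):
    return ((q.astype(np.float32) - zero_point) * scale).reshape(orig_shape).astype(np.float16)

# W4A16: only the weight is 4-bit (with per-group scale/offset); activations stay FP16.
w = np.random.randn(256, 512).astype(np.float16)   # example weight, shape (out, in)
x = np.random.randn(1, 512).astype(np.float16)     # example activation
q, s, zp = quantize_w4_groupwise(w, group_size=128)
y = x @ dequantize(q, s, zp, w.shape).T
```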

quic-hitameht commented 7 months ago

Tagging @quic-sendilk @quic-hsukumar here.

quic-mangal commented 7 months ago

@shifeiwen, can you explain what you mean by group quantization?

Are you looking for a Jupyter NB example which shows quantization simulation for a LLM model?

shifeiwen commented 7 months ago

@quic-mangal In CNNs we usually quantize convolution kernels at per-channel or per-layer (per-tensor) granularity. In LLMs, however, the main operation is matrix multiplication, where vector-granularity quantization (rows or columns of a tensor) gives more accurate results. For a matrix multiplication A*B=C, instead of the conventional per-tensor approach, each row of A and each column of B is quantized separately, the integer (INT) matmul is performed, and the result is converted back to floating point; this is per-vector quantization.

As LLM parameter counts grow, the accuracy demands on per-vector quantization keep rising, and quantizing X row by row and W column by column can no longer meet the error requirements. It is therefore now common to split the FP16 elements of each row (or column) into consecutive groups of k (k is usually a power of two; 128 and 256 are common), which is called per-group quantization. Well-known schemes of this kind include GPTQ and AWQ. I hope this explanation is clear. As far as I understand, quantization in QNN can currently only be per-channel. All of the above refers to PTQ.
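
To make the per-vector case concrete, here is a small NumPy sketch (again my own illustration, not AIMET or QNN code) of quantizing the rows of A and the columns of B symmetrically to INT8, doing the integer matmul, and rescaling the result back to floating point. The shapes and the 8-bit width are arbitrary example choices:

```python
import numpy as np

def quant_sym(x, axis, n_bits=8):
    """Symmetric quantization with one scale per row (axis=1) or per column (axis=0)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(x).max(axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)      # avoid division by zero
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

A = np.random.randn(64, 256).astype(np.float32)
B = np.random.randn(256, 128).astype(np.float32)

qA, sA = quant_sym(A, axis=1)   # one scale per row of A
qB, sB = quant_sym(B, axis=0)   # one scale per column of B

# Integer matmul accumulated in int32, then rescaled back to float.
C_int = qA.astype(np.int32) @ qB.astype(np.int32)
C_hat = C_int * (sA * sB)       # sA: (64, 1), sB: (1, 128) -> broadcasts to (64, 128)

print("max abs error vs FP32 matmul:", np.abs(C_hat - A @ B).max())
```

Per-group quantization refines this by giving each consecutive group of k elements within a row or column its own scale, instead of one scale per whole vector.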

quic-mangal commented 6 months ago

@shifeiwen, we don't support block quantization ATM. Only per-tensor and per-channel quantization are supported.
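
For reference, a W4A16-style simulation with the aimet_torch 1.x API looks roughly like the sketch below. The toy model, bit-widths, and calibration callback are placeholders, and per-channel (rather than per-tensor) quantization would be selected through the quantsim config_file argument, which is omitted here:

```python
import torch
from aimet_common.defs import QuantScheme
from aimet_torch.quantsim import QuantizationSimModel

# Toy stand-in model; a real LLM would be loaded here instead.
model = torch.nn.Sequential(torch.nn.Linear(512, 512),
                            torch.nn.ReLU(),
                            torch.nn.Linear(512, 512))
dummy_input = torch.randn(1, 512)

# 4-bit parameters, 16-bit activations (W4A16-style bit-widths).
sim = QuantizationSimModel(model,
                           dummy_input=dummy_input,
                           quant_scheme=QuantScheme.post_training_tf_enhanced,
                           default_param_bw=4,
                           default_output_bw=16)

def forward_pass(sim_model, _):
    # Calibration pass; real calibration data would replace the dummy input.
    with torch.no_grad():
        sim_model(dummy_input)

sim.compute_encodings(forward_pass, forward_pass_callback_args=None)
```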