Closed digbangbang closed 1 month ago
Due to the difference between quantization and sparsification, and in order to compare the computation reduction of different algorithms quantitatively, we use the same bit width as the baseline when calculating MACs (e.g., 32-bit floating-point operations), so that the computation savings of quantization methods can also be quantified and compared fairly.
Thanks for your reply!
My doubts concern the two experiments on SmoothQuant and OmniQuant. As I understand it, quantizing activations and weights does not affect MACs; that is, the amount of computation is not reduced, because quantization only affects CUDA memory usage and inference speed. So how is the 3.24T figure here calculated, especially for SmoothQuant and OmniQuant?
Looking forward to your reply. Since I have done similar work before, I also wanted to compare quantization and sparsification in terms of MACs. However, I eventually found that sparsification can reduce MACs (for example, structured sparsity), whereas quantization does not seem to have any effect on them.
Thank you for your interest in our work.
In model acceleration, in order to quantify the benefit of quantization and compare it with sparsification, the bit width of the floating-point operations that the MAC count relies on must be made explicit (for example, an 8-bit operation is 4 times faster than a 32-bit one). Therefore, we take the 32-bit floating-point operation as the reference unit for MAC counting, and on that basis compute the impact of both sparsification and quantization algorithms on computation reduction. I hope this answers your questions.
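The normalization described above can be sketched as follows. This is an illustrative reading of the convention, not code from the repository: a quantized MAC is counted as a fraction of a 32-bit reference MAC in proportion to its bit width, while structured sparsity scales the count by the kept-weight density (the function name, parameters, and numbers are hypothetical).

```python
def effective_macs(raw_macs: float, bit_width: int = 32,
                   density: float = 1.0, ref_bits: int = 32) -> float:
    """Bit-width-normalized MAC count, with FP32 MACs as the reference unit.

    raw_macs  -- MAC count of the dense FP32 model
    bit_width -- operand bit width after quantization (e.g. 8 for W8A8)
    density   -- fraction of weights kept after structured pruning
    ref_bits  -- reference bit width (32-bit float here)
    """
    return raw_macs * (bit_width / ref_bits) * density


# Hypothetical dense baseline of 12e12 MACs (12 TMACs):
dense = effective_macs(12e12)                    # FP32 baseline, unchanged
w8a8 = effective_macs(12e12, bit_width=8)        # 8-bit quantization -> 1/4
pruned = effective_macs(12e12, density=0.5)      # 50% structured sparsity -> 1/2
```

Under this convention, both techniques land on one comparable axis: 8-bit quantization counts as a 4x reduction in reference MACs even though the raw multiply-accumulate count is unchanged.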
Best
Thanks for your detailed reply, I understand what you mean. 😊
The experimental results in the paper are reported in MACs. In my opinion, quantization should not affect MACs: a MAC counts one multiply-accumulate operation. How can SmoothQuant reduce MACs? Moreover, many of the quantization methods in the paper show reduced MACs.