Hi,
I followed the instructions here to compile the llama model into a .vmfb.
I specified the quantization as 4 bits and the precision as f16, and I got MLIR like:
It seems the int4 weights are dequantized to f16 and the computation (matmul) is done in f16.
Does the quantization support quantizing the f16 activations to q4/q8 and computing in q4/q8, like what llama.cpp does on CPU (the E approach in this article)?
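To make the question concrete, here is a rough numpy sketch of the two schemes I mean. The shapes and per-tensor scales are made up purely for illustration, not taken from the generated MLIR:

```python
import numpy as np

# Hypothetical shapes and per-tensor scales, for illustration only.
rng = np.random.default_rng(0)
x_f16 = rng.standard_normal((4, 64)).astype(np.float16)      # activations
w_int = rng.integers(-8, 8, size=(64, 32), dtype=np.int8)    # "q4" weights stored in int8
w_scale = np.float16(0.05)                                    # weight scale

# (1) What I see in the generated MLIR: weight-only quantization.
#     Weights are dequantized to f16 and the matmul runs in f16.
y_weight_only = x_f16 @ (w_int.astype(np.float16) * w_scale)

# (2) What I am asking about: also quantize the activations and do the
#     matmul in the integer domain, rescaling afterwards (llama.cpp style).
x_scale = np.float16(np.abs(x_f16).max() / 127)
x_int8 = np.clip(np.round(x_f16 / x_scale), -127, 127).astype(np.int8)
y_int = x_int8.astype(np.int32) @ w_int.astype(np.int32)      # int32 accumulation
y_act_quant = (y_int.astype(np.float32) * float(x_scale) * float(w_scale)).astype(np.float16)
```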
Thanks.