We did not implement the kernel ourselves; we just adopted llama.cpp, which supports W4A4 and which we use to test the latency.
I see. But I still cannot find any reference showing that llama.cpp supports W4A4 operations. As far as I know, llama.cpp supports weight-only quantization. In your case, is the operation really computed in W4A4, or are the values packed/unpacked to FP16 or INT8 for computation?
The kernel is designed for W4A4; you can refer to the quantization part of llama.cpp in detail, which contains the detailed kernel design.
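For context, this is roughly how llama.cpp stores 4-bit weights (a sketch of the ggml q4_0 block; the exact struct and field names may differ across versions): each group of 32 weights keeps one fp16 scale plus 16 bytes of packed nibbles.

```c
#include <stdint.h>

/* Sketch of llama.cpp's q4_0 weight block (names approximate; they may
 * differ across ggml versions): 32 weights are stored as one fp16 scale
 * plus 16 bytes, i.e. two 4-bit quants per byte. */
#define QK4_0 32

typedef struct {
    uint16_t d;              /* per-block scale, stored as fp16 bits */
    uint8_t  qs[QK4_0 / 2];  /* 32 quants packed two per byte (low/high nibble) */
} block_q4_0;
/* => 2 + 16 = 18 bytes per 32 weights, roughly 4.5 bits per weight
 * including the scale, which is where the memory saving comes from. */
```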
As far as I know, there is no INT4 operation support on ARM CPUs/GPUs. Even the 4-bit data type is not supported on them... You may refer to the ARM instruction sets.
As you know, there is no 4-bit x 4-bit operation (we mention in the paper that we use uniform 8-bit x 8-bit operations); the W4A4 in the paper is mainly used to show the accuracy.
As for the latency, we rely on llama.cpp for acceleration because its quantization kernel saves a lot of time by minimizing bit wastage, which addresses the memory constraints on edge devices.
Such storage alignment is a typical practice of "trading time for space": when weights need to be loaded into registers, bit-shift operations are used first to extract the individual weights.
Check section 4.5 and section 5.2 for more details.
In the end, I strongly recommend you look into the q4-related code in llama.cpp and focus on its memory utilization.
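To make the "trading time for space" point concrete, here is a minimal scalar sketch (my own illustration with a hypothetical helper name, not the actual llama.cpp kernel, which is SIMD-optimized and handles per-block scales) of how packed 4-bit weights are extracted with shifts and masks and then multiplied as 8-bit integers:

```c
#include <stdint.h>
#include <stddef.h>

/* Minimal scalar sketch: weights are stored 4-bit, two per byte, but the
 * arithmetic is done on 8-bit values after unpacking, matching the
 * "store in 4 bits, compute with uniform 8-bit x 8-bit" scheme above. */
int32_t dot_w4_a8(const uint8_t *w_packed, const int8_t *act, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i += 2) {
        /* bit-shift/mask to extract the two 4-bit weights from one byte */
        int8_t w0 = (int8_t)(w_packed[i / 2] & 0x0F) - 8;  /* low nibble  */
        int8_t w1 = (int8_t)(w_packed[i / 2] >> 4) - 8;    /* high nibble */

        /* the multiply-accumulate itself runs at 8-bit precision */
        acc += (int32_t)w0 * act[i];
        acc += (int32_t)w1 * act[i + 1];
    }
    return acc;  /* the caller applies the per-block scale afterwards */
}
```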
Hope that helps.
If you have any other questions, let me know.
Hello brother, if you want to utilize real W4A4 operations: on the RISC-V architecture, you could refer to the paper "Sparq: A Custom RISC-V Vector Processor for Efficient Sub-Byte Quantized Inference"; on Armv8 platforms, you could refer to the paper "ULPPACK: Fast Sub-8-bit Matrix Multiply on Commodity SIMD Hardware".
Both follow the idea of packing multiple operands into a single register, a compromise for the current lack of native instruction support for sub-byte computation. On Armv9 or later, the ISA provides unified matrix-multiplication instructions, which leave no room for such operand packing. We have already explored the possibilities of sub-byte support on the latest ISAs, but it seems we still need to wait for ISA updates.
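To illustrate the operand-packing idea from those papers, here is a toy scalar example (my own sketch, not code from either paper): two 4-bit weights are packed into one integer with enough guard bits that a single multiply yields both partial products, which is the general trick when the ISA has no native sub-byte multiply.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Toy illustration of sub-byte operand packing (SWAR), in the spirit of
 * ULPPACK/Sparq: two unsigned 4-bit weights are packed into one word with
 * 8-bit spacing, so a single multiply by a 4-bit activation yields both
 * partial products. Real kernels do this inside wide SIMD registers and
 * also pack the accumulation; this only shows the core trick. */
int main(void)
{
    uint32_t w0 = 9, w1 = 5;   /* two 4-bit weights (0..15)    */
    uint32_t a  = 13;          /* one 4-bit activation (0..15) */

    uint32_t packed_w = w0 | (w1 << 8);

    /* one multiply computes both products; the max product 15*15 = 225 < 256,
     * so the two results never overflow into each other's 8-bit field */
    uint32_t prod = packed_w * a;

    uint32_t p0 = prod & 0xFF;         /* = w0 * a */
    uint32_t p1 = (prod >> 8) & 0xFF;  /* = w1 * a */

    assert(p0 == w0 * a && p1 == w1 * a);
    printf("w0*a = %u, w1*a = %u\n", (unsigned)p0, (unsigned)p1);
    return 0;
}
```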
I noticed that you are currently a PhD candidate at ZJU; I graduated from the EE department of ZJU in 2020. Nice to meet you, bro :)) I miss my life in Hangzhou.
Hi there! Thanks for sharing. I hope you're enjoying your PhD journey at NU.
Thanks for your response.
Thanks for your work! But does ARM CPU hardware support W4A4 computation? Can you share a reference or your implementation kernel for ARM CPUs?