mobiusml / hqq

Official implementation of Half-Quadratic Quantization (HQQ)
https://mobiusml.github.io/hqq_blog/
Apache License 2.0

QuantLinear new feature requests #24

Closed: Lucky-Lance closed this 7 months ago

Lucky-Lance commented 7 months ago

Hello, I am very impressed with your great work. I am not quite familiar with CUDA programming. Could you please give me instructions on how to call pack_2bit_u8 in your optimized CUDA (C++) version? I just need to pack and unpack the weights, without quantizing them. Thanks!

mobicham commented 7 months ago

Hi @Lucky-Lance! The CUDA version only supports bit-unpacking, so you can use the packing functions from https://github.com/mobiusml/hqq/blob/master/hqq/core/bitpack.py and the bit-unpacking function from the CUDA extension: https://github.com/mobiusml/hqq/blob/master/hqq/kernels/hqq_aten_cuda.cpp#L51

from hqq.core.bitpack import BitPack
import hqq_aten

W_packed   = BitPack.pack_2bit_u8(W)            # pack with the PyTorch routine in bitpack.py
W_unpacked = hqq_aten.unpack_2bit_u8(W_packed)  # unpack with the CUDA extension
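
For reference, a minimal round-trip sketch of the above, not an official example: it assumes W holds 2-bit integer values in [0, 3], that the hqq_aten extension was built with CUDA support, and that the packed tensor lives on the GPU for the CUDA unpack. The example shape and the final check are illustrative and may need adjusting to the unpack's output layout.

import torch
from hqq.core.bitpack import BitPack
import hqq_aten

# Hypothetical weight tensor: 2-bit values in [0, 3]; pack_2bit_u8 packs
# four such values into each uint8, so the packed tensor is 4x smaller.
W = torch.randint(0, 4, (1024, 1024), dtype=torch.uint8, device='cuda')

W_packed   = BitPack.pack_2bit_u8(W)            # pack via PyTorch ops
W_unpacked = hqq_aten.unpack_2bit_u8(W_packed)  # unpack via the CUDA kernel

# Round-trip check; reshape/cast in case the unpack output layout differs.
print(torch.equal(W_unpacked.reshape(W.shape).to(W.dtype), W))
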
Lucky-Lance commented 7 months ago

OK, thanks a lot!