usyd-fsalab / fp6_llm

Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5).
Apache License 2.0

Does the repo provide a quantization kernel? #10

Closed: yatorho closed this issue 1 month ago

yatorho commented 1 month ago

It seems that the fp6_llm repo only includes the kernel weight_matrix_dequant_fp_eXmY_cpu, which dequantizes fp6 data to fp16 format, but it lacks a kernel to quantize fp16 data to fp6. Could you provide a kernel for quantizing pre-trained models?

gau-nernst commented 1 month ago

I have integrated the FP6 kernel from this repo into torchao with a user-friendly API to quantize and run a given model. You can check it out here: https://github.com/pytorch/ao/tree/main/torchao/prototype/quant_llm

The quantization logic is adapted from DeepSpeed, as mentioned in #6.
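
(Editor's note, purely for illustration and not the DeepSpeed-derived code that torchao actually uses: numerically, an fp16 to FP6 cast amounts to rounding each value to the nearest point on the E3M2 grid. The sketch below assumes the E3M2 variant with exponent bias 3 and no inf/nan, keeps the result stored in fp16, and omits the per-channel scaling and bit-level encoding a real quantizer needs; the function name is made up.)

```python
import torch

def quantize_fp16_to_fp6_e3m2(x: torch.Tensor) -> torch.Tensor:
    """Round each fp16 value to the nearest FP6 (E3M2) value and return it
    still stored in fp16. Assumed format: 1 sign bit, 3 exponent bits
    (bias 3), 2 mantissa bits, no inf/nan; max normal = 28, min normal = 0.25."""
    x = x.float()
    sign = torch.sign(x)
    mag = x.abs().clamp(max=28.0)                  # saturate to the FP6 range
    # exponent, clamped so values below 0.25 land in the subnormal bin
    exp = torch.floor(torch.log2(mag.clamp(min=1e-30))).clamp(-2.0, 4.0)
    step = torch.exp2(exp - 2)                     # spacing of FP6 values at this exponent
    q = torch.round(mag / step) * step             # round to nearest representable value
    return (sign * q).half()
```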

yatorho commented 1 month ago

Thanks for the reply. It seems that torchao has not yet merged this API into the current release. I built it from source, and it worked for me. Another small question: torchao only adopted fp6_llm's code for the linear_forward function. Other operations, such as packing and repacking, are all implemented with tensor-level operations in Python instead of using fp6_llm's C++ code directly?

gau-nernst commented 1 month ago

Yes, packing is done in Python using PyTorch ops. With this approach we can support CUDA tensors. We also skip the unnecessary intermediate 6-bit packing and pack directly to the 2+4-bit layout used by FP6-LLM.
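
(Editor's note, for illustration only: the 2+4-bit split means each 6-bit value is stored as a 2-bit plane plus a 4-bit plane rather than as contiguous 6-bit fields. A minimal sketch of that split in plain PyTorch is below; it ignores the tensor-core-specific bit interleaving that the real FP6-LLM pre-packing performs, and the function name is made up.)

```python
import torch

def split_2_4(fp6_vals: torch.Tensor):
    """Illustrative sketch: split unpacked 6-bit values (uint8, one value per
    byte, range 0..63) into a 2-bit plane and a 4-bit plane, then bit-pack
    each plane. The real FP6-LLM layout additionally interleaves bits to
    match the tensor-core fragment layout, which this sketch omits."""
    assert fp6_vals.dtype == torch.uint8
    assert fp6_vals.numel() % 4 == 0  # simplifying assumption for the reshape below

    hi2 = (fp6_vals >> 4) & 0b11      # top 2 bits of each 6-bit value
    lo4 = fp6_vals & 0b1111           # bottom 4 bits

    # pack four 2-bit fields into one byte
    hi2 = hi2.reshape(-1, 4)
    packed_2bit = (hi2[:, 0] << 6) | (hi2[:, 1] << 4) | (hi2[:, 2] << 2) | hi2[:, 3]

    # pack two 4-bit fields into one byte
    lo4 = lo4.reshape(-1, 2)
    packed_4bit = (lo4[:, 0] << 4) | lo4[:, 1]

    return packed_2bit.to(torch.uint8), packed_4bit.to(torch.uint8)
```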

yatorho commented 1 month ago

Thank you again! It solved my problem.