yatorho closed this issue 1 month ago
I have integrated the FP6 kernel from this repo into torchao with a user-friendly API to quantize and run a given model. You can check it out here: https://github.com/pytorch/ao/tree/main/torchao/prototype/quant_llm
The quantization logic is adapted from DeepSpeed as mentioned in #6
Thanks for the reply. It seems that torchao has not yet merged this API into the current release. I built it from source, and it worked for me.
Another small question: torchao only adopted fp6_llm's code for the `linear_forward` function. Other operations, such as the packing and repacking kernels, are all implemented at the tensor level in Python instead of directly using fp6_llm's C++ code?
Yes, packing is done in Python using PyTorch ops. With this approach, we can also support packing on CUDA tensors. We also skip the unnecessary intermediate 6-bit packing and pack directly to the 2+4-bit layout used by FP6-LLM.
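To illustrate the idea of the 2+4-bit split, here is a minimal pure-Python sketch: each 6-bit value is split into its top 2 bits and bottom 4 bits, which are packed into separate byte buffers. Note this shows only the bit-splitting concept; the real FP6-LLM layout interleaves and reorders the fragments for aligned warp-level loads, and torchao's version uses vectorized PyTorch ops rather than Python loops. The function names here are illustrative, not torchao's API.

```python
def pack_2_4(vals):
    """Split each 6-bit value into top-2 and bottom-4 bits, then pack
    the 2-bit parts four per byte and the 4-bit parts two per byte.
    (Illustrative bit order only, not FP6-LLM's exact memory layout.)"""
    assert len(vals) % 4 == 0
    hi = [(v >> 4) & 0x3 for v in vals]   # top 2 bits of each value
    lo = [v & 0xF for v in vals]          # bottom 4 bits of each value
    hi_packed = [
        (hi[i] << 6) | (hi[i + 1] << 4) | (hi[i + 2] << 2) | hi[i + 3]
        for i in range(0, len(hi), 4)
    ]
    lo_packed = [(lo[i] << 4) | lo[i + 1] for i in range(0, len(lo), 2)]
    return hi_packed, lo_packed

def unpack_2_4(hi_packed, lo_packed):
    """Inverse of pack_2_4: recombine the 2-bit and 4-bit fragments."""
    hi = []
    for b in hi_packed:
        hi += [(b >> 6) & 3, (b >> 4) & 3, (b >> 2) & 3, b & 3]
    lo = []
    for b in lo_packed:
        lo += [(b >> 4) & 0xF, b & 0xF]
    return [(h << 4) | l for h, l in zip(hi, lo)]
```

Storing the 2-bit and 4-bit fragments separately is what lets the kernel read each stream with aligned, power-of-two-width accesses, which a contiguous 6-bit packing cannot offer.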
Thank you again! It solved my problem.
It seems that the fp6_llm repo only includes the kernel `weight_matrix_dequant_fp_eXmY_cpu`, which dequantizes FP6 data to FP16 format, but it lacks a kernel to quantize FP16 data to FP6. Could you provide a kernel for quantizing pre-trained models?