intel / intel-extension-for-pytorch

A Python package that extends the official PyTorch for easy performance gains on Intel platforms
Apache License 2.0

How to enable support for AWQ? #736

Open · Pradeepa99 opened this issue 5 days ago

Pradeepa99 commented 5 days ago

Describe the issue

I am trying to enable AWQ support with the IPEX repo on CPU.

The IPEX 2.5.0 release notes state that AWQ quantization is supported.

However, only GPTQ support appears to have been added in the official repo.

The script file https://github.com/intel/intel-extension-for-pytorch/blob/release/xpu/2.5.10/examples/cpu/llm/inference/utils/run_gptq.py states that it is deprecated and recommends using INC instead.

What is the correct approach to enable AWQ support with the IPEX repo?

Config used:

alexsin368 commented 3 days ago

@Pradeepa99 The release notes mention added support for the AWQ format, and this seems to refer to the usage of ipex.llm.optimize, where you can specify the quant_method as 'gptq' or 'awq' for the low_precision_checkpoint argument.

Details here: https://intel.github.io/intel-extension-for-pytorch/cpu/2.5.0+cpu/tutorials/api_doc.html#ipex.llm.optimize
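
To illustrate, here is a rough, untested sketch of what that usage could look like on CPU. The model name and checkpoint path are placeholders, and the qconfig enum names follow the IPEX weight-only-quantization examples, so please double-check everything against the API doc above for your installed version:

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM

# Example model; any supported HF causal LM should work similarly.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
)
model.eval()

# State dict produced by an AWQ quantization tool (path is a placeholder).
awq_state_dict = torch.load("awq_checkpoint.pt")

# Weight-only quantization recipe (INT4 weights); enum names as in the
# IPEX WOQ examples, which may differ across versions.
qconfig = ipex.quantization.get_weight_only_quant_qconfig_mapping(
    weight_dtype=ipex.quantization.WoqWeightDtype.INT4,
    lowp_mode=ipex.quantization.WoqLowpMode.INT8,
)

# Per the api_doc page, low_precision_checkpoint can take a config dict
# alongside the state dict; quant_method selects 'gptq' or 'awq'.
model = ipex.llm.optimize(
    model,
    dtype=torch.bfloat16,
    quantization_config=qconfig,
    low_precision_checkpoint=(awq_state_dict, {"quant_method": "awq"}),
    inplace=True,
)
```

After this, the model should run generation as usual with the quantized weights.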

Let us know if this helps put you on the right track.

Pradeepa99 commented 2 days ago

@alexsin368

Thank you for sharing this.

I have three follow-up questions I would like to clarify.

  1. I found this test case example that loads the AWQ format through the ipex.llm.optimize API. Is this the approach you meant for integrating AWQ support into ipex.llm.optimize?

  2. I found this example for GPTQ, where ipex.quantization.gptq is used to generate the GPTQ checkpoint. Is there a similar API to generate checkpoints in the AWQ format as well? (One alternative I am considering is sketched after this list.)

  3. Currently, I am following the approach from ITREX mentioned here to generate the quantized model.
    File: https://github.com/intel/intel-extension-for-transformers/blob/main/examples/huggingface/pytorch/text-generation/quantization/run_generation_cpu_woq.py - Can we quantize models with this method, or is there a specific approach we should follow instead?
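
In case it helps with question 2: since I could not find an AWQ counterpart to ipex.quantization.gptq, one alternative I am considering is generating the AWQ checkpoint with the third-party AutoAWQ package and then passing its state dict to ipex.llm.optimize. A rough, untested sketch (the model path and quant_config values are just examples):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"  # example model
quant_path = "llama-2-7b-awq"            # example output directory

# Typical AutoAWQ settings: 4-bit weights, group size 128.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Runs AWQ calibration and quantizes the weights in place.
model.quantize(tokenizer, quant_config=quant_config)

# Saves qweight/scales/qzeros tensors that a loader can consume.
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

Would this be a valid way to produce the AWQ checkpoint for IPEX, or is there a recommended path?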