thu-nics / qllm-eval

Code Repository of Evaluating Quantized Large Language Models

How to use Activation Quantization? #7

Open gaokaiz2 opened 1 month ago

gaokaiz2 commented 1 month ago

What I've done: I have used your main function to get weight-only quantized versions of LLMs (from local LLM safetensors to local LLM safetensors) and then used the quantized versions for evaluation with lm-evaluation-harness, which also takes a local path as the model. All works well, and thanks for your repo!

What I need help with: I am not sure how to correctly use your functions to get an activation-quantized version of an LLM.

What I've tried:
1) Directly using your main function to store an activation-quantized version (this should not work, because activation quantization has to happen at run time?)
2) Manually changing the evaluation code so that, when the model is first loaded, I replace it with your quantize_model(model, args) with kv_bit=16 and a_bit=4; this fails because the models there don't have the named_modules that your code relies on.

May I get help with this issue? thx in advance!~

wln20 commented 1 month ago

Hi!

Thanks for your question. Unlike weight-only quantization, which only needs to replace the weight data with its quantized counterpart (without modifying the model architecture), weight & activation (WA) quantization has to replace the whole linear module (i.e. nn.Linear) with our customized one (i.e. WALinear), because the quantization of activations must be performed on the input or output data at run time. Therefore, a WA-quantized model cannot be loaded directly with the standard AutoModelForCausalLM.from_pretrained() function: that function uses the standard modeling_XXX.py file (XXX is the name of a certain model, e.g. llama) to define the model architecture, and the standard modeling_XXX.py doesn't define a WALinear module.
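
For intuition, an activation-quantizing linear layer conceptually looks like the sketch below. This is an illustrative stand-in only, not the actual WALinear from this repo; the fake_quantize helper, the ActQuantLinear class, and their arguments are assumptions made for the example.

```python
import torch
import torch.nn as nn

def fake_quantize(x: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    # Per-tensor symmetric fake quantization (illustrative, not the repo's scheme):
    # quantize then dequantize, so the tensor keeps its dtype but carries the
    # quantization error.
    q_max = 2 ** (n_bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / q_max
    return (x / scale).round().clamp(-q_max - 1, q_max) * scale

class ActQuantLinear(nn.Module):
    """Illustrative wrapper that quantizes the input activation at run time
    before the (possibly already weight-quantized) linear projection."""
    def __init__(self, linear: nn.Linear, a_bit: int = 4):
        super().__init__()
        self.linear = linear
        self.a_bit = a_bit

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(fake_quantize(x, self.a_bit))
```

Because modules like this only exist after the model has been patched in memory, a checkpoint saved from such a model cannot be re-loaded through the stock modeling_XXX.py.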

So, if you want to directly load a WA-quantized model from a local checkpoint, a customized modeling_XXX.py that explicitly defines the WALinear module in place of nn.Linear must be used. Unfortunately, we don't have such a customized modeling_XXX.py yet; the WALinear modules are added to the model dynamically during the execution of quantize_model(). We apologize for the inconvenience and are considering adding a customized modeling file in the future, so please stay tuned.

To correctly use WA quantization in your code, simply load the original full-precision model and call model = quantize_model(model, args) to get everything ready; then you can use the model for inference with both weights and activations quantized!
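
A minimal sketch of that flow is shown below. The import path of quantize_model and the argument names (w_bit, a_bit, kv_bit) are assumptions based on how the main script is used in this thread, so adjust them to match your local copy of the repo.

```python
import argparse
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Adjust this import to wherever quantize_model lives in your checkout of the
# repo; the path below is an assumption, not verified against the codebase.
from qllm_eval.quantization.quant_wrapper import quantize_model

model_path = "path/to/full_precision_model"  # hypothetical local checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)

# Quantization settings mirroring the CLI flags discussed above (names assumed;
# your quantize_model may expect additional fields such as group sizes).
args = argparse.Namespace(w_bit=8, a_bit=4, kv_bit=16)

# Patches the model in memory: nn.Linear modules are swapped for WALinear,
# so activations are quantized on the fly at inference time.
model = quantize_model(model, args)

inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Since the WA-quantized modules only exist in memory, for lm-evaluation-harness it is easiest to load and patch the model yourself and then hand the resulting object to the harness (recent harness versions accept an already-instantiated Hugging Face model), rather than pointing the harness at a saved WA checkpoint.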