gaokaiz2 opened this issue 1 month ago
Hi!
Thanks for your question. Unlike weight-only quantization, which only needs to replace the weight data with its quantized counterpart (without modifying the model architecture), weight & activation (WA) quantization has to replace the whole linear module (i.e. `nn.Linear`) with our customized one (i.e. `WALinear`), because the quantization of activations must be performed on the input or output data at run time. Therefore, a WA-quantized model cannot be loaded directly with the standard `AutoModelForCausalLM.from_pretrained()` function: that function uses the standard `modeling_XXX.py` file (where XXX is the name of a model, e.g. llama) to define the model architecture, and the standard `modeling_XXX.py` has no `WALinear` module.
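For intuition, here is a minimal sketch of why the activation quantizer has to live inside the module's `forward()`. This is not our actual implementation; the class name, bit widths, and the symmetric fake-quantizer are all illustrative:

```python
import torch
import torch.nn as nn

class WALinearSketch(nn.Module):
    """Illustrative stand-in for WALinear: weights can be quantized once
    ahead of time, but activations must be (fake-)quantized on every
    forward pass, since their values only exist at run time."""

    def __init__(self, linear: nn.Linear, a_bit: int = 4, w_bit: int = 4):
        super().__init__()
        self.a_bit = a_bit
        # Weight quantization can be baked in, just like weight-only quantization.
        self.weight = nn.Parameter(self.fake_quant(linear.weight.data, w_bit))
        self.bias = linear.bias

    @staticmethod
    def fake_quant(x: torch.Tensor, bits: int) -> torch.Tensor:
        # Symmetric per-tensor fake quantization, purely for illustration.
        qmax = 2 ** (bits - 1) - 1
        scale = x.abs().max().clamp(min=1e-8) / qmax
        return (x / scale).round().clamp(-qmax - 1, qmax) * scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # This step cannot be stored in a checkpoint: it depends on the
        # activations produced at inference time.
        x = self.fake_quant(x, self.a_bit)
        return nn.functional.linear(x, self.weight, self.bias)
```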
So, if you want to load a WA-quantized model directly from a local checkpoint, you must use a customized `modeling_XXX.py` that explicitly defines the `WALinear` module in place of `nn.Linear`. Unfortunately, we don't have a customized `modeling_XXX.py` at the moment; the `WALinear` modules are added to the model dynamically during the execution of `quantize_model()`. We apologize for the inconvenience and are considering adding a customized modeling file in the future, so please stay tuned.
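Continuing the sketch above, the dynamic replacement that `quantize_model()` performs is conceptually something like the following (a rough illustration only; the real traversal, module-name matching, and constructor arguments in the repo differ). Because it walks the named modules of the loaded architecture, calling it on a model with an unexpected layout fails:

```python
import torch.nn as nn

def replace_linears_with_wa(model: nn.Module, a_bit: int = 4) -> nn.Module:
    """Swap every nn.Linear child for the WA wrapper (sketch only)."""
    # Snapshot the module list first so we don't traverse newly inserted wrappers.
    for _name, module in list(model.named_modules()):
        for child_name, child in list(module.named_children()):
            if isinstance(child, nn.Linear):
                setattr(module, child_name, WALinearSketch(child, a_bit=a_bit))
    return model
```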
To use WA quantization correctly in your code, simply load the original full-precision model and call `model = quantize_model(model, args)` to get everything ready; you can then run inference with both weights and activations quantized!
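Concretely, the intended flow looks something like this. Note that the import path for `quantize_model` and the exact fields of `args` are assumptions based on this thread (the `kv_bit`/`a_bit` names come from the question below); please check the repo for the real ones:

```python
import argparse
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed import path -- adjust to wherever quantize_model lives in this repo.
from quantization import quantize_model

# Field names are assumptions; match them to the repo's argument parser.
args = argparse.Namespace(w_bit=4, a_bit=4, kv_bit=16)

model_path = "path/to/full-precision-model"
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# quantize_model swaps nn.Linear for WALinear in place, so activation
# quantization happens at run time during inference.
model = quantize_model(model, args)

inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```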
What I've done: I have used your main function to produce weight-only quantized versions of LLMs (from local safetensors to local safetensors) and then evaluated the quantized versions with lm-evaluation-harness, which also takes a local path as the model. Everything works well, and thanks for your repo!
What I need help with: I am not sure how to correctly use your functions to obtain an activation-quantized version of an LLM.
What I've tried: 1) directly using your main function to store an activation-quantized version (this should not work because activation quantization has to happen at run time?); 2) manually changing the evaluation code so that, when the model is first loaded, I replace it with your `quantize_model(model, args)` with `kv_bit=16` and `a_bit=4`; this fails because the model doesn't have the named modules that your code expects.
May I get some help with this issue? Thanks in advance!