Since ITREX v1.4, the WOQ quantized model is saved in the same format as https://huggingface.co/TheBloke/Llama-2-7B-Chat-GPTQ/tree/main: the quantization config is added to model.config when the model is saved, and the same information is used when the model is loaded.
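For illustration only, the GPTQ-style metadata added to model.config (and written to config.json) might look like the sketch below; the field names and values here are assumptions based on the GPTQ checkpoint format, not the exact schema ITREX writes.

```python
# Illustrative sketch of a GPTQ-style quantization config (field names/values are assumptions).
quantization_config = {
    "quant_method": "gptq",  # saved checkpoint follows the GPTQ format
    "bits": 4,               # weight precision
    "group_size": 128,       # quantization group size
    "desc_act": False,
    "sym": True,
}
```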
There is a tricky point when saving a WOQ quantized model:
```python
import types

from intel_extension_for_transformers.transformers.llm.quantization.utils import convert_to_quantized_model
from intel_extension_for_transformers.transformers.modeling.modeling_auto import save_low_bit

quantized_model = convert_to_quantized_model(model, quantization_config)
# At this point the quantized model's Linear layers are QuantizedLinearQBits.

quantized_model.save_pretrained = types.MethodType(save_low_bit, quantized_model)
quantized_model.save_pretrained("saved_dir")  # "saved_dir" is a placeholder output directory
# After save_pretrained, the quantized model's Linear layers are changed to WeightOnlyLinear,
# because we want to save in the same format as GPTQ.
```
So if we still want to use quantizer.quantized_model afterwards, we should load it back from the local directory to restore it.
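A minimal sketch of that reload, assuming the ITREX AutoModelForCausalLM can restore a locally saved low-bit checkpoint and reusing the placeholder "saved_dir" path from the snippet above:

```python
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# Restore the WOQ model from the locally saved GPTQ-style checkpoint;
# the quantization config recorded in config.json is used to rebuild the low-bit layers.
restored_model = AutoModelForCausalLM.from_pretrained("saved_dir")
```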
What does this PR do?
Since ITREX v1.4, WOQ quantized models are saved in the GPTQ-compatible format described above; the quantization config is written to model.config at save time and reused at load time. Because saving converts the model's Linear layers in place, quantizer.quantized_model must be reloaded from the local checkpoint if it is needed afterwards.
Quickly validated command:
Fixes # (issue)
Before submitting