huggingface / optimum-intel

🤗 Optimum Intel: Accelerate inference with Intel optimization tools
https://huggingface.co/docs/optimum/main/en/intel/index
Apache License 2.0

Fix INC WoQ model loading issue #772

Open changwangss opened 2 weeks ago

changwangss commented 2 weeks ago

What does this PR do?

Since ITREX v1.4, WOQ-quantized models are saved in the same format as GPTQ checkpoints (e.g. https://huggingface.co/TheBloke/Llama-2-7B-Chat-GPTQ/tree/main): the quantization config is added to model.config when the model is saved, and that information is also needed at loading time. There is a subtlety in how the WOQ-quantized model is saved.

import types

from intel_extension_for_transformers.transformers.llm.quantization.utils import convert_to_quantized_model
from intel_extension_for_transformers.transformers.modeling.modeling_auto import save_low_bit

quantized_model = convert_to_quantized_model(model, quantization_config)
# At this point the quantized model's Linear layers are QuantizedLinearQBits.
quantized_model.save_pretrained = types.MethodType(save_low_bit, quantized_model)
save_directory = "./tmp/clm_output"
quantized_model.save_pretrained(save_directory)
# After saving, the model's Linear layers are replaced with WeightOnlyLinear,
# because we want to save in the same format as GPTQ checkpoints.

So if we still want to use the quantizer's quantized model after saving, we should load it back from the local directory to restore it.
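
As a rough sketch of that restore step (an illustration only, not the exact code added in this PR; the use of INCModelForCausalLM and the ./tmp/clm_output path are assumptions):

import torch
from optimum.intel import INCModelForCausalLM

# After save_pretrained, the in-memory model's Linear layers were converted to
# WeightOnlyLinear for serialization, so reload the saved checkpoint to get a
# model that is usable for inference again.
loaded_model = INCModelForCausalLM.from_pretrained("./tmp/clm_output")

with torch.no_grad():
    logits = loaded_model(torch.ones((1, 8), dtype=torch.long)).logits
print(logits.shape)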

Quick validation command:

 python run_clm.py --model_name_or_path EleutherAI/gpt-neo-125M --dataset_name wikitext  --dataset_config_name wikitext-2-raw-v1 --apply_quantization --quantization_approach weight_only --verify_loading  --output_dir ./tmp/clm_output
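
Conceptually, --verify_loading then checks that the reloaded model behaves like the in-memory quantized one. A hedged sketch of such a check (a helper written for illustration only, not the script's actual code):

import torch

def check_reloaded_woq_model(quantized_model, loaded_model, tokenizer, prompt="Hello world"):
    # Compare logits from the in-memory quantized model and the model reloaded
    # from the output directory; they should match closely if the WOQ weights
    # were restored correctly.
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        ref_logits = quantized_model(**inputs).logits
        loaded_logits = loaded_model(**inputs).logits
    return torch.allclose(ref_logits, loaded_logits, atol=1e-5)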

Fixes # (issue)

Before submitting

HuggingFaceDocBuilderDev commented 2 weeks ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.