Support loading 4 bit Qwen2

mengniwang95 commented 2 weeks ago

Support loading 4-bit quantized Qwen2 models.

Error log:
  File "/home/mewang/workspace/optimum-habana/examples/text-generation/run_generation.py", line 758, in <module>
    main()
  File "/home/mewang/workspace/optimum-habana/examples/text-generation/run_generation.py", line 523, in main
    generate(None, args.reduce_recompile)
  File "/home/mewang/workspace/optimum-habana/examples/text-generation/run_generation.py", line 494, in generate
    outputs = model.generate(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/mewang/workspace/optimum-habana/optimum/habana/transformers/generation/utils.py", line 1417, in generate
    result = self._sample(
  File "/home/mewang/workspace/optimum-habana/optimum/habana/transformers/generation/utils.py", line 2396, in _sample
    outputs = self(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1556, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1565, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 726, in forward
    return wrapped_hpugraph_forward(
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 599, in wrapped_hpugraph_forward
    outputs = orig_fwd(*args, **kwargs)
  File "/home/mewang/workspace/optimum-habana/optimum/habana/transformers/models/qwen2/modeling_qwen2.py", line 793, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1556, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1606, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/mewang/workspace/optimum-habana/optimum/habana/transformers/models/qwen2/modeling_qwen2.py", line 699, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1556, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1606, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/mewang/workspace/optimum-habana/optimum/habana/transformers/models/qwen2/modeling_qwen2.py", line 467, in forward
    hidden_states, self_attn_weights, present_key_value = self.pre_attn(
  File "/home/mewang/workspace/optimum-habana/optimum/habana/transformers/models/qwen2/modeling_qwen2.py", line 518, in pre_attn
    hidden_states, attn_weights, present_key_value = self.self_attn.pre_attn_forward(
  File "/home/mewang/workspace/optimum-habana/optimum/habana/transformers/models/qwen2/modeling_qwen2.py", line 319, in pre_attn_forward
    past_key = torch.zeros(key_states.shape, dtype=self.k_proj.weight.dtype, device=key_states.device)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1732, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'HPUWeightOnlyLinear' object has no attribute 'weight'. Did you mean: 'qweight'?

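The failure comes from reading `self.k_proj.weight.dtype` on a weight-only quantized layer, which exposes `qweight` rather than `weight`. Below is a minimal sketch of one way to pick the KV-cache dtype without assuming `weight` exists. It is not necessarily the change made in this PR. The classes `FakeTensor`, `DenseLinear`, and `PackedLinear` and the helper `cache_dtype` are hypothetical stand-ins so the snippet runs without torch or Habana libraries, and the dtype labels are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a tensor: it only carries a dtype label."""
    dtype: str


class DenseLinear:
    """Hypothetical stand-in for a regular linear layer, which has `weight`."""

    def __init__(self) -> None:
        self.weight = FakeTensor("float32")


class PackedLinear:
    """Hypothetical stand-in for a weight-only quantized layer.

    Like the layer in the traceback, it has `qweight` but no `weight`.
    The int32 label is an assumption about how packed weights are stored.
    """

    def __init__(self) -> None:
        self.qweight = FakeTensor("int32")


def cache_dtype(proj: Any, activations: Any) -> Any:
    """Return the projection's weight dtype if present, else the activation dtype.

    Packed integer weights would not be a sensible cache dtype, so the
    fallback uses the activations instead of `qweight`.
    """
    weight = getattr(proj, "weight", None)
    if weight is not None:
        return weight.dtype
    return activations.dtype


key_states = FakeTensor("bfloat16")
dense_dtype = cache_dtype(DenseLinear(), key_states)    # uses weight.dtype
packed_dtype = cache_dtype(PackedLinear(), key_states)  # falls back to key_states.dtype
print(dense_dtype, packed_dtype)
```

In the real model code the equivalent call would pass the actual `key_states` tensor, so the zero-initialized cache would match the activations when the layer is quantized.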
hshen14 commented 1 week ago

@libinta @sywangyi please review. Thx

jiminha commented 1 week ago

LGTM. Could you also share the model that you tested for this? (GPTQ quantized qwen model)

mengniwang95 commented 1 week ago

> LGTM. Could you also share the model that you tested for this? (GPTQ quantized qwen model)

I generated the quantized Qwen2 model by following the example at this link: https://github.com/intel/neural-compressor/tree/master/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/weight_only#quantization-cpu--hpu

github-actions[bot] commented 1 week ago

The code quality check failed, please run `make style`.

HuggingFaceDocBuilderDev commented 1 week ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

huggingface / optimum-habana

Support loading 4 bit Qwen2 #1476