intel / intel-extension-for-pytorch

A Python package for extending the official PyTorch that can easily obtain performance gains on Intel platforms
Apache License 2.0

llama_int8 does not support do_sample=True #430

Open markluofd opened 1 year ago

markluofd commented 1 year ago

Describe the bug

With the demo run_llama_int8.py, after setting generate_kwargs["do_sample"] to True, I get the following error:

command:

```
python run_llama_int8.py -m ${MODEL_ID} --quantized-model-path "/workspace/saved_results/best_model.pt" --benchmark --jit --int8-bf16-mixed --num-iter 5 --prompt "hello"
```

error log:

```
/opt/conda/lib/python3.9/site-packages/transformers/generation/utils.py:1405: UserWarning: You are calling .generate() with the input_ids being on a device type different than your model's device. input_ids is on cpu, whereas the model is on meta. You may experience unexpected behaviors or slower generation. Please make sure that you have put input_ids to the correct device by calling for example input_ids = input_ids.to('meta') before running .generate().
  warnings.warn(
Traceback (most recent call last):
  File "/lzw/run_llama_int8.py", line 378, in <module>
    output = user_model.generate(
  File "/opt/conda/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/transformers/generation/utils.py", line 1485, in generate
    return self.sample(
  File "/opt/conda/lib/python3.9/site-packages/transformers/generation/utils.py", line 2524, in sample
    outputs = self(
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1522, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1531, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/intel_extension_for_pytorch/cpu/transformers/models.py", line 624, in LlamaForCausalLM_forward
    outputs = self.model(
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1522, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1531, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/intel_extension_for_pytorch/cpu/transformers/models.py", line 283, in LlamaModel_forward
    attention_mask = self._prepare_decoder_attention_mask(
  File "/opt/conda/lib/python3.9/site-packages/intel_extension_for_pytorch/cpu/transformers/attentions.py", line 65, in _prepare_decoder_attention_mask
    combined_attention_mask = _make_causal_mask(
  File "/opt/conda/lib/python3.9/site-packages/intel_extension_for_pytorch/cpu/transformers/attentions.py", line 18, in _make_causal_mask
    mask = torch.full(
NotImplementedError: Could not run 'aten::_local_scalar_dense' with arguments from the 'Meta' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::_local_scalar_dense' is only available for these backends: [CPU, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradHIP, AutogradXLA, AutogradMPS, AutogradIPU, AutogradXPU, AutogradHPU, AutogradVE, AutogradLazy, AutogradMTIA, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, AutogradMeta, AutogradNestedTensor, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PreDispatch, PythonDispatcher].
```
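For what it's worth, the final NotImplementedError reproduces outside IPEX: aten::_local_scalar_dense is the op that Tensor.item() dispatches to, and it cannot run on the meta device because meta tensors carry only shape/dtype metadata and no storage. The UserWarning above ("the model is on meta") suggests the sampling path ends up building the causal mask from meta tensors. A minimal sketch in plain PyTorch:

```python
import torch

# A meta tensor has shape/dtype but no storage, so any op that must
# read a concrete scalar value cannot run on it.
t = torch.empty((), device="meta")  # 0-dim tensor on the meta device

try:
    t.item()  # dispatches to aten::_local_scalar_dense
except NotImplementedError as err:
    print(err)  # Could not run 'aten::_local_scalar_dense' ... 'Meta' backend
```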

do_sample is an important feature for me.
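For context on the flag: with do_sample=False, generate() decodes greedily and deterministically; with do_sample=True, each next token is drawn from the model's predicted distribution, usually shaped by temperature / top_p. A plain-transformers sketch without the IPEX int8 path, reusing the same MODEL_ID as in the command above; max_new_tokens and the sampling knobs are illustrative values, not the demo's:

```python
import os
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = os.environ["MODEL_ID"]  # same checkpoint as ${MODEL_ID} above
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

ids = tok("hello", return_tensors="pt").input_ids
# Greedy decoding (the default): deterministic.
greedy = model.generate(ids, max_new_tokens=32, do_sample=False)
# Sampling: stochastic; this is the mode that fails under the
# quantized/JIT path reported in this issue.
sampled = model.generate(ids, max_new_tokens=32, do_sample=True,
                         temperature=0.8, top_p=0.95)
print(tok.decode(sampled[0], skip_special_tokens=True))
```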

Versions

```
[pip3] intel-extension-for-pytorch==2.1.0.dev0+cpu.llm
[pip3] numpy==1.24.1
[pip3] torch==2.1.0.dev20230711+cpu
[pip3] torchaudio==2.1.0.dev20230711+cpu
[pip3] torchvision==0.16.0.dev20230711+cpu
[conda] intel-extension-for-pytorch  2.1.0.dev0+cpu.llm      pypi_0  pypi
[conda] numpy                        1.24.1                  pypi_0  pypi
[conda] torch                        2.1.0.dev20230711+cpu   pypi_0  pypi
[conda] torchaudio                   2.1.0.dev20230711+cpu   pypi_0  pypi
[conda] torchvision                  0.16.0.dev20230711+cpu  pypi_0  pypi
```

jingxu10 commented 11 months ago

@jianan-gu