Describe the bug
There are issues converting the Llama 3 model to int4 quantization and running the int8-quantized model.

Expected behavior
The int4 export should complete without errors, and the int8-quantized model should run successfully.

Screenshots
Here's the error log from the int4 conversion.

Export command:
optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B-Instruct --task text-generation-with-past --weight-format int4 --group-size 128 --ratio 0.8 --sym --awq --dataset wikitext2 --num-samples 128 llama-3-8b-instruct/INT4_compressed_weights
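
For reference, a roughly equivalent export through the optimum-intel Python API is sketched below. This is a minimal sketch for illustration only, not the command actually used; the AWQ-related parameter name is an assumption mirroring the CLI's --awq flag and may differ between optimum-intel versions. The log from the failing CLI export follows the sketch.

# Hypothetical Python-API equivalent of the failing CLI export (sketch only).
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

quantization_config = OVWeightQuantizationConfig(
    bits=4,
    sym=True,
    group_size=128,
    ratio=0.8,
    dataset="wikitext2",
    num_samples=128,
    quant_method="awq",  # assumption: mirrors the CLI --awq flag; name may vary by version
)

model = OVModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    export=True,
    quantization_config=quantization_config,
)
model.save_pretrained("llama-3-8b-instruct/INT4_compressed_weights")
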
Framework not specified. Using pt to export the model.
Loading checkpoint shards: 100%|██████████████████| 4/4 [00:01<00:00, 3.36it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Using framework PyTorch: 2.3.1+cpu
Overriding 1 configuration item(s)
use_cache -> True
/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/exporters/openvino/model_patcher.py:452: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if sequence_length != 1:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Statistics collection ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/128 • 0:00:00 • -:--:--
Traceback (most recent call last):
File "/home/intel/Flex/jason/ov_nb_env/bin/optimum-cli", line 8, in <module>
sys.exit(main())
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/commands/optimum_cli.py", line 163, in main
service.run()
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/commands/export/openvino.py", line 345, in run
model = OVModelForCausalLM.from_pretrained(
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/modeling_base.py", line 402, in from_pretrained
return from_pretrained_method(
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/intel/openvino/modeling_decoder.py", line 301, in _from_transformers
return cls._from_pretrained(
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/intel/openvino/modeling_decoder.py", line 815, in _from_pretrained
quantizer.quantize(ov_config=OVConfig(quantization_config=quantization_config_copy))
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/intel/openvino/quantization.py", line 295, in quantize
self._quantize_ovbasemodel(
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/intel/openvino/quantization.py", line 411, in _quantize_ovbasemodel
_weight_only_quantization(self.model.model, quantization_config, calibration_dataset)
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/intel/openvino/quantization.py", line 824, in _weight_only_quantization
return nncf.compress_weights(
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/nncf/quantization/quantize_model.py", line 522, in compress_weights
return compression_weights_impl(
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/nncf/openvino/quantization/quantize_model.py", line 461, in compress_weights_impl
return compression_algorithm.apply(model, graph, dataset=dataset)
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/nncf/quantization/algorithms/weight_compression/algorithm.py", line 305, in apply
activations = self._get_activations(dataset, self._subset_size, nodes_to_compress, graph, model)
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/nncf/quantization/algorithms/weight_compression/algorithm.py", line 523, in _get_activations
statistics_aggregator.collect_statistics(model, graph)
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/nncf/openvino/statistics/aggregator.py", line 36, in collect_statistics
super().collect_statistics(model, graph)
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/nncf/common/tensor_statistics/aggregator.py", line 78, in collect_statistics
outputs = engine.infer(input_data)
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/nncf/openvino/engine.py", line 85, in infer
return self.engine.infer(input_data)
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/nncf/openvino/engine.py", line 48, in infer
model_outputs = self.infer_request.infer(input_data, share_inputs=True)
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/openvino/runtime/ie_api.py", line 132, in infer
return OVDict(super().infer(_data_dispatch(
RuntimeError: Exception from src/inference/src/cpp/infer_request.cpp:223:
Exception from src/plugins/intel_cpu/src/graph.cpp:1367:
Node module.model.layers.0.self_attn/aten::scaled_dot_product_attention/ScaledDotProductAttention of type ScaledDotProductAttentionWithKVCache
Check 'm_k_state && m_v_state' failed at src/plugins/intel_cpu/src/nodes/scaled_attn.cpp:972:
ScaledDotProductAttentionWithKVCache node with name 'module.model.layers.0.self_attn/aten::scaled_dot_product_attention/ScaledDotProductAttention' has null input states

Here's the error log for running the int8 model.
Export command:
optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B-Instruct --task text-generation-with-past --weight-format int4 --group-size 128 --ratio 0.8 --sym --awq --dataset wikitext2 --num-samples 128 llama-3-8b-instruct/INT4_compressed_weights
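
For context, a minimal sketch of how an int8-exported model is typically loaded and run with optimum-intel is shown below; the output directory name and prompt are placeholders rather than values taken from this report. The error log follows the sketch.

# Hypothetical reproduction of "running the int8 model" with optimum-intel (sketch only).
from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM

model_dir = "llama-3-8b-instruct/INT8_compressed_weights"  # placeholder path, not from the report
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = OVModelForCausalLM.from_pretrained(model_dir, device="CPU")

inputs = tokenizer("Hello, how are you?", return_tensors="pt")  # placeholder prompt
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
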
Framework not specified. Using pt to export the model.
Loading checkpoint shards: 100%|██████████████████| 4/4 [00:01<00:00, 3.36it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Using framework PyTorch: 2.3.1+cpu
Overriding 1 configuration item(s)
use_cache -> True
/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/exporters/openvino/model_patcher.py:452: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if sequence_length != 1:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Statistics collection ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/128 • 0:00:00 • -:--:--
Traceback (most recent call last):
File "/home/intel/Flex/jason/ov_nb_env/bin/optimum-cli", line 8, in <module>
sys.exit(main())
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/commands/optimum_cli.py", line 163, in main
service.run()
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/commands/export/openvino.py", line 345, in run
model = OVModelForCausalLM.from_pretrained(
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/modeling_base.py", line 402, in from_pretrained
return from_pretrained_method(
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/intel/openvino/modeling_decoder.py", line 301, in _from_transformers
return cls._from_pretrained(
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/intel/openvino/modeling_decoder.py", line 815, in _from_pretrained
quantizer.quantize(ov_config=OVConfig(quantization_config=quantization_config_copy))
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/intel/openvino/quantization.py", line 295, in quantize
self._quantize_ovbasemodel(
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/intel/openvino/quantization.py", line 411, in _quantize_ovbasemodel
_weight_only_quantization(self.model.model, quantization_config, calibration_dataset)
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/intel/openvino/quantization.py", line 824, in _weight_only_quantization
return nncf.compress_weights(
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/nncf/quantization/quantize_model.py", line 522, in compress_weights
return compression_weights_impl(
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/nncf/openvino/quantization/quantize_model.py", line 461, in compress_weights_impl
return compression_algorithm.apply(model, graph, dataset=dataset)
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/nncf/quantization/algorithms/weight_compression/algorithm.py", line 305, in apply
activations = self._get_activations(dataset, self._subset_size, nodes_to_compress, graph, model)
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/nncf/quantization/algorithms/weight_compression/algorithm.py", line 523, in _get_activations
statistics_aggregator.collect_statistics(model, graph)
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/nncf/openvino/statistics/aggregator.py", line 36, in collect_statistics
super().collect_statistics(model, graph)
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/nncf/common/tensor_statistics/aggregator.py", line 78, in collect_statistics
outputs = engine.infer(input_data)
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/nncf/openvino/engine.py", line 85, in infer
return self.engine.infer(input_data)
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/nncf/openvino/engine.py", line 48, in infer
model_outputs = self.infer_request.infer(input_data, share_inputs=True)
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/openvino/runtime/ie_api.py", line 132, in infer
return OVDict(super().infer(_data_dispatch(
RuntimeError: Exception from src/inference/src/cpp/infer_request.cpp:223:
Exception from src/plugins/intel_cpu/src/graph.cpp:1367:
Node module.model.layers.0.self_attn/aten::scaled_dot_product_attention/ScaledDotProductAttention of type ScaledDotProductAttentionWithKVCache
Check 'm_k_state && m_v_state' failed at src/plugins/intel_cpu/src/nodes/scaled_attn.cpp:972:
ScaledDotProductAttentionWithKVCache node with name 'module.model.layers.0.self_attn/aten::scaled_dot_product_attention/ScaledDotProductAttention' has null input states

Installation instructions (Please mark the checkbox)
[O] I followed the installation guide at https://github.com/openvinotoolkit/openvino_notebooks#-installation-guide to install the notebooks.

Environment information
Please run python check_install.py in the openvino_notebooks directory. If the output is NOT OK for any of the checks, please follow the instructions to fix that. If that does not work, or if you still encounter the issue, please paste the output of check_install.py here.

Additional context
Add any other context about the problem here.