Describe the bug
There are issues converting the Llama 3 model to int4 quantization and running the int8-quantized model.

Expected behavior
The int4 export should complete without errors, and the int8-quantized model should run successfully.

Screenshots
Here's the error log from the int4 conversion.

Export command:
optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B-Instruct --task text-generation-with-past --weight-format int4 --group-size 128 --ratio 0.8 --sym --awq --dataset wikitext2 --num-samples 128 llama-3-8b-instruct/INT4_compressed_weights
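
For reference, a roughly equivalent export through the optimum-intel Python API is sketched below. This is a minimal sketch for illustration only, not the command actually used; the AWQ-related parameter name is an assumption mirroring the CLI's --awq flag and may differ between optimum-intel versions. The log from the failing CLI export follows the sketch.

# Hypothetical Python-API equivalent of the failing CLI export (sketch only).
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

quantization_config = OVWeightQuantizationConfig(
    bits=4,
    sym=True,
    group_size=128,
    ratio=0.8,
    dataset="wikitext2",
    num_samples=128,
    quant_method="awq",  # assumption: mirrors the CLI --awq flag; name may vary by version
)

model = OVModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    export=True,
    quantization_config=quantization_config,
)
model.save_pretrained("llama-3-8b-instruct/INT4_compressed_weights")
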
Framework not specified. Using pt to export the model.
Loading checkpoint shards: 100%|██████████████████| 4/4 [00:01<00:00, 3.36it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Using framework PyTorch: 2.3.1+cpu
Overriding 1 configuration item(s)
use_cache -> True
/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/exporters/openvino/model_patcher.py:452: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if sequence_length != 1:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Statistics collection ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/128 • 0:00:00 • -:--:--
Traceback (most recent call last):
File "/home/intel/Flex/jason/ov_nb_env/bin/optimum-cli", line 8, in <module>
sys.exit(main())
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/commands/optimum_cli.py", line 163, in main
service.run()
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/commands/export/openvino.py", line 345, in run
model = OVModelForCausalLM.from_pretrained(
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/modeling_base.py", line 402, in from_pretrained
return from_pretrained_method(
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/intel/openvino/modeling_decoder.py", line 301, in _from_transformers
return cls._from_pretrained(
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/intel/openvino/modeling_decoder.py", line 815, in _from_pretrained
quantizer.quantize(ov_config=OVConfig(quantization_config=quantization_config_copy))
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/intel/openvino/quantization.py", line 295, in quantize
self._quantize_ovbasemodel(
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/intel/openvino/quantization.py", line 411, in _quantize_ovbasemodel
_weight_only_quantization(self.model.model, quantization_config, calibration_dataset)
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/intel/openvino/quantization.py", line 824, in _weight_only_quantization
return nncf.compress_weights(
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/nncf/quantization/quantize_model.py", line 522, in compress_weights
return compression_weights_impl(
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/nncf/openvino/quantization/quantize_model.py", line 461, in compress_weights_impl
return compression_algorithm.apply(model, graph, dataset=dataset)
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/nncf/quantization/algorithms/weight_compression/algorithm.py", line 305, in apply
activations = self._get_activations(dataset, self._subset_size, nodes_to_compress, graph, model)
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/nncf/quantization/algorithms/weight_compression/algorithm.py", line 523, in _get_activations
statistics_aggregator.collect_statistics(model, graph)
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/nncf/openvino/statistics/aggregator.py", line 36, in collect_statistics
super().collect_statistics(model, graph)
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/nncf/common/tensor_statistics/aggregator.py", line 78, in collect_statistics
outputs = engine.infer(input_data)
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/nncf/openvino/engine.py", line 85, in infer
return self.engine.infer(input_data)
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/nncf/openvino/engine.py", line 48, in infer
model_outputs = self.infer_request.infer(input_data, share_inputs=True)
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/openvino/runtime/ie_api.py", line 132, in infer
return OVDict(super().infer(_data_dispatch(
RuntimeError: Exception from src/inference/src/cpp/infer_request.cpp:223:
Exception from src/plugins/intel_cpu/src/graph.cpp:1367:
Node module.model.layers.0.self_attn/aten::scaled_dot_product_attention/ScaledDotProductAttention of type ScaledDotProductAttentionWithKVCache
Check 'm_k_state && m_v_state' failed at src/plugins/intel_cpu/src/nodes/scaled_attn.cpp:972:
ScaledDotProductAttentionWithKVCache node with name 'module.model.layers.0.self_attn/aten::scaled_dot_product_attention/ScaledDotProductAttention' has null input states

Here's the error log for running the int8 model.
Export command:
optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B-Instruct --task text-generation-with-past --weight-format int4 --group-size 128 --ratio 0.8 --sym --awq --dataset wikitext2 --num-samples 128 llama-3-8b-instruct/INT4_compressed_weights
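
For context, a minimal sketch of how an int8-exported model is typically loaded and run with optimum-intel is shown below; the output directory name and prompt are placeholders rather than values taken from this report. The error log follows the sketch.

# Hypothetical reproduction of "running the int8 model" with optimum-intel (sketch only).
from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM

model_dir = "llama-3-8b-instruct/INT8_compressed_weights"  # placeholder path, not from the report
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = OVModelForCausalLM.from_pretrained(model_dir, device="CPU")

inputs = tokenizer("Hello, how are you?", return_tensors="pt")  # placeholder prompt
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
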
Framework not specified. Using pt to export the model.
Loading checkpoint shards: 100%|██████████████████| 4/4 [00:01<00:00, 3.36it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Using framework PyTorch: 2.3.1+cpu
Overriding 1 configuration item(s)
use_cache -> True
/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/exporters/openvino/model_patcher.py:452: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if sequence_length != 1:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Statistics collection ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/128 • 0:00:00 • -:--:--
Traceback (most recent call last):
File "/home/intel/Flex/jason/ov_nb_env/bin/optimum-cli", line 8, in <module>
sys.exit(main())
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/commands/optimum_cli.py", line 163, in main
service.run()
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/commands/export/openvino.py", line 345, in run
model = OVModelForCausalLM.from_pretrained(
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/modeling_base.py", line 402, in from_pretrained
return from_pretrained_method(
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/intel/openvino/modeling_decoder.py", line 301, in _from_transformers
return cls._from_pretrained(
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/intel/openvino/modeling_decoder.py", line 815, in _from_pretrained
quantizer.quantize(ov_config=OVConfig(quantization_config=quantization_config_copy))
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/intel/openvino/quantization.py", line 295, in quantize
self._quantize_ovbasemodel(
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/intel/openvino/quantization.py", line 411, in _quantize_ovbasemodel
_weight_only_quantization(self.model.model, quantization_config, calibration_dataset)
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/optimum/intel/openvino/quantization.py", line 824, in _weight_only_quantization
return nncf.compress_weights(
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/nncf/quantization/quantize_model.py", line 522, in compress_weights
return compression_weights_impl(
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/nncf/openvino/quantization/quantize_model.py", line 461, in compress_weights_impl
return compression_algorithm.apply(model, graph, dataset=dataset)
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/nncf/quantization/algorithms/weight_compression/algorithm.py", line 305, in apply
activations = self._get_activations(dataset, self._subset_size, nodes_to_compress, graph, model)
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/nncf/quantization/algorithms/weight_compression/algorithm.py", line 523, in _get_activations
statistics_aggregator.collect_statistics(model, graph)
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/nncf/openvino/statistics/aggregator.py", line 36, in collect_statistics
super().collect_statistics(model, graph)
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/nncf/common/tensor_statistics/aggregator.py", line 78, in collect_statistics
outputs = engine.infer(input_data)
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/nncf/openvino/engine.py", line 85, in infer
return self.engine.infer(input_data)
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/nncf/openvino/engine.py", line 48, in infer
model_outputs = self.infer_request.infer(input_data, share_inputs=True)
File "/home/intel/Flex/jason/ov_nb_env/lib/python3.10/site-packages/openvino/runtime/ie_api.py", line 132, in infer
return OVDict(super().infer(_data_dispatch(
RuntimeError: Exception from src/inference/src/cpp/infer_request.cpp:223:
Exception from src/plugins/intel_cpu/src/graph.cpp:1367:
Node module.model.layers.0.self_attn/aten::scaled_dot_product_attention/ScaledDotProductAttention of type ScaledDotProductAttentionWithKVCache
Check 'm_k_state && m_v_state' failed at src/plugins/intel_cpu/src/nodes/scaled_attn.cpp:972:
ScaledDotProductAttentionWithKVCache node with name 'module.model.layers.0.self_attn/aten::scaled_dot_product_attention/ScaledDotProductAttention' has null input states

Installation instructions (Please mark the checkbox)
[O] I followed the installation guide at https://github.com/openvinotoolkit/openvino_notebooks#-installation-guide to install the notebooks.

Environment information
Please run python check_install.py in the openvino_notebooks directory. If the output is NOT OK for any of the checks, please follow the instructions to fix that. If that does not work, or if you still encounter the issue, please paste the output of check_install.py here.

Additional context
Add any other context about the problem here.