openvinotoolkit / openvino.genai

Run Generative AI models with simple C++/Python API and using OpenVINO Runtime
Apache License 2.0

failed to run Llama-2-7b-chat-hf on NPU through Sample/Python #820

Open aoke79 opened 1 month ago

aoke79 commented 1 month ago

Hello, I failed to run Llama-2-7b-chat-hf on NPU. Please give me a hand.

1. I converted the model with the commands below and got two models:

   a) `optimum-cli export openvino --task text-generation -m Meta--Llama-2-7b-chat-hf --weight-format int4_sym_g128 --ratio 1.0 ov--Llama-2-7b-chat-hf-int4-sym-g128`

   b) `optimum-cli export openvino --task text-generation -m Meta--Llama-2-7b-chat-hf --weight-format int4 ov--Llama-2-7b-chat-hf-int4`

2. I ran chat_sample, benchmark_genai, and beam_search_causal_lm, and got similar results:

   a) `python beam_search_causal_lm.py c:\AIGC\hf\ov--Llama-2-7b-chat-hf-int4-sym-g128 "why the Sun is yellow?"`

   b) `python chat_sample.py c:\AIGC\hf\ov--Llama-2-7b-chat-hf-int4-sym-g128`

   c) `python benchmark_genai.py -m C:\AIGC\openvino\models\ov--Llama-2-7b-chat-hf-int4-sym-g128 -p "why the Sun is yellow?" -nw 1 -n 1 -mt 200 -d CPU`

```
(env_ov_genai) c:\AIGC\openvino\openvino.genai\samples\python\beam_search_causal_lm>python beam_search_causal_lm.py c:\AIGC\hf\ov--Llama-2-7b-chat-hf-int4-sym-g128 "why the Sun is yellow?"
Traceback (most recent call last):
  File "c:\AIGC\openvino\openvino.genai\samples\python\beam_search_causal_lm\beam_search_causal_lm.py", line 29, in <module>
    main()
  File "c:\AIGC\openvino\openvino.genai\samples\python\beam_search_causal_lm\beam_search_causal_lm.py", line 24, in main
    beams = pipe.generate(args.prompts, config)
RuntimeError: Exception from src\inference\src\cpp\infer_request.cpp:79:
Check '::getPort(port, name, {_impl->get_inputs(), _impl->get_outputs()})' failed at src\inference\src\cpp\infer_request.cpp:79:
Port for tensor name beam_idx was not found.
```

```
(env_ov_genai) c:\AIGC\openvino\openvino.genai\samples\python\benchmark_genai>python benchmark_genai.py -m c:\AIGC\openvino\models\TinyLlama-1.1B-Chat-v1.0\OV_FP16-4BIT_DEFAULT -p "why the Sun is yellow?" -nw 1 -n 1 -mt 200 -d NPU
Traceback (most recent call last):
  File "c:\AIGC\openvino\openvino.genai\samples\python\benchmark_genai\benchmark_genai.py", line 49, in <module>
    main()
  File "c:\AIGC\openvino\openvino.genai\samples\python\benchmark_genai\benchmark_genai.py", line 32, in main
    pipe.generate(prompt, config)
RuntimeError: Exception from C:\Jenkins\workspace\private-ci\ie\build-windows-vs2019\b\repos\openvino.genai\src\cpp\src\llm_pipeline_static.cpp:206:
Currently only batch size=1 is supported
```

```
(env_ov_genai) c:\AIGC\openvino\openvino.genai\samples\python>python chat_sample.py c:\AIGC\hf\ov--Llama-2-7b-chat-hf-int4-sym-g128
Traceback (most recent call last):
  File "c:\AIGC\openvino\openvino.genai\samples\python\chat_sample.py", line 43, in <module>
    main()
  File "c:\AIGC\openvino\openvino.genai\samples\python\chat_sample.py", line 22, in main
    pipe = openvino_genai.LLMPipeline(args.model_dir, device)
RuntimeError: Exception from src\core\src\pass\stateful_to_stateless.cpp:128:
Stateful models without beam_idx input are not supported in StatefulToStateless transformation
```

I'm not sure whether I converted the model correctly, so I generated the two models with the commands above, but neither of them worked. Could you please show me how to do this? Thanks a lot.
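For reference, one way to check whether an export has the stateful input that the errors above complain about is to inspect the model with the OpenVINO Python API. A minimal sketch, assuming the export directory from step 1 and the standard `openvino_model.xml` file name inside it:

```python
import openvino as ov

# Read the exported IR; the path is the export directory from step 1.
model = ov.Core().read_model(
    r"c:\AIGC\hf\ov--Llama-2-7b-chat-hf-int4-sym-g128\openvino_model.xml"
)

# A stateful text-generation-with-past export exposes a beam_idx input;
# a plain text-generation export does not, which matches the errors above.
print([inp.get_any_name() for inp in model.inputs])
```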

aoke79 commented 1 month ago

pip-list.txt — attaching the pip list FYI. Thanks.

aoke79 commented 1 month ago

Can anyone please take a look at this issue? Thanks.

Wovchena commented 1 month ago

`--task text-generation` is incorrect for optimum-cli here. Try `text-generation-with-past`, or don't specify it at all.
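Presumably the corrected export would look like the command below (the same model and output directory as in the original report, only the task changed); the `text-generation-with-past` task produces a stateful model with the `beam_idx` input that the GenAI pipelines expect:

```
optimum-cli export openvino --task text-generation-with-past -m Meta--Llama-2-7b-chat-hf --weight-format int4_sym_g128 --ratio 1.0 ov--Llama-2-7b-chat-hf-int4-sym-g128
```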

aoke79 commented 1 month ago

If I remove `--task text-generation`, it shows the error below:

```
optimum-cli export openvino -m Meta--Llama-2-7b-chat-hf --weight-format int4 ov--Llama-2-7b-chat-hf-int4
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\ProgramData\anaconda3\envs\env_ov_optimum\Scripts\optimum-cli.exe\__main__.py", line 7, in <module>
  File "C:\ProgramData\anaconda3\envs\env_ov_optimum\Lib\site-packages\optimum\commands\optimum_cli.py", line 208, in main
    service.run()
  File "C:\ProgramData\anaconda3\envs\env_ov_optimum\Lib\site-packages\optimum\commands\export\openvino.py", line 304, in run
    task = infer_task(self.args.task, self.args.model)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\env_ov_optimum\Lib\site-packages\optimum\exporters\openvino\__main__.py", line 54, in infer_task
    task = TasksManager.infer_task_from_model(model_name_or_path)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\env_ov_optimum\Lib\site-packages\optimum\exporters\tasks.py", line 1680, in infer_task_from_model
    task = cls._infer_task_from_model_name_or_path(model, subfolder=subfolder, revision=revision)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\env_ov_optimum\Lib\site-packages\optimum\exporters\tasks.py", line 1593, in _infer_task_from_model_name_or_path
    raise RuntimeError(
RuntimeError: Cannot infer the task from a local directory yet, please specify the task manually (image-to-text, image-to-image, image-classification, audio-classification, mask-generation, feature-extraction, zero-shot-image-classification, object-detection, image-segmentation, text-to-audio, semantic-segmentation, masked-im, sentence-similarity, audio-xvector, conversational, audio-frame-classification, stable-diffusion, automatic-speech-recognition, text2text-generation, fill-mask, question-answering, multiple-choice, text-classification, text-generation, zero-shot-object-detection, token-classification, stable-diffusion-xl, depth-estimation).
```

aoke79 commented 1 month ago

It worked with `--task text-generation-with-past`, as shown below:

```
INFO:nncf:Statistics of the bitwidth distribution:
+----------------+-----------------------------+----------------------------------------+
| Num bits (N)   | % all parameters (layers)   | % ratio-defining parameters (layers)   |
+================+=============================+========================================+
| 8              | 4% (2 / 226)                | 0% (0 / 224)                           |
+----------------+-----------------------------+----------------------------------------+
| 4              | 96% (224 / 226)             | 100% (224 / 224)                       |
+----------------+-----------------------------+----------------------------------------+
Applying Weight Compression ━━━━━━━━━━━━━━━━━━━━━━ 100% 226/226 • 0:03:17 • 0:00:00
Set tokenizer padding side to left for text-generation-with-past task.
```

BTW: how can I know which parameters to use for which models? Thanks a lot.
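As a general note, optimum-cli's built-in help lists the supported tasks and weight-compression options for the OpenVINO export command, which is one way to see which parameters apply:

```
optimum-cli export openvino --help
```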

aoke79 commented 1 month ago

I used the newly generated model, but `benchmark_genai` still does not work with it.

```
python benchmark_genai.py -m C:\AIGC\hf\llama2_7b_chat_ov_int4_default_24_3 -p "why the Sun is yellow?" -nw 1 -n 1 -mt 200 -d NPU
Traceback (most recent call last):
  File "C:\AIGC\openvino\openvino.genai\samples\python\benchmark_genai\benchmark_genai.py", line 49, in <module>
    main()
  File "C:\AIGC\openvino\openvino.genai\samples\python\benchmark_genai\benchmark_genai.py", line 32, in main
    pipe.generate(prompt, config)
RuntimeError: Exception from C:\Jenkins\workspace\private-ci\ie\build-windows-vs2019\b\repos\openvino.genai\src\cpp\src\llm_pipeline_static.cpp:206:
Currently only batch size=1 is supported
```

Thanks,

TolyaTalamanov commented 2 days ago

Hi @aoke79, the problem should be fixed already; please update the packages:

```
pip uninstall openvino openvino-tokenizers openvino-genai
pip install --pre openvino openvino-tokenizers openvino-genai --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
```
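After updating, a quick smoke test along these lines should confirm the NPU pipeline loads. This is a sketch reusing the model path and prompt from the report above (adjust the path to your export), built only from calls already shown in this thread:

```python
import openvino_genai

# Model directory exported earlier in the thread; adjust to your path.
pipe = openvino_genai.LLMPipeline(
    r"C:\AIGC\hf\llama2_7b_chat_ov_int4_default_24_3", "NPU"
)

# Limit generation length, matching the -mt 200 benchmark setting above.
config = openvino_genai.GenerationConfig()
config.max_new_tokens = 200

print(pipe.generate("why the Sun is yellow?", config))
```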