openvinotoolkit / openvino_notebooks

📚 Jupyter notebook tutorials for OpenVINO™
Apache License 2.0

Error Occurs After Asking Consecutive Questions in LLM-Chatbot #2421

Open tim102187S opened 3 weeks ago

tim102187S commented 3 weeks ago

I am using OpenVINO 2024.4.0 and have downloaded the llama-3-8b-instruct model for use. When I ask several questions in a row, an error occurs, usually on the third query. I have checked my device's memory usage, and it is nowhere near full.

Here is the error report I received:

Selected model llama-3-8b-instruct
Checkbox(value=True, description='Prepare INT4 model')
Checkbox(value=False, description='Prepare INT8 model')
Checkbox(value=False, description='Prepare FP16 model')
Size of model with INT4 compressed weights is 5085.79 MB
Loading model from /home/adv/Downloads/openvino_notebooks/notebooks/llm-chatbot/llama-3-8b-instruct/INT4_compressed_weights
Compiling the model to CPU ...
Running on local URL: http://127.0.0.1:7861

To create a public link, set share=True in launch().
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:128001 for open-end generation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:128001 for open-end generation.

Traceback (most recent call last):
  File "/home/adv/openvino-llm/lib/python3.12/site-packages/gradio/queueing.py", line 536, in process_events
    response = await route_utils.call_process_api(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adv/openvino-llm/lib/python3.12/site-packages/gradio/route_utils.py", line 322, in call_process_api
    output = await app.get_blocks().process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adv/openvino-llm/lib/python3.12/site-packages/gradio/blocks.py", line 1935, in process_api
    result = await self.call_function(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adv/openvino-llm/lib/python3.12/site-packages/gradio/blocks.py", line 1532, in call_function
    prediction = await utils.async_iteration(iterator)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adv/openvino-llm/lib/python3.12/site-packages/gradio/utils.py", line 671, in async_iteration
    return await iterator.__anext__()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adv/openvino-llm/lib/python3.12/site-packages/gradio/utils.py", line 664, in __anext__
    return await anyio.to_thread.run_sync(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adv/openvino-llm/lib/python3.12/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adv/openvino-llm/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 2405, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/home/adv/openvino-llm/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 914, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adv/openvino-llm/lib/python3.12/site-packages/gradio/utils.py", line 647, in run_sync_iterator_async
    return next(iterator)
           ^^^^^^^^^^^^^^
  File "/home/adv/openvino-llm/lib/python3.12/site-packages/gradio/utils.py", line 809, in gen_wrapper
    response = next(iterator)
               ^^^^^^^^^^^^^^
  File "/home/adv/Downloads/EAS_GenAI_Intel14th/docker_build/llm_chatbot/run_chatbot.py", line 532, in bot
    for new_text in streamer:
  File "/home/adv/openvino-llm/lib/python3.12/site-packages/transformers/generation/streamers.py", line 223, in __next__
    value = self.text_queue.get(timeout=self.timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/queue.py", line 179, in get
    raise Empty
_queue.Empty
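The last frames point at the streamer: the chatbot's bot generator iterates over the TextIteratorStreamer, whose __next__ waits on an internal queue with a timeout, and queue.Empty propagates to Gradio when no new token arrives in time. A minimal sketch of that mechanism, assuming the streamer simply polls a standard queue.Queue with a timeout as the transformers frame above suggests:

```python
import queue

# Minimal sketch (assumption: TextIteratorStreamer.__next__ polls an internal
# queue.Queue with a timeout, as shown in the transformers frame of the
# traceback). If no token is put into the queue within the timeout,
# queue.Empty escapes to the caller, matching the error above.
text_queue = queue.Queue()

try:
    text_queue.get(timeout=0.1)  # nothing was put into the queue in time
except queue.Empty as err:
    print("queue.Empty raised, same exception class as in the traceback:", repr(err))
```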

brmarkus commented 3 weeks ago

Are you talking about "https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/llm-chatbot" (where llama-3-8b-instruct is mentioned) or "https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/llm-question-answering" (where e.g. tiny-llama-1b-chat is mentioned)?

Can you provide more details about your system, please (SoC, amount of memory, OS, version of Python, etc.)?

Can you provide example prompts, please?

tim102187S commented 3 weeks ago

Thank you for your response.

I am using the model (llama-3-8b-instruct) and code from this project: https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/llm-chatbot/llm-chatbot.ipynb

Here are my system details:

OS: Ubuntu 24.04
Memory: 32 GB
CPU: Intel(R) Core(TM) Ultra 7 165U
Python version: 3.12.3

The prompts I am using also come from the examples in the project, such as:

"hello there! How are you doing?" "What is OpenVINO?" "Who are you?" "Can you explain to me briefly what is Python programming language?" etc.

Please let me know if you need any further information.

brmarkus commented 3 weeks ago

Have you seen errors or warnings in the steps for conversion and compression?

Do you see the same when using the INT8 or FP16 variant instead of the INT4 variant?

Do you start the Jupyter notebook from within a virtual environment (with a "guaranteed" set of component versions), or "globally", using the components installed globally on your local machine?

Do you use a specific version or branch of the OpenVINO-Notebooks repo, or the "latest head revision"?

When running under MS-Win11 with the latest version, I can query multiple prompts without problems using the INT4 model... (but my laptop has 64 GB of RAM and a Core Ultra 7 155H)

tim102187S commented 3 weeks ago

Thank you for your suggestions.

I did not see any errors or warnings during the model conversion and compression steps.

We have not yet tried using the non-INT4 variants, as the focus of our research project is primarily on INT4 models.

We are running the Jupyter notebook in a Python virtual environment and following the steps outlined in the llm-chatbot.ipynb notebook.

This research project requires the use of the Ubuntu 24.04 system, so we are hoping to resolve the issue within this setup. (During the execution of the chatbot, the memory usage is approximately 7GB, so the errors are not due to insufficient memory.)

brmarkus commented 3 weeks ago

For conversion and compression I would expect the operating system to start swapping memory to HDD/SSD anyway if the system memory is not big enough...

Let's see if someone else can reproduce it under a similar environment... sorry, I cannot reproduce the problems you describe. Have you modified the code or the model?

Can you reproduce it with another model?

tim102187S commented 3 weeks ago

Thank you for your follow-up.

I have also tried using the llama-2-7b-chatbot model with INT4, INT8, and FP16, and I encountered the same issue in all cases.

Additionally, I would like to clarify that I have not made any modifications to the code or the model.

aleksandr-mokrov commented 2 weeks ago

@tim102187S, it looks like this is due to the 30-second timeout. Could you try increasing the value, or removing it entirely, in this row and check: streamer = TextIteratorStreamer(tok, timeout=30.0, skip_prompt=True, skip_special_tokens=True)
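A sketch of the adjusted line, assuming `tok` is the tokenizer object the notebook already loads (as in the quoted snippet); passing timeout=None makes the underlying queue.get block until the next token instead of raising queue.Empty:

```python
from transformers import TextIteratorStreamer

# Sketch of the proposed change (assumption: `tok` is the tokenizer created
# earlier in the notebook). Either raise the timeout well above the slowest
# expected wait for a token, or drop it entirely with timeout=None so the
# streamer blocks instead of raising queue.Empty on slow generations.
streamer = TextIteratorStreamer(
    tok,
    timeout=None,          # was timeout=30.0
    skip_prompt=True,
    skip_special_tokens=True,
)
```

On a CPU-only Core Ultra with an 8B model, the wait for the first token can plausibly exceed 30 seconds once the chat history has grown after a few turns, which would explain why the error tends to appear around the third query.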