liang1wang opened 1 week ago
For example: https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/example/GPU/PyTorch-Models/Model/qwen1.5/generate.py
The current inference output is generated all at once. However, typical LLM inference generates output token by token, letting users watch the text appear gradually. Could you support this type of output? It would be even better if Gradio were also supported, e.g., via gr.ChatInterface.
Thanks!
Hi @liang1wang
generate.py uses the default `model.generate` API and arguments, which generate text up to a given length, i.e. `n_predict`. That's why it outputs all results at once.
If you want to generate text gradually, you can pass a `streamer` argument to `generate`, following this doc: https://huggingface.co/docs/transformers/v4.31.0/en/internal/generation_utils#transformers.TextStreamer
You can simply replace https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/example/GPU/PyTorch-Models/Model/qwen1.5/generate.py#L50-L70 with this code example:
```python
from transformers import TextStreamer

# TextStreamer prints each decoded token to stdout as soon as it is
# generated, so the output appears gradually instead of all at once.
streamer = TextStreamer(tokenizer)
output = model.generate(input_ids,
                        streamer=streamer,
                        max_new_tokens=args.n_predict)
```
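Note that `TextStreamer` writes the decoded text straight to stdout, which is enough for a command-line demo. To consume the tokens programmatically (for example, to feed a web UI), transformers also provides `TextIteratorStreamer`, which exposes them as a Python iterator; the Gradio sketch further below builds on it.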
You can also use the chat examples provided by ipex-llm:
- https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/webui_quickstart.md
- https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/chatchat_quickstart.md
- https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/fastchat_quickstart.md
- https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/privateGPT_quickstart.md
Synced with @liang1wang offline. He needs to use https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen1.5/generate.py together with TextStreamer.
For Gradio streaming chat, please refer to:
- https://www.gradio.app/guides/creating-a-custom-chatbot-with-blocks
- https://medium.com/@shrinath.suresh/building-an-interactive-streaming-chatbot-with-langchain-transformers-and-gradio-93b97378353e
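To make those pieces concrete, below is a minimal sketch of a streaming chat UI built on `gr.ChatInterface` and `TextIteratorStreamer`. It assumes `model` and `tokenizer` are already loaded as in generate.py; the `chat` function name, the `max_new_tokens` value, and the prompt handling are illustrative assumptions rather than ipex-llm APIs (a real Qwen chatbot would also fold `history` into the prompt via the model's chat template).

```python
from threading import Thread

import gradio as gr
from transformers import TextIteratorStreamer


def chat(message, history):
    # TextIteratorStreamer yields decoded tokens as an iterator instead of
    # printing them, so model.generate must run in a background thread
    # while we consume the stream here.
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True,
                                    skip_special_tokens=True)
    input_ids = tokenizer(message, return_tensors="pt").input_ids.to(model.device)
    Thread(target=model.generate,
           kwargs=dict(input_ids=input_ids,
                       streamer=streamer,
                       max_new_tokens=512)).start()

    # gr.ChatInterface re-renders the bot message on every yield,
    # so the reply appears token by token in the browser.
    partial = ""
    for token in streamer:
        partial += token
        yield partial


gr.ChatInterface(chat).launch()
```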
Starting to work on the gradio chatbot now. Thanks for your help, qiyuan~