intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, Axolotl, etc.

Generate results token by token during inference. #11464

Open liang1wang opened 1 week ago

liang1wang commented 1 week ago

For example: https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/example/GPU/PyTorch-Models/Model/qwen1.5/generate.py

The current inference output is generated all at once. However, typical LLM inference generates output token by token, allowing users to see the text appear gradually. Could you support this type of output? It would be even better if gradio were also supported, e.g., gr.ChatInterface.

Thanks!

qiyuangong commented 4 days ago

> For example: https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/example/GPU/PyTorch-Models/Model/qwen1.5/generate.py
>
> The current inference output is generated all at once. However, typical LLM inference generates output token by token, allowing users to see the text appear gradually. Could you support this type of output? It would be even better if gradio were also supported, e.g., gr.ChatInterface.
>
> Thanks!

Hi @liang1wang

generate.py uses the default model.generate API and args, which generate text up to a given length, i.e., n_predict. That's why it outputs the whole result at once.

If you want to generate text gradually, you can pass a streamer arg to model.generate, as described in this doc: https://huggingface.co/docs/transformers/v4.31.0/en/internal/generation_utils#transformers.TextStreamer

You can simply replace https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/example/GPU/PyTorch-Models/Model/qwen1.5/generate.py#L50-L70 with this code example.

    from transformers import TextStreamer

    # TextStreamer prints each decoded token to stdout as soon as it is generated
    streamer = TextStreamer(tokenizer)
    output = model.generate(input_ids,
                            streamer=streamer,
                            max_new_tokens=args.n_predict)
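
If you need to consume the generated text programmatically instead of just printing it (for example, to feed a UI), transformers also provides TextIteratorStreamer, which yields decoded text chunks while generation runs in a background thread. A minimal sketch, assuming the same model, tokenizer, input_ids and args.n_predict as in the example script:

    from threading import Thread
    from transformers import TextIteratorStreamer

    # Skip the prompt and special tokens in the streamed output
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    # Run generation in a background thread so the streamer can be iterated here
    generation_kwargs = dict(input_ids=input_ids,
                             streamer=streamer,
                             max_new_tokens=args.n_predict)
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()

    for new_text in streamer:
        print(new_text, end="", flush=True)
    thread.join()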

You can also use chat examples provided by ipex-llm:

https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/webui_quickstart.md
https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/chatchat_quickstart.md
https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/fastchat_quickstart.md
https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/privateGPT_quickstart.md

qiyuangong commented 2 days ago

Synced offline with @liang1wang.

You need to use https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen1.5/generate.py together with TextStreamer.

For gradio streaming chat, please refer to:
https://www.gradio.app/guides/creating-a-custom-chatbot-with-blocks
https://medium.com/@shrinath.suresh/building-an-interactive-streaming-chatbot-with-langchain-transformers-and-gradio-93b97378353e
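
For reference, a minimal streaming gr.ChatInterface sketch, assuming model and tokenizer are already loaded with ipex-llm as in the generate.py example (the prompt handling and max_new_tokens value below are placeholders, not the guides' exact code):

    from threading import Thread
    import gradio as gr
    from transformers import TextIteratorStreamer

    def chat(message, history):
        # Tokenize the user message; a real chatbot would apply the model's chat template
        inputs = tokenizer(message, return_tensors="pt").to(model.device)
        streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

        # Generate in a background thread and stream partial text back to the UI
        thread = Thread(target=model.generate,
                        kwargs=dict(**inputs, streamer=streamer, max_new_tokens=256))
        thread.start()

        partial = ""
        for new_text in streamer:
            partial += new_text
            yield partial  # gr.ChatInterface renders each yielded string as the growing reply
        thread.join()

    gr.ChatInterface(chat).launch()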

liang1wang commented 17 hours ago

Starting to work with the gradio chatbot now, thanks for your help qiyuan~