intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.

[Langchain-Chatchat]Add time consumption msg about first token and rest tokens #10628

Open · johnysh opened this issue 6 months ago

johnysh commented 6 months ago

The current model cannot report the time spent on the first token and on the rest of the tokens. Can we add this message?

Oscilloscope98 commented 6 months ago

Hi @johnysh,

Currently, we do not natively log the first token and rest token latency for Langchain-Chatchat. However, you can obtain this information with the help of the ipex-llm benchmark tool.

To use the benchmark tool in Langchain-Chatchat:

  1. Put benchmark_util.py into your conda env for Langchain-Chatchat (a copy sketch is shown after these steps):

    Taking Linux as an example, the script should be placed at a path like: /home/<user_name>/<anaconda3 or miniconda3>/envs/<your conda env name>/lib/python3.11/site-packages/ipex_llm/serving/fastchat/benchmark_util.py

  2. In /home/<user_name>/<anaconda3 or miniconda3>/envs/<your conda env name>/lib/python3.11/site-packages/ipex_llm/serving/fastchat/ipex_llm_worker.py, wrap the loaded model with BenchmarkWrapper (a conceptual sketch of what such a wrapper measures follows these steps):

    That is, change the code here to:

        self.model, self.tokenizer = load_model(
            model_path, device, self.load_in_low_bit, trust_remote_code
        )

        # Wrap the model so that first and rest token latencies are recorded
        from .benchmark_util import BenchmarkWrapper
        self.model = BenchmarkWrapper(self.model)
  3. In /home/<user_name>/<anaconda3 or miniconda3>/envs/<your conda env name>/lib/python3.11/site-packages/ipex_llm/serving/fastchat/ipex_llm_worker.py, print the first and rest token latency.

    That is, change the code here to:

        # Report the latencies collected by BenchmarkWrapper after each generation
        print(f"First token latency (s): {self.model.first_cost}", flush=True)
        print(f"Rest token latency (s): {self.model.rest_cost_mean}", flush=True)

        yield json.dumps(json_output).encode() + b"\0"
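
For step 1, here is a minimal copy sketch, assuming you have already downloaded benchmark_util.py locally (the "./benchmark_util.py" source path is a placeholder) and that ipex_llm is importable in the target conda env:

    # Minimal sketch for step 1 (run inside the Langchain-Chatchat conda env):
    # copy a locally downloaded benchmark_util.py into the installed ipex_llm
    # package so it can be imported as ipex_llm.serving.fastchat.benchmark_util.
    # "./benchmark_util.py" is a placeholder for wherever you saved the script.
    import shutil
    from pathlib import Path

    import ipex_llm  # used only to locate the installed package directory

    source = Path("./benchmark_util.py")
    target_dir = Path(ipex_llm.__file__).parent / "serving" / "fastchat"

    shutil.copy(source, target_dir / "benchmark_util.py")
    print(f"Copied {source} to {target_dir}")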
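
The sketch below is not the actual BenchmarkWrapper shipped with ipex-llm; it only illustrates the idea behind steps 2 and 3: timestamp tokens as they stream out of generate() and derive a first token latency plus an average latency for the remaining tokens. The model name, prompt, and generation settings are placeholders, and it uses plain Hugging Face transformers for clarity.

    # Illustrative sketch only, NOT the ipex-llm BenchmarkWrapper:
    # time the first streamed token separately from the rest.
    import time
    import threading

    from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

    model_id = "Qwen/Qwen2-1.5B-Instruct"  # placeholder model
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tokenizer("What is IPEX-LLM?", return_tensors="pt")
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)

    start = time.perf_counter()
    thread = threading.Thread(
        target=model.generate,
        kwargs=dict(**inputs, streamer=streamer, max_new_tokens=32),
    )
    thread.start()

    # Each streamed chunk roughly corresponds to one newly generated token
    timestamps = [time.perf_counter() for _ in streamer]
    thread.join()

    first_cost = timestamps[0] - start  # prompt processing + first token
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    rest_cost_mean = sum(gaps) / len(gaps) if gaps else 0.0
    print(f"First token latency (s): {first_cost:.3f}")
    print(f"Rest token latency (s): {rest_cost_mean:.3f}")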

Please let us know if you run into any further problems :)