Open · johnysh opened 6 months ago

The current model setup is unable to report the time spent on the first token and the rest of the tokens. Can we add this message?
Hi @johnysh,
Currently, Langchain-Chatchat does not natively log first-token and rest-token latency. However, you can obtain these timings with the help of the ipex-llm benchmark tool.
To use the benchmark tool in Langchain-Chatchat:
1. Put benchmark_util.py in your conda env for Langchain-Chatchat. The destination path should look like this (taking Linux as an example):

/home/<user_name>/<anaconda3 or miniconda3>/envs/<your conda env name>/lib/python3.11/site-packages/ipex_llm/serving/fastchat/benchmark_util.py
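If you are unsure where that directory lives on your machine, a small helper like the one below can print it. This is just a convenience sketch, assuming ipex-llm is installed in the currently active environment and that ipex_llm.serving.fastchat is importable:

```python
# Convenience sketch: print the directory that benchmark_util.py should be
# copied into, assuming ipex-llm is installed in the active environment.
import os
import ipex_llm.serving.fastchat as fastchat_serving

print(os.path.dirname(os.path.abspath(fastchat_serving.__file__)))
```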
2. In /home/<user_name>/<anaconda3 or miniconda3>/envs/<your conda env name>/lib/python3.11/site-packages/ipex_llm/serving/fastchat/ipex_llm_worker.py, wrap the model with BenchmarkWrapper. That is, change the code here to:
```python
self.model, self.tokenizer = load_model(
    model_path, device, self.load_in_low_bit, trust_remote_code
)
# Wrap the loaded model so that generation records per-token timings
from .benchmark_util import BenchmarkWrapper
self.model = BenchmarkWrapper(self.model)
```
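For intuition, here is a minimal, hypothetical sketch of the technique such a wrapper uses (this is not ipex-llm's actual BenchmarkWrapper implementation): time the first forward pass separately from each subsequent decode step, and expose the results as first_cost and rest_cost_mean:

```python
# Hypothetical sketch of a latency-measuring wrapper around a Hugging Face
# causal LM; NOT ipex-llm's actual BenchmarkWrapper implementation.
import time
import torch

class TimingWrapper:
    def __init__(self, model):
        self.model = model
        self.first_cost = None       # latency of the first token (s)
        self.rest_cost_mean = None   # mean latency of the remaining tokens (s)

    @torch.no_grad()
    def generate(self, input_ids, max_new_tokens=32):
        step_times = []
        past = None
        ids = input_ids
        for _ in range(max_new_tokens):
            t0 = time.perf_counter()
            out = self.model(ids, past_key_values=past, use_cache=True)
            step_times.append(time.perf_counter() - t0)
            past = out.past_key_values
            ids = out.logits[:, -1:].argmax(dim=-1)   # greedy next token
            input_ids = torch.cat([input_ids, ids], dim=-1)
        self.first_cost = step_times[0]
        self.rest_cost_mean = sum(step_times[1:]) / max(len(step_times) - 1, 1)
        return input_ids
```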
3. In the same ipex_llm_worker.py file, add print messages for the first and rest token latency. That is, change the code here to:
print(f"First token latency (s): {self.model.first_cost}", flush=True)
print(f"Rest token latency (s): {self.model.rest_cost_mean}", flush=True)
yield json.dumps(json_output).encode() + b"\0"
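Once these two numbers appear in the worker log, decode throughput follows directly: tokens per second is the reciprocal of the mean rest-token latency. For example (illustrative values, not measurements):

```python
# Illustrative only: turn the logged latencies into throughput figures
first_cost = 0.35       # example first-token latency in seconds
rest_cost_mean = 0.045  # example mean rest-token latency in seconds

print(f"Time to first token: {first_cost:.3f} s")
print(f"Decode throughput:   {1.0 / rest_cost_mean:.1f} tokens/s")
```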
Please let us know if you run into any further problems :)