lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

prompt will always be truncated #3399

Open jiaoyangkuohai opened 3 months ago

jiaoyangkuohai commented 3 months ago

When "/v1/chat/completions" is called, the function check_length computes max_new_tokens as min(max_tokens, context_len - token_num), where token_num is len(tokenizer(prompt).input_ids). But inference.py then computes max_src_len = context_len - max_new_tokens - 1, which leads to the prompt being truncated every time.

For example, with context_len = 4096 and token_num = len(tokenizer(prompt).input_ids) = 8, we get max_new_tokens = 4096 - 8 = 4088, and then max_src_len = context_len - max_new_tokens - 1 = 4096 - 4088 - 1 = 7. When input_ids = input_ids[-max_src_len:] is used to truncate the prompt, the first token is dropped.
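
A minimal Python sketch of this interaction (a simplified paraphrase of the two code paths linked below; the function names and signatures here are illustrative, not the actual FastChat signatures):

```python
# Sketch of the arithmetic described in this issue, not the real FastChat code.
def check_length(prompt_token_num: int, max_tokens: int, context_len: int) -> int:
    # openai_api_server.py: max_new_tokens = min(max_tokens, context_len - token_num)
    return min(max_tokens, context_len - prompt_token_num)

def truncate_prompt(input_ids: list, max_new_tokens: int, context_len: int) -> list:
    # inference.py: max_src_len = context_len - max_new_tokens - 1
    max_src_len = context_len - max_new_tokens - 1
    return input_ids[-max_src_len:]

context_len = 4096
input_ids = list(range(8))  # token_num = 8, as in the example above

max_new_tokens = check_length(len(input_ids), max_tokens=4096, context_len=context_len)
print(max_new_tokens)       # 4088

kept = truncate_prompt(input_ids, max_new_tokens, context_len)
print(len(kept))            # 7 -> the first of the 8 prompt tokens is dropped
```

Because max_new_tokens already accounts for the full prompt length, subtracting it again (plus 1) leaves max_src_len one token short of the actual prompt, so the leading token is always cut off.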

All related links:
https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/openai_api_server.py#L437
https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/base_model_worker.py#L152
https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/inference.py#L97
https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/openai_api_server.py#L169