when call "/v1/chat/completions", there will call the function
check_length to compute max_new_tokens use min(max_tokens, context_len - token_num) where token_num is len(tokinzer(pormot).input_ids),
but when compute max_src_len use max_src_len = context_len - max_new_tokens - 1 in inference.py, this will lead to truncate the prompt every time.
such as context_len=4096, token_num=len(tokenizer(promot).input_ids)=8, max_new_tokens = 4096-8= 4088
then max_src_len = context_len - max_new_tokens - 1 = 4096-4088-1=7
when use input_ids = input_ids[-max_src_len:] to truncate the prompt, this will drop the first token
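A minimal sketch of the arithmetic, assuming the client's `max_tokens` is large enough that it is not the binding term of the `min` (the concrete values mirror the example above; the integer list is just a stand-in for the real tokenizer output):

```python
context_len = 4096
max_tokens = 8192  # assumed client-requested limit, larger than the context
token_num = 8      # len(tokenizer(prompt).input_ids)

# openai_api_server.py: check_length
max_new_tokens = min(max_tokens, context_len - token_num)  # 4088

# inference.py: generate_stream
max_src_len = context_len - max_new_tokens - 1             # 7

input_ids = list(range(token_num))    # stand-in for the real token ids
input_ids = input_ids[-max_src_len:]  # keeps only 7 of the 8 tokens
print(len(input_ids))                 # 7 -> the first prompt token is dropped
```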
when call "/v1/chat/completions", there will call the function check_length to compute
max_new_tokens
use min(max_tokens, context_len - token_num) where token_num is len(tokinzer(pormot).input_ids), but when computemax_src_len
usemax_src_len = context_len - max_new_tokens - 1
in inference.py, this will lead to truncate the prompt every time.such as context_len=4096, token_num=len(tokenizer(promot).input_ids)=8, max_new_tokens = 4096-8= 4088 then max_src_len = context_len - max_new_tokens - 1 = 4096-4088-1=7 when use
input_ids = input_ids[-max_src_len:]
All relevant links:
- https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/openai_api_server.py#L437
- https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/base_model_worker.py#L152
- https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/inference.py#L97
- https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/openai_api_server.py#L169