when call "/v1/chat/completions", there will call the function
check_length to compute max_new_tokens use min(max_tokens, context_len - token_num) where token_num is len(tokinzer(pormot).input_ids),
but when compute max_src_len use max_src_len = context_len - max_new_tokens - 1 in inference.py, this will lead to truncate the prompt every time.
such as context_len=4096, token_num=len(tokenizer(promot).input_ids)=8, max_new_tokens = 4096-8= 4088
then max_src_len = context_len - max_new_tokens - 1 = 4096-4088-1=7
when use input_ids = input_ids[-max_src_len:] to truncate the prompt, this will drop the first token
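A minimal sketch of the arithmetic, assuming the client's `max_tokens` is large enough that it is not the binding term of the `min` (the concrete values mirror the example above; the integer list is just a stand-in for the real tokenizer output):

```python
context_len = 4096
max_tokens = 8192  # assumed client-requested limit, larger than the context
token_num = 8      # len(tokenizer(prompt).input_ids)

# openai_api_server.py: check_length
max_new_tokens = min(max_tokens, context_len - token_num)  # 4088

# inference.py: generate_stream
max_src_len = context_len - max_new_tokens - 1             # 7

input_ids = list(range(token_num))    # stand-in for the real token ids
input_ids = input_ids[-max_src_len:]  # keeps only 7 of the 8 tokens
print(len(input_ids))                 # 7 -> the first prompt token is dropped
```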
when call "/v1/chat/completions", there will call the function check_length to compute
max_new_tokens
use min(max_tokens, context_len - token_num) where token_num is len(tokinzer(pormot).input_ids), but when computemax_src_len
usemax_src_len = context_len - max_new_tokens - 1
in inference.py, this will lead to truncate the prompt every time.such as context_len=4096, token_num=len(tokenizer(promot).input_ids)=8, max_new_tokens = 4096-8= 4088 then max_src_len = context_len - max_new_tokens - 1 = 4096-4088-1=7 when use
input_ids = input_ids[-max_src_len:]
All relevant links:
- https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/openai_api_server.py#L437
- https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/base_model_worker.py#L152
- https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/inference.py#L97
- https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/openai_api_server.py#L169