I was looking at the code of the `chat()` function in `api/db/services/dialog_service`. I noticed that `max_tokens` is being used to limit the input size to the LLM, and the check is done in `message_fit_in`.
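As I understand it, `message_fit_in` does something roughly like the sketch below (my paraphrase, not the actual implementation; `count_tokens` is a stand-in of mine for whatever tokenizer the project really uses):

```python
def message_fit_in_sketch(messages, max_tokens):
    """My paraphrase of message_fit_in: keep only as many input messages
    as fit under max_tokens, and report how many tokens they consume."""
    def count_tokens(text):
        return len(text.split())  # stand-in for the real tokenizer

    used_token_count, kept = 0, []
    for m in messages:
        n = count_tokens(m["content"])
        if used_token_count + n > max_tokens:
            break
        kept.append(m)
        used_token_count += n
    return used_token_count, kept
```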
But then the following code comes right after `message_fit_in`:
if "max_tokens" in gen_conf:
gen_conf["max_tokens"] = min(
gen_conf["max_tokens"],
max_tokens - used_token_count)
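To make the arithmetic concrete, here is a minimal numeric sketch of that clamp (the values are hypothetical, not RAGFlow defaults):

```python
max_tokens = 4096                # model-level token budget
used_token_count = 3000          # tokens consumed by the input messages
gen_conf = {"max_tokens": 2048}  # requested generation limit

if "max_tokens" in gen_conf:
    gen_conf["max_tokens"] = min(
        gen_conf["max_tokens"],
        max_tokens - used_token_count)

print(gen_conf["max_tokens"])  # 1096: only what the input left over
```

So the more input tokens there are, the smaller `gen_conf["max_tokens"]` becomes.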
And this `gen_conf["max_tokens"]` is later used in `rag/llm/chat_model.py`, inside the `chat()` function of the `OllamaChat` class:
if "max_tokens" in gen_conf: options["num_predict"] = gen_conf["max_tokens"]
This implies that `max_tokens` is now being used to limit the output size instead. And if that is the case, why is the length of the input message (represented by `used_token_count`) being subtracted from `max_tokens`?
Thank you for helping!