I was looking at the code of the `chat()` function in `api/db/services/dialog_service`. I noticed that `max_tokens` is being used to limit the input size to the LLM, and the check is done in `message_fit_in`.
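As I understand it, `message_fit_in` does something roughly like the sketch below (my paraphrase, not the actual implementation; `count_tokens` is a stand-in of mine for whatever tokenizer the project really uses):

```python
def message_fit_in_sketch(messages, max_tokens):
    """My paraphrase of message_fit_in: keep only as many input messages
    as fit under max_tokens, and report how many tokens they consume."""
    def count_tokens(text):
        return len(text.split())  # stand-in for the real tokenizer

    used_token_count, kept = 0, []
    for m in messages:
        n = count_tokens(m["content"])
        if used_token_count + n > max_tokens:
            break
        kept.append(m)
        used_token_count += n
    return used_token_count, kept
```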
But then the following code comes right after `message_fit_in`:
if "max_tokens" in gen_conf:
gen_conf["max_tokens"] = min(
gen_conf["max_tokens"],
max_tokens - used_token_count)
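To make the arithmetic concrete, here is a minimal numeric sketch of that clamp (the values are hypothetical, not RAGFlow defaults):

```python
max_tokens = 4096                # model-level token budget
used_token_count = 3000          # tokens consumed by the input messages
gen_conf = {"max_tokens": 2048}  # requested generation limit

if "max_tokens" in gen_conf:
    gen_conf["max_tokens"] = min(
        gen_conf["max_tokens"],
        max_tokens - used_token_count)

print(gen_conf["max_tokens"])  # 1096: only what the input left over
```

So the more input tokens there are, the smaller `gen_conf["max_tokens"]` becomes.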
And this `gen_conf["max_tokens"]` is later used in `rag/llm/chat_model.py`, inside the `chat()` function of the `OllamaChat` class:
if "max_tokens" in gen_conf: options["num_predict"] = gen_conf["max_tokens"]
This implies that `max_tokens` is now being used to limit the output size instead. And if that is the case, why is the length of the input message (represented by `used_token_count`) being subtracted from `max_tokens`?
Thank you for helping!