infiniflow / ragflow

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.
https://ragflow.io
Apache License 2.0

[Question]: Why is max_tokens being used for both input and output #1403

Open · PROGRAMMERHAO opened this issue 4 months ago

PROGRAMMERHAO commented 4 months ago

Describe your problem

I was looking at the chat() function in api/db/services/dialog_service.

I noticed that max_tokens is used to limit the input size sent to the LLM, and that check is done in message_fit_in. But then this code follows message_fit_in:

 if "max_tokens" in gen_conf:
        gen_conf["max_tokens"] = min(
            gen_conf["max_tokens"],
            max_tokens - used_token_count)
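
For context, here is a minimal sketch of what an input-fitting step like message_fit_in does conceptually: trim the conversation until the prompt fits the token budget. The count_tokens helper and the trimming policy below are hypothetical illustrations, not RAGFlow's actual implementation:

    # Hypothetical sketch of an input-fitting step; not RAGFlow's real code.
    def message_fit_in(messages, max_tokens, count_tokens):
        """Drop the oldest non-system turns until the prompt fits max_tokens."""
        msgs = list(messages)
        used = sum(count_tokens(m["content"]) for m in msgs)
        while used > max_tokens and len(msgs) > 1:
            dropped = msgs.pop(1)  # keep the system prompt at index 0
            used -= count_tokens(dropped["content"])
        return used, msgs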

This gen_conf["max_tokens"] is later used in rag/llm/chat_model.py, inside the chat() function of the OllamaChat class:

if "max_tokens" in gen_conf: options["num_predict"] = gen_conf["max_tokens"]

This implies that max_tokens is now used to limit the output size instead. If that is the case, why is the length of the input messages (represented by used_token_count) being subtracted from max_tokens?
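
To make the question concrete, plugging illustrative numbers into the snippet quoted above:

    max_tokens = 4096        # illustrative budget checked by message_fit_in
    used_token_count = 3700  # tokens already consumed by the input messages
    gen_conf = {"max_tokens": 512}

    # The quoted code shrinks the requested output cap by the input usage:
    gen_conf["max_tokens"] = min(gen_conf["max_tokens"],
                                 max_tokens - used_token_count)
    print(gen_conf["max_tokens"])  # 396, not the requested 512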

Thank you for helping!

KevinHuSh commented 4 months ago

In Ollama, the definition of max_tokens is indeed different from other providers'. BTW, you could star the project to follow it. Thanks!
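
For readers landing here later: one way to reconcile the two uses is to read max_tokens in dialog_service as a total budget shared by input and output, while Ollama's num_predict is an output-only cap. A hedged sketch of that reading (the numbers are illustrative):

    context_budget = 4096  # reading 1: max_tokens as a shared input+output budget
    prompt_tokens = 3700   # what the input messages already consumed

    # Room left for generation under reading 1:
    output_room = context_budget - prompt_tokens  # 396

    # Reading 2: an output-only cap, which is what Ollama's num_predict is.
    requested_output = 512

    # The snippet in the question bridges the two readings by clamping the
    # output-only cap to whatever room the input left over:
    num_predict = min(requested_output, output_room)  # 396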