Tomas2D opened this issue 1 week ago
So `tokenLimit` would then be `max_sequence_length - max_output_tokens`, ensuring that the input context does not get trimmed during inference?
It seems reasonable that concrete LLM implementations would need to override this method and provide a `tokenLimit` based on the LLM's context window and max new tokens. Is there something else that I am not considering here?
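If it helps, here is a minimal sketch of that idea. It deliberately does not use the framework's real classes; `contextWindow` and `maxNewTokens` are illustrative stand-ins for whatever the concrete provider actually exposes.

```ts
// Minimal sketch, not the framework's real BaseLLM: a concrete LLM overriding
// meta() so that tokenLimit reserves room for generation.
interface LLMMeta {
  tokenLimit: number;
}

class ExampleProviderLLM {
  constructor(
    private readonly contextWindow: number, // model's total context window
    private readonly maxNewTokens: number, // tokens we plan to generate
  ) {}

  meta(): LLMMeta {
    // Leave headroom for generation so the input never gets trimmed at inference.
    return { tokenLimit: this.contextWindow - this.maxNewTokens };
  }
}

// Illustrative numbers only:
console.log(new ExampleProviderLLM(4096, 512).meta()); // { tokenLimit: 3584 }
```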
Based on my observations, we can look at token limits from 3 different perspectives:
Example with WatsonX and Granite 3:

- The maximum input size (the `max_sequence_length` property; the current value is 4096).
- The maximum output size (the `max_output_tokens` property; the current value is 8096).
- The size of the context window (the `max_output_tokens` property, which is 8096).

For BAM, only the size of the context window is provided. To detect the max input size, you have to invoke an LLM call with `max_new_tokens: 9999999` to trigger an error saying `property 'max_new_tokens' must be <= XXXX`.
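For what it's worth, that probing trick could look roughly like the sketch below. The `generate` callback is a hypothetical stand-in for whatever call the BAM client actually exposes, and the error format is only assumed from the message quoted above.

```ts
// Hedged sketch of the error-probing trick: request an absurd max_new_tokens
// and parse the limit out of the validation error.
async function probeTokenLimit(
  generate: (options: { prompt: string; max_new_tokens: number }) => Promise<unknown>,
): Promise<number | undefined> {
  try {
    await generate({ prompt: "ping", max_new_tokens: 9999999 });
  } catch (err) {
    // Expecting something like: property 'max_new_tokens' must be <= 4096
    const match = String(err).match(/must be <= (\d+)/);
    if (match) {
      return Number(match[1]);
    }
  }
  return undefined; // provider did not complain, or the message format changed
}
```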
For Ollama, only the context window size is provided. Other values seem to be unlimited (no validation error).
For OpenAI, no values are provided; everything must be hard-coded. Limits can be obtained only from an error message.
For Groq, no values are provided (the API probably works similarly to OpenAI).
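Summarizing the observations above as a data structure (the categorization is mine, so treat it as a rough map rather than a spec):

```ts
// Where each limit can be obtained, per provider (based on the notes above).
type LimitSource = "api" | "error-probe" | "hardcoded" | "unknown";

interface ProviderLimitSources {
  contextWindow: LimitSource;
  maxInputTokens: LimitSource;
  maxOutputTokens: LimitSource;
}

const limitSources: Record<string, ProviderLimitSources> = {
  watsonx: { contextWindow: "unknown", maxInputTokens: "api", maxOutputTokens: "api" },
  bam: { contextWindow: "api", maxInputTokens: "error-probe", maxOutputTokens: "unknown" },
  ollama: { contextWindow: "api", maxInputTokens: "unknown", maxOutputTokens: "unknown" },
  openai: { contextWindow: "hardcoded", maxInputTokens: "hardcoded", maxOutputTokens: "hardcoded" },
  groq: { contextWindow: "hardcoded", maxInputTokens: "hardcoded", maxOutputTokens: "hardcoded" },
};
```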
Now the question is which limit should be passed to `TokenMemory` and how `TokenMemory` should behave. In the context of Granite, let's say that my memory currently contains 3000 tokens' worth of messages. If I add a new message that has 2000 tokens, it would force the `TokenMemory` to remove some old messages (one or more, depending on their sizes) to stay under 4096 tokens (because we had initialized the `TokenMemory` with that value). Regarding Bee Agent + `TokenMemory`, this could lead to a situation where the agent (runner) may trigger an error because of this check.
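To make the scenario concrete, here is a simplified model of that eviction behavior. It is not the framework's actual `TokenMemory`; token counts are supplied by hand instead of coming from a tokenizer.

```ts
// Naive model of token-based eviction: drop the oldest messages until the
// total fits under the configured limit.
interface StoredMessage {
  text: string;
  tokens: number;
}

class NaiveTokenMemory {
  private messages: StoredMessage[] = [];

  constructor(private readonly tokenLimit: number) {}

  get tokensUsed(): number {
    return this.messages.reduce((sum, msg) => sum + msg.tokens, 0);
  }

  add(message: StoredMessage): void {
    this.messages.push(message);
    while (this.tokensUsed > this.tokenLimit && this.messages.length > 1) {
      this.messages.shift(); // silently lose the oldest context
    }
  }
}

const memory = new NaiveTokenMemory(4096); // Granite's max_sequence_length
memory.add({ text: "earlier conversation", tokens: 3000 });
memory.add({ text: "new message", tokens: 2000 });
console.log(memory.tokensUsed); // 2000 -- the 3000-token history was dropped
```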
How would you tackle this?
Right now, the `BaseLLM` [class](/src/llms/base.ts) defines an abstract method called `meta` that provides meta information about a given model. The response interface (`LLMMeta`) defines a single property called `tokenLimit`.
The problem is that `tokenLimit` alone is typically not enough, as providers usually subdivide limits further into the following:

- `input` (max input tokens) - for WatsonX, this field is called `max_sequence_length`.
- `output` (max generated tokens) - for WatsonX, this field is called `max_output_tokens`.
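One possible direction, purely as a sketch rather than a settled design: extend `LLMMeta` so providers can report both limits while keeping `tokenLimit` for backward compatibility.

```ts
// Sketch of a possible LLMMeta extension (not the current interface).
interface LLMMeta {
  tokenLimit: number; // existing field, kept for backward compatibility
  maxInputTokens?: number; // e.g. WatsonX max_sequence_length
  maxOutputTokens?: number; // e.g. WatsonX max_output_tokens
}

// What a WatsonX/Granite implementation could return, using the values above:
const graniteMeta: LLMMeta = {
  tokenLimit: 4096,
  maxInputTokens: 4096, // max_sequence_length
  maxOutputTokens: 8096, // max_output_tokens
};
```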
Because `TokenMemory` behavior heavily depends on the `tokenLimit` value, we must be sure that we are not throwing messages out because we have retrieved the wrong value from an LLM provider.

The solution to this issue is to figure out a better approach that plays nicely with `TokenMemory` and other practical usages.

Relates to #159 (Granite context window limit).