Tomas2D opened this issue 1 week ago
So `tokenLimit` would then be `max_sequence_length - max_output_tokens`, ensuring that the input context does not get trimmed during inference?
It seems reasonable that concrete LLM implementations would need to override this method and provide a `tokenLimit` based on the LLM's context window and max new tokens. Is there something else that I am not considering here?
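If it helps, here is a minimal sketch of that idea. It deliberately does not use the framework's real classes; `contextWindow` and `maxNewTokens` are illustrative stand-ins for whatever the concrete provider actually exposes.

```ts
// Minimal sketch, not the framework's real BaseLLM: a concrete LLM overriding
// meta() so that tokenLimit reserves room for generation.
interface LLMMeta {
  tokenLimit: number;
}

class ExampleProviderLLM {
  constructor(
    private readonly contextWindow: number, // model's total context window
    private readonly maxNewTokens: number, // tokens we plan to generate
  ) {}

  meta(): LLMMeta {
    // Leave headroom for generation so the input never gets trimmed at inference.
    return { tokenLimit: this.contextWindow - this.maxNewTokens };
  }
}

// Illustrative numbers only:
console.log(new ExampleProviderLLM(4096, 512).meta()); // { tokenLimit: 3584 }
```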
Based on my observations, we can look at token limits from 3 different perspectives:
Example with WatsonX and Granite 3:

- The maximum input size (the `max_sequence_length` property; the current value is 4096).
- The maximum output size (the `max_output_tokens` property; the current value is 8096).
- The size of the context window (the `max_output_tokens` property, which is 8096).

For BAM, only the size of the context window is provided. To detect the max input size, you have to invoke an LLM call with `max_new_tokens: 9999999` to trigger an error saying `property 'max_new_tokens' must be <= XXXX`.
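For what it's worth, that probing trick could look roughly like the sketch below. The `generate` callback is a hypothetical stand-in for whatever call the BAM client actually exposes, and the error format is only assumed from the message quoted above.

```ts
// Hedged sketch of the error-probing trick: request an absurd max_new_tokens
// and parse the limit out of the validation error.
async function probeTokenLimit(
  generate: (options: { prompt: string; max_new_tokens: number }) => Promise<unknown>,
): Promise<number | undefined> {
  try {
    await generate({ prompt: "ping", max_new_tokens: 9999999 });
  } catch (err) {
    // Expecting something like: property 'max_new_tokens' must be <= 4096
    const match = String(err).match(/must be <= (\d+)/);
    if (match) {
      return Number(match[1]);
    }
  }
  return undefined; // provider did not complain, or the message format changed
}
```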
For Ollama, only the context window size is provided. Other values seem to be unlimited (no validation error).
For OpenAI, no values are provided; everything must be hard-coded. Limits can be obtained only from an error message.
For Groq, no values are provided (the API probably works similarly to OpenAI).
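Summarizing the observations above as a data structure (the categorization is mine, so treat it as a rough map rather than a spec):

```ts
// Where each limit can be obtained, per provider (based on the notes above).
type LimitSource = "api" | "error-probe" | "hardcoded" | "unknown";

interface ProviderLimitSources {
  contextWindow: LimitSource;
  maxInputTokens: LimitSource;
  maxOutputTokens: LimitSource;
}

const limitSources: Record<string, ProviderLimitSources> = {
  watsonx: { contextWindow: "unknown", maxInputTokens: "api", maxOutputTokens: "api" },
  bam: { contextWindow: "api", maxInputTokens: "error-probe", maxOutputTokens: "unknown" },
  ollama: { contextWindow: "api", maxInputTokens: "unknown", maxOutputTokens: "unknown" },
  openai: { contextWindow: "hardcoded", maxInputTokens: "hardcoded", maxOutputTokens: "hardcoded" },
  groq: { contextWindow: "hardcoded", maxInputTokens: "hardcoded", maxOutputTokens: "hardcoded" },
};
```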
Now the question is which limit should be passed to `TokenMemory` and how `TokenMemory` should behave. In the context of Granite, let's say that my memory currently contains 3000 tokens' worth of messages. If I add a new message that has 2000 tokens, it would force the `TokenMemory` to remove some old messages (one or more, depending on their sizes) to stay under 4096 tokens (because we had initialized the `TokenMemory` with that value). Regarding Bee Agent + `TokenMemory`, this could lead to a situation where the agent (runner) may trigger an error because of this check.
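To make the scenario concrete, here is a simplified model of that eviction behavior. It is not the framework's actual `TokenMemory`; token counts are supplied by hand instead of coming from a tokenizer.

```ts
// Naive model of token-based eviction: drop the oldest messages until the
// total fits under the configured limit.
interface StoredMessage {
  text: string;
  tokens: number;
}

class NaiveTokenMemory {
  private messages: StoredMessage[] = [];

  constructor(private readonly tokenLimit: number) {}

  get tokensUsed(): number {
    return this.messages.reduce((sum, msg) => sum + msg.tokens, 0);
  }

  add(message: StoredMessage): void {
    this.messages.push(message);
    while (this.tokensUsed > this.tokenLimit && this.messages.length > 1) {
      this.messages.shift(); // silently lose the oldest context
    }
  }
}

const memory = new NaiveTokenMemory(4096); // Granite's max_sequence_length
memory.add({ text: "earlier conversation", tokens: 3000 });
memory.add({ text: "new message", tokens: 2000 });
console.log(memory.tokensUsed); // 2000 -- the 3000-token history was dropped
```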
How would you tackle this?
Right now, the `BaseLLM` [class](/src/llms/base.ts) defines an abstract method called `meta` that provides meta information about a given model. The response interface (`LLMMeta`) defines a single property called `tokenLimit`.
The problem is that `tokenLimit` alone is typically not enough, as providers usually subdivide limits further into the following:

- `input` (max input tokens) - for WatsonX, this field is called `max_sequence_length`.
- `output` (max generated tokens) - for WatsonX, this field is called `max_output_tokens`.
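One possible direction, purely as a sketch rather than a settled design: extend `LLMMeta` so providers can report both limits while keeping `tokenLimit` for backward compatibility.

```ts
// Sketch of a possible LLMMeta extension (not the current interface).
interface LLMMeta {
  tokenLimit: number; // existing field, kept for backward compatibility
  maxInputTokens?: number; // e.g. WatsonX max_sequence_length
  maxOutputTokens?: number; // e.g. WatsonX max_output_tokens
}

// What a WatsonX/Granite implementation could return, using the values above:
const graniteMeta: LLMMeta = {
  tokenLimit: 4096,
  maxInputTokens: 4096, // max_sequence_length
  maxOutputTokens: 8096, // max_output_tokens
};
```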
Because `TokenMemory` behavior heavily depends on the `tokenLimit` value, we must be sure that we are not throwing messages out because we have retrieved the wrong value from an LLM provider.

The solution to this issue is to figure out a better approach that plays nicely with `TokenMemory` and other practical usages.

Relates to #159 (Granite context window limit).