i-am-bee / bee-agent-framework

The framework for building scalable agentic applications.
https://i-am-bee.github.io/bee-agent-framework/
Apache License 2.0

LLM Meta - token limit definition #150

Open Tomas2D opened 1 week ago

Tomas2D commented 1 week ago

Right now, the [`BaseLLM` class](/src/llms/base.ts) defines an abstract method called `meta` that provides meta information about a given model. The response interface (`LLMMeta`) defines a single property called `tokenLimit`.
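For reference, the relevant contract looks roughly like this (a simplified sketch based on the description above; the actual signatures in `/src/llms/base.ts` may differ):

```ts
// Simplified sketch of the contract described above (not the exact source).
export interface LLMMeta {
  tokenLimit: number;
}

export abstract class BaseLLM {
  // Provider-specific subclasses return meta information about their model.
  abstract meta(): Promise<LLMMeta>;
}
```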

The problem is that tokenLimit alone is typically not enough, as providers usually subdivide limits further, for example into:

- the overall context window (maximum sequence length),
- the maximum input size, and
- the maximum number of new (output) tokens.

Because TokenMemory behavior heavily depends on the tokenLimit value, we must be sure that we are not throwing messages out because we have retrieved the wrong value from an LLM provider.

The solution to this issue is to figure out a better approach that plays nicely with TokenMemory and other practical usages.
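One possible direction (only a sketch; `contextWindow`, `maxInputTokens`, `maxOutputTokens`, and `inputBudget` are placeholder names, not a proposed API) would be to subdivide the meta information so that TokenMemory can derive a safe input budget instead of relying on a single number:

```ts
// Hypothetical widening of LLMMeta; property names are illustrative only.
export interface ExtendedLLMMeta {
  /** Total context window (maximum sequence length). */
  contextWindow: number;
  /** Maximum input tokens, if the provider reports it separately. */
  maxInputTokens?: number;
  /** Maximum generated tokens (max_new_tokens / max_output_tokens). */
  maxOutputTokens?: number;
}

/** A budget TokenMemory could use so the input never gets trimmed during inference. */
export function inputBudget(meta: ExtendedLLMMeta): number {
  return meta.maxInputTokens ?? meta.contextWindow - (meta.maxOutputTokens ?? 0);
}
```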

Relates to #159 (Granite context window limit)

michael-desmond commented 3 days ago

So tokenLimit would then be (max_sequence_length - max_output_tokens), ensuring that the input context does not get trimmed during inference?

Seems reasonable that concrete LLM implementations would need to override this method and provide a tokenLimit based on the LLM context window and max new tokens. Is there something else that I am not considering here?
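In code, that suggestion could look roughly like this (reusing the simplified sketch from the issue description; `GraniteLLM` and both constants are made-up illustrations):

```ts
// Hypothetical concrete implementation; class and constant names are made up.
const MAX_SEQUENCE_LENGTH = 4096; // Granite context window mentioned later in this thread
const MAX_NEW_TOKENS = 512;       // illustrative value for the generation budget

class GraniteLLM extends BaseLLM {
  async meta(): Promise<LLMMeta> {
    // tokenLimit = max_sequence_length - max_output_tokens, so the input
    // context does not get trimmed during inference.
    return { tokenLimit: MAX_SEQUENCE_LENGTH - MAX_NEW_TOKENS };
  }
}
```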

Tomas2D commented 2 days ago

Based on my observations, we can look at token limits from three different perspectives:

- the size of the context window (maximum sequence length),
- the maximum input size, and
- the maximum number of new tokens (max_new_tokens).

Example with WatsonX and Granite 3

For BAM, only the size of the context window is provided. To detect the max input size, you have to invoke an LLM call with `max_new_tokens: 9999999` to trigger an error saying `property 'max_new_tokens' must be <= XXXX`.

For Ollama, only the context window size is provided. Other values seem to be unlimited (no validation error).

For OpenAI, no values are provided; everything must be hard-coded. Limits can be obtained only from an error message.

For Groq, no values are provided (the API probably works similarly to OpenAI).
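For providers like BAM that only report the context window, the probing trick mentioned above could look roughly like this (a sketch only; the `generate` callback and the error format are assumptions, not the real client API):

```ts
// Sketch: probe a provider that only reports its context window by requesting
// an absurd number of new tokens and parsing the validation error it returns.
// `generate` is a placeholder for the real provider client call.
async function probeMaxNewTokensLimit(
  generate: (params: { max_new_tokens: number }) => Promise<unknown>,
): Promise<number | undefined> {
  try {
    await generate({ max_new_tokens: 9999999 });
  } catch (err) {
    // Expected message shape: "property 'max_new_tokens' must be <= XXXX"
    const match = String(err).match(/must be <= (\d+)/);
    if (match) return Number(match[1]);
  }
  return undefined; // the provider did not reveal a limit
}
```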

Now the question is which limit should be passed to TokenMemory and how TokenMemory should behave. In the context of Granite, let's say the memory currently holds 3000 tokens. If I now add a new message of 2000 tokens, TokenMemory is forced to remove some old messages (one or more, depending on their sizes) to stay under 4096 tokens (because we initialized the TokenMemory with that value). Regarding Bee Agent + TokenMemory, this could lead to a situation where the agent (runner) may trigger an error because of this check.
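To make the arithmetic concrete, here is a minimal sketch of the eviction behaviour described above (the real TokenMemory implementation differs):

```ts
// Minimal illustration of the eviction behaviour described above.
// Memory currently holds 3000 tokens; the limit was initialized to 4096.
const tokenLimit = 4096;
let usedTokens = 3000;
const messages: { tokens: number }[] = [{ tokens: 1200 }, { tokens: 1800 }];
const newMessage = { tokens: 2000 };

// Adding the new message would exceed the limit (3000 + 2000 > 4096),
// so old messages are dropped (oldest first) until it fits.
while (usedTokens + newMessage.tokens > tokenLimit && messages.length > 0) {
  const removed = messages.shift()!;
  usedTokens -= removed.tokens;
}
messages.push(newMessage);
usedTokens += newMessage.tokens;
// Result: the 1200-token message was evicted; usedTokens is now 3800.
```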

How would you tackle this?