microsoft / semantic-kernel

Integrate cutting-edge LLM technology quickly and easily into your apps
https://aka.ms/semantic-kernel

Best practices to control the size of the ChatHistory to avoid exceeding a model's maximum context length #6155

Open · markwallace-microsoft opened this issue 4 months ago

markwallace-microsoft commented 4 months ago

Here's an example of the type of error a developer can run into:

```
Unable to generate bot response. Details: Error: 500: Internal Server Error
Microsoft.SemanticKernel.KernelException: StreamResponseToClientAsync failed.
 ---> Microsoft.SemanticKernel.HttpOperationException: This model's maximum context length is 8192 tokens. However, you requested 8677 tokens (7023 in the messages, 630 in the functions, and 1024 in the completion). Please reduce the length of the messages, functions, or completion.
Status: 400 (model_error)
ErrorCode: context_length_exceeded
Content: {
  "error": {
    "message": "This model's maximum context length is 8192 tokens. However, you requested 8677 tokens (7023 in the messages, 630 in the functions, and 1024 in the completion). Please reduce the length of the messages, functions, or completion.",
    "type": "invalid_request_error",
    "param": "messages",
    "code": "context_length_exceeded"
  }
}
Headers: Access-Control-Allow-Origin: REDACTED apim-request-id: REDACTED x-ratelimit-remaining-requests: REDACTED...
```

Some options to mitigate this:

  1. Examples which show how to trim the chat history dynamically, e.g. by setting a maximum number of messages (a minimal sketch follows this list).
  2. Examples which show how to summarise context information before it is inserted into a prompt.
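
Below is a minimal sketch of option 1 in C#, using the `ChatHistory` type from `Microsoft.SemanticKernel.ChatCompletion`; the helper name `TrimToLastMessages` is hypothetical, and a production version would also need to keep function-call/result message pairs together:

```csharp
using System;
using Microsoft.SemanticKernel.ChatCompletion;

// Hypothetical helper: keep the first message (typically the system prompt)
// plus the N most recent messages, dropping everything in between.
static ChatHistory TrimToLastMessages(ChatHistory history, int maxMessages)
{
    var trimmed = new ChatHistory();
    if (history.Count == 0) return trimmed;

    trimmed.Add(history[0]); // preserve the system message
    int start = Math.Max(1, history.Count - maxMessages);
    for (int i = start; i < history.Count; i++)
    {
        trimmed.Add(history[i]);
    }
    return trimmed;
}
```
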
stephentoub commented 4 months ago

Beyond samples, I think we should have some built-in support for this, e.g. an interface that can be queried for to reduce the size of the chat history, with some implementations readily available: ones that trim to a maximum number of tokens or messages, ones that summarize and replace the previous history with just the most salient points (sketched below), ones that remove less important messages and keep only the important ones, etc.

This (and possibly other features) might drive the need for taking a dependency on a tokenizer; we'll want to think that through, in conjunction with the abstraction for a tokenizer in Microsoft.ML.Tokenizers. cc: @tarekgh (Tarek, and @ericstj, we should think about whether the Tokenizer abstraction should be moved to an abstractions library... today, in order to get the abstraction, you also need to pay to get all the implementations.)
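A rough sketch of the "summarize and replace" strategy, using the existing `IChatCompletionService` to produce the summary; the helper name and the prompt wording are illustrative, not a proposed API:

```csharp
using System.Linq;
using System.Threading.Tasks;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

// Condense everything between the system prompt and the most recent
// messages into a single summary message.
static async Task<ChatHistory> SummarizeOlderMessagesAsync(
    IChatCompletionService service, ChatHistory history, int keepRecent)
{
    if (history.Count <= keepRecent + 1) return history;

    var older = history.Skip(1).Take(history.Count - 1 - keepRecent);

    var summaryPrompt = new ChatHistory();
    summaryPrompt.AddSystemMessage(
        "Summarize the following conversation, keeping only the most salient points.");
    summaryPrompt.AddUserMessage(
        string.Join("\n", older.Select(m => $"{m.Role}: {m.Content}")));

    ChatMessageContent summary = await service.GetChatMessageContentAsync(summaryPrompt);

    var reduced = new ChatHistory();
    reduced.Add(history[0]); // original system message
    reduced.AddAssistantMessage($"Summary of earlier conversation: {summary.Content}");
    foreach (var message in history.TakeLast(keepRecent)) reduced.Add(message);
    return reduced;
}
```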

tarekgh commented 4 months ago

> today in order to get the abstraction you also need to pay to get all the implementations

In what situation would the abstraction be necessary without requiring one of the specific tokenizers? The scenario mentioned doesn't seem to clarify this for me.

stephentoub commented 4 months ago

> > today in order to get the abstraction you also need to pay to get all the implementations
>
> In what situation would the abstraction be necessary without requiring one of the specific tokenizers? The scenario mentioned doesn't seem to clarify this for me.

I'll turn the question around and ask: what's the reason for having the Tokenizer abstraction at all if every use would require a specific tokenizer? :)

Imagine for this issue there were an `IChatHistoryReducer` with a method like `ChatHistory Reduce(ChatHistory history, int tokenLimit, Tokenizer tokenizer)`, where implementations of `IChatHistoryReducer` would need to produce a new `ChatHistory` containing no more than `tokenLimit` tokens. They'd need to be able to count tokens according to whatever tokenization algorithm was desired, and thus would need to accept a tokenizer.
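
A sketch of what that could look like; `IChatHistoryReducer` does not exist in Semantic Kernel today, and the `CountTokens` call assumes the counting API on `Microsoft.ML.Tokenizers.Tokenizer`:

```csharp
using System.Collections.Generic;
using Microsoft.ML.Tokenizers;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

// Hypothetical interface from the comment above; not part of SK today.
public interface IChatHistoryReducer
{
    ChatHistory Reduce(ChatHistory history, int tokenLimit, Tokenizer tokenizer);
}

// One possible implementation: walk backwards from the newest message,
// keeping messages until the token budget is exhausted. A real reducer
// would also account for per-message role overhead and function definitions.
public sealed class TruncatingChatHistoryReducer : IChatHistoryReducer
{
    public ChatHistory Reduce(ChatHistory history, int tokenLimit, Tokenizer tokenizer)
    {
        var kept = new List<ChatMessageContent>();
        int used = 0;

        for (int i = history.Count - 1; i >= 0; i--)
        {
            int cost = tokenizer.CountTokens(history[i].Content ?? string.Empty);
            if (used + cost > tokenLimit) break;
            used += cost;
            kept.Add(history[i]);
        }

        kept.Reverse(); // restore chronological order
        var reduced = new ChatHistory();
        foreach (var message in kept) reduced.Add(message);
        return reduced;
    }
}
```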

It's a similar need for something like `TextChunker`. Today its methods take a delegate to do token counting, but in practice every caller just points that delegate at a tokenizer's counting method. It'd be nice if overloads on `TextChunker` could take a `Tokenizer` directly, for example.
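
For illustration, here's how the delegate-based pattern looks today (assuming `TiktokenTokenizer.CreateForModel` from Microsoft.ML.Tokenizers; `TextChunker` in `Microsoft.SemanticKernel.Text` is marked experimental):

```csharp
using System.Collections.Generic;
using Microsoft.ML.Tokenizers;
using Microsoft.SemanticKernel.Text;

string text = "Some long document to split into token-bounded lines...";

// TextChunker currently accepts a token-counting delegate; a Tokenizer
// overload would let callers pass the tokenizer itself instead.
Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4");

List<string> lines = TextChunker.SplitPlainTextLines(
    text,
    maxTokensPerLine: 128,
    tokenCounter: s => tokenizer.CountTokens(s));
```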

tarekgh commented 4 months ago

Thanks for the thoughts @stephentoub.

I have some experience with Encoding in the framework. In .NET Core, we attempted to separate most of the concrete encodings (those with significant data) into their own libraries. We retained only the abstraction and a few concrete encodings that we believed would be commonly used, such as UTF-8. However, we found that many users wanted access to the other encodings, leading us to include these concrete encodings by default. I'm asking to gain insight into whether we might encounter similar situations with the Tokenizers, or if we anticipate that many libraries will rely on the abstraction without requiring real concrete implementations.

yuichiromukaiyama commented 1 month ago

I'm currently facing this exact issue. From my understanding, if we want to control the token count, we need to use a filter, as described at the following URL.

https://github.com/microsoft/semantic-kernel/issues/6572

However, while this works for `kernel.invoke`, I believe it doesn't apply when executing something like `service.get_chat_message_contents(history, settings, kernel=kernel, arguments=KernelArguments(settings=settings))`. Is there a way to monitor and adjust the token count when it exceeds the model's acceptable limit, even when using `service.get_chat_message_contents` with a `ChatHistory`?

At the moment, I'm considering creating a class that inherits from ChatHistory and deletes messages when a certain token count is exceeded.
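
A minimal C# sketch of that idea (the analogous approach works with the Python SDK's `ChatHistory`); the class and method names are hypothetical, token counting via `Microsoft.ML.Tokenizers` is an assumption, and an explicit add method is used rather than overriding `Add`:

```csharp
using System.Linq;
using Microsoft.ML.Tokenizers;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

// Hypothetical subclass: evicts the oldest non-system messages once the
// running token count exceeds a fixed budget.
public sealed class BoundedChatHistory : ChatHistory
{
    private readonly Tokenizer _tokenizer;
    private readonly int _maxTokens;

    public BoundedChatHistory(Tokenizer tokenizer, int maxTokens)
    {
        _tokenizer = tokenizer;
        _maxTokens = maxTokens;
    }

    public void AddWithinBudget(ChatMessageContent message)
    {
        Add(message);

        // Drop the oldest message after the (assumed) system prompt at
        // index 0 until the history fits the budget again.
        while (Count > 1 && TotalTokens() > _maxTokens)
        {
            RemoveAt(1);
        }
    }

    private int TotalTokens() =>
        this.Sum(m => _tokenizer.CountTokens(m.Content ?? string.Empty));
}
```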

Note: However, I'm still facing a challenge: as we're increasingly using multiple services (`kernel.get_service`) together, we need to adjust the maximum token count based on the service being used. Relying solely on `ChatHistory` doesn't fully address this issue, as it doesn't account for the varying token limits across different services...