simonw opened this issue 1 year ago
This just came up again for the `llm chat` command:
Single biggest unanswered question, which goes for the existing `llm -c` conversation mode as well: what happens if the conversation gets longer than the context window? I assume different models break in different ways. But how to fix this? Two options:
- Prevent the conversation from continuing past that point
- Truncate the conversation's start (though keep injecting the system prompt) to fit
But in both cases I need to detect when this happens. I could try and catch the error and retry, but that's dependent on knowing what the error looks like.
I could count tokens and predict the error will occur, but I need to have rock-solid token counting for that (which I can get using `tiktoken` for the OpenAI models, but no idea how I'd get it for other models in plugins).
Maybe part of the answer here is introducing a new standard exception - `llm.PromptTooLong` perhaps - and then updating all the plugins to raise that exception.
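Something like this is roughly what that detection could look like for the OpenAI models - the `PromptTooLong` class and the hard-coded 4097 limit are purely illustrative assumptions, not anything `llm` ships today:

```python
import tiktoken


class PromptTooLong(Exception):
    """Hypothetical standard exception that plugins could raise."""


def check_prompt_fits(messages, model="gpt-3.5-turbo", context_limit=4097):
    # Rough count: ignores the few extra tokens the chat format adds per message.
    encoding = tiktoken.encoding_for_model(model)
    total = sum(len(encoding.encode(m["content"])) for m in messages)
    if total > context_limit:
        raise PromptTooLong(
            f"{total} tokens exceeds the {context_limit} token limit for {model}"
        )
    return total
```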
There's a really fancy version of this, where any time you run low on tokens you get the LLM to summarize the previous conversation history in order to condense it.
Not sure if that should be a feature of LLM directly, but it's pretty interesting.
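A minimal sketch of that idea against the `llm` Python API, assuming a hard-coded `tiktoken` count plus an invented threshold and summarization prompt - a real feature would need per-model token counting from the plugins:

```python
import llm
import tiktoken

CONTEXT_LIMIT = 4097      # assumed limit, purely for illustration
HEADROOM = 1000           # condense once fewer than this many tokens remain

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
model = llm.get_model("gpt-3.5-turbo")


def maybe_condense(history_text):
    """Replace the transcript with an LLM-written summary when space runs low."""
    used = len(encoding.encode(history_text))
    if CONTEXT_LIMIT - used > HEADROOM:
        return history_text  # plenty of room left, keep the full transcript
    summary = model.prompt(
        "Summarize this conversation so far, keeping names, facts and decisions:\n\n"
        + history_text
    )
    return "Summary of the conversation so far:\n" + summary.text()
```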
I'm using the `llm chat` prototype to see how much it takes to break Llama 2 13B, which has a documented 4096 token limit.
I'm surprised at how much space that is. I've been getting the model to tell me jokes, tell stories etc. and I'm still only at about 2713 tokens - counting them like this:

```
llm logs -c | ttok
```

That's the GPT-4 tokenizer, not the Llama 2 one, but I imagine the number is pretty close.
I'm at 4492 now and it's still going?
Up to 6563 now. I'm suspecting Llama 2 (with MLC) may be truncating the input for me.
Here's the full log of that conversation: https://gist.github.com/simonw/603dba123237a6e4b36e9bc5bc70b583
Yeah, Llama 2 MLC doesn't seem to have a limit. I piped in a 27000 token CSV:

```
cat simon-wordcamp.csv | llm -m llama2 --system 'summary'
```
The response cut off, but it clearly caught the end of the script:

> The transcript you provided is a video of a Q&A session with a group of people discussing various topics related to artificial intelligence (AI) and machine learning. The speakers are discussing their experiences and perspectives on the current state of AI research, including the limitations of current models and the need for more recent training data. They also discuss the use of retrieval augmented generation as a state-of-the-art technique for factual questions, and the potential for directing what the AI is indexing.
>
> Here are some key points that can be gleaned from the transcript:
`gpt-3.5-turbo`, on the other hand:

```
cat simon-wordcamp.csv | llm -m gpt-3.5-turbo --system 'summary'
Error: This model's maximum context length is 4097 tokens. However, your messages resulted in 29753 tokens. Please reduce the length of the messages.
```
I like the idea of truncating middle text: keeping the first prompt(s) and as much of the tail as will still fit in the context (see the sketch below).
In a normal chat, "it keeps going" is real and it feels like a lot of room. But if the conversation starts with an info dump (command-line RAG, `cat`-ing in a text file, or a long `!multi ... !end`), then the context runs out just when the conversation gets interesting.
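A minimal sketch of that middle-truncation idea, keeping the opening prompt(s) and as much of the tail as fits - the `tiktoken` counting and the 4097 limit are stand-ins, since each model would need its own tokenizer and limit:

```python
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")


def count(message):
    return len(encoding.encode(message["content"]))


def truncate_middle(messages, limit=4097, keep_first=2):
    """Keep the first `keep_first` messages, then as many trailing messages as fit."""
    head = messages[:keep_first]
    budget = limit - sum(count(m) for m in head)
    tail = []
    # Walk backwards from the newest message, adding messages while they fit.
    for message in reversed(messages[keep_first:]):
        cost = count(message)
        if cost > budget:
            break
        tail.append(message)
        budget -= cost
    return head + list(reversed(tail))
```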
Using the Python API I've experimented with versions that summarize the whole past (N-1) conversation as future input context, a "history so far", and even this naive approach works to some degree. It never hits that modern buffer overflow of the context running out. The bot knows what's going on and how we got here. But it doesn't know what I mean if the follow-up is "that sounds neat, can you make it shorter" - that is, something referring to the exact previous message - because in my naive implementation the previous message is the whole history so far. Yet it's surprisingly effective at carrying a conversation.
I feel like the `chat` side of `llm` should take an opinionated stance on context management → chat mode as pure magic. What's impossible then is knowing what the context size is, because that is near-unknown for a gguf, fully unknown for an OpenAI-compatible REST API, and only readable from a .json file for gptq/awq.
Using something like -1000, -2000 or -3000 tokens as the history → summary cutoff point might produce the right effect for all future models: ChatGPT-like long conversations that over time accumulate hallucinations, but usually not errors. It's a bit of a hack, but the result is magic. The status quo is "error"; this alternative at least keeps going and stays fully aware of many messages of the past.
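Sketched out, that cutoff is just a split at a fixed token distance from the end: everything older gets summarized, the tail is kept verbatim. The -2000 default and the tokenizer here are illustrative guesses:

```python
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")


def split_for_summary(messages, keep_last_tokens=2000):
    """Return (older, recent): summarize `older`, send `recent` verbatim."""
    budget = keep_last_tokens
    recent = []
    for message in reversed(messages):
        cost = len(encoding.encode(message["content"]))
        if cost > budget:
            break
        recent.append(message)
        budget -= cost
    recent.reverse()
    older = messages[: len(messages) - len(recent)]
    return older, recent
```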
And that gives permission for the non-chat mode to be very absolute - "what you send is what you get", errors and all. Seeing the low-level errors is important when testing large context windows manually with large inputs. The docs could guide users to use chat when they want magic, and non-chat when they want the details.
Together those would cover both ends of the context management spectrum.
Python API users can choose to mimic what `chat` does, anywhere in between those ends. They can use conversation step -1 or -5 or whatever they think is the correct cutoff point for past summarization, depending on their RAG chunk size and the model they know - which they can do themselves, following the CLI code as a reference. I don't think the Python API needs more than a documented way to point to "conversations from N-3 and before".
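For example, something along these lines with the current Python API - `conversation.responses` and `response.text()` exist, but the slicing into "last three exchanges verbatim, everything older summarized" is the caller's own bookkeeping, and the attribute access on older responses is my assumption of how you'd read the logged prompts back:

```python
import llm

model = llm.get_model("gpt-3.5-turbo")
conversation = model.conversation()

# ... after a number of conversation.prompt(...) calls ...

KEEP_LAST = 3  # exchanges kept verbatim; "N-3 and before" gets summarized

older = conversation.responses[:-KEEP_LAST]
recent = conversation.responses[-KEEP_LAST:]

history_text = "\n\n".join(
    f"User: {r.prompt.prompt}\nAssistant: {r.text()}" for r in older
)
summary = model.prompt("Summarize this conversation history:\n\n" + history_text)

# The summary plus the recent exchanges verbatim would be the context
# for the next prompt.
next_context = summary.text() + "\n\n" + "\n\n".join(
    f"User: {r.prompt.prompt}\nAssistant: {r.text()}" for r in recent
)
```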
That's from the perspective of using chat, non-chat and the Python API with a lot of models. What a great little tool this is.
Originally posted by @simonw in https://github.com/simonw/llm/issues/65#issuecomment-1616021169