simonw / llm

Access large language models from the command-line
https://llm.datasette.io
Apache License 2.0

Figure out truncation strategy for continue conversation mode #73

Open simonw opened 1 year ago

simonw commented 1 year ago

I'm still not clear on the best way to truncate messages in continue mode. For now I'm going to leave that and allow the model to return an error, but it would be good to have a strategy for automatic truncation later on.

Originally posted by @simonw in https://github.com/simonw/llm/issues/65#issuecomment-1616021169

simonw commented 1 year ago

This just came up again for the llm chat command:

Single biggest unanswered question, which goes for the existing llm -c conversation mode as well: what happens if the conversation gets longer than the context window?

I assume different models break in different ways. But how to fix this? Two options:

  1. Prevent the conversation from continuing past that point
  2. Truncate the conversation's start (though keep injecting the system prompt) to fit

But in both cases I need to detect when this happens. I could try to catch the error and retry, but that depends on knowing what the error looks like.

I could count tokens and predict the error will occur, but I need to have rock-solid token counting for that (which I can get using tiktoken for the OpenAI models, but no idea how I'd get it for other models in plugins).
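
For the OpenAI models that counting could look something like this with tiktoken - a rough sketch, where the 4097 limit is the gpt-3.5-turbo figure from the error further down and would need to come from per-model metadata in practice:

    import tiktoken

    def count_tokens(text, model="gpt-3.5-turbo"):
        # tiktoken knows which tokenizer each OpenAI model name uses
        encoding = tiktoken.encoding_for_model(model)
        return len(encoding.encode(text))

    # Predict the error before sending - 4097 is the gpt-3.5-turbo limit
    def would_overflow(prompt_text, limit=4097):
        return count_tokens(prompt_text) > limit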

Maybe part of the answer here is introducing a new standard exception - llm.PromptTooLong perhaps - and then updating all the plugins to raise that exception.
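
Something like this, purely as a sketch - neither the exception nor a context_window value exists in llm today:

    # Hypothetical shared exception that plugins could raise
    class PromptTooLong(Exception):
        """Raised when a prompt exceeds the model's context window."""

    # A plugin could then raise it before (or instead of) hitting the API error
    def check_prompt_length(token_count, context_window):
        if token_count > context_window:
            raise PromptTooLong(
                f"Prompt is {token_count} tokens, model accepts {context_window}"
            )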

simonw commented 1 year ago

There's a really fancy version of this, where any time you run low on tokens you get the LLM to summarize the previous conversation history in order to condense it down.

Not sure if that should be a feature of LLM directly, but it's pretty interesting.
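
A minimal sketch of that idea using the llm Python API, assuming the caller supplies its own token counter and an arbitrary, made-up 3000 token threshold:

    import llm

    def condense_if_needed(history_text, count_tokens, limit=3000):
        # count_tokens is supplied by the caller (e.g. via tiktoken); the
        # limit is an arbitrary placeholder, not anything llm knows about
        if count_tokens(history_text) <= limit:
            return history_text
        model = llm.get_model("gpt-3.5-turbo")
        response = model.prompt(
            "Summarize this conversation as briefly as possible, keeping "
            "any facts needed to continue it:\n\n" + history_text
        )
        return response.text()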

simonw commented 1 year ago

I'm using the llm chat prototype to see how much it takes to break Llama 2 13B, which has a documented 4096 token limit.

I'm surprised at how much space that is. I've been getting the model to tell me jokes, tell stories and so on, and I'm still only at about 2713 tokens - counting them like this:

llm logs -c | ttok

That's the GPT-4 tokenizer, not the Llama 2 one, but I imagine the number is pretty close.

simonw commented 1 year ago

I'm at 4492 now and it's still going?

simonw commented 1 year ago

Up to 6563 now. I suspect Llama 2 (with MLC) may be truncating the input for me.

simonw commented 1 year ago

Here's the full log of that conversation. https://gist.github.com/simonw/603dba123237a6e4b36e9bc5bc70b583

simonw commented 1 year ago

Yeah, Llama 2 MLC doesn't seem to have a limit. I piped in a 27000 token CSV:

cat simon-wordcamp.csv | llm -m llama2 --system 'summary' 

The response was cut off, but it clearly caught the end of the transcript:

The transcript you provided is a video of a Q&A session with a group of people discussing various topics related to artificial intelligence (AI) and machine learning. The speakers are discussing their experiences and perspectives on the current state of AI research, including the limitations of current models and the need for more recent training data. They also discuss the use of retrieval augmented generation as a state-of-the-art technique for factual questions, and the potential for directing what the AI is indexing.

Here are some key points that can be gleaned from the transcript:

simonw commented 1 year ago

gpt-3.5-turbo on the other hand:

cat simon-wordcamp.csv | llm -m gpt-3.5-turbo --system 'summary'

Error: This model's maximum context length is 4097 tokens. However, your messages resulted in 29753 tokens. Please reduce the length of the messages.

garyblankenship commented 1 year ago

I like the idea of truncating the middle of the text: keep the first prompt(s), plus as much of the most recent conversation as will still fit in the context.
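
Something along these lines, as a sketch - the messages list and count_tokens callable are assumptions about the caller's data, not llm's actual internals:

    def truncate_middle(messages, count_tokens, limit):
        # Keep the first message, then work backwards from the newest,
        # adding messages while they still fit in the remaining budget
        first, rest = messages[0], messages[1:]
        budget = limit - count_tokens(first)
        kept = []
        for message in reversed(rest):
            cost = count_tokens(message)
            if cost > budget:
                break
            kept.append(message)
            budget -= cost
        return [first] + list(reversed(kept))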

vividfog commented 1 year ago

In a normal chat, "it keeps going" is real; the window feels like a lot. But if the conversation starts with an info dump (command-line RAG, cat-ing in a text file, or a long !multi ... !end), then the context runs out just when the conversation gets interesting.

Using the Python API I've experimented with versions that summarize the whole past (N-1) conversation into a "history so far" and feed that in as future input context, and even this naive approach works to some degree. It never hits this modern buffer overflow of the context running out. The bot knows what's going on and how we got here. But it doesn't know what I mean if the follow-up is "that sounds neat, can you make it shorter", i.e. something referring to the exact previous message, because in my naive implementation the previous message is the whole history so far. Yet it's surprisingly effective at carrying a conversation.
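
Roughly, the naive version looks like this with the llm Python API (the history list of (prompt, reply) strings is the caller's own bookkeeping, not a library feature):

    import llm

    def prompt_with_summarized_history(model, history, new_prompt):
        # history: list of (user_prompt, model_reply) strings kept by the caller
        transcript = "\n".join(
            f"User: {user}\nAssistant: {reply}" for user, reply in history
        )
        summary = model.prompt(
            "Summarize this conversation so far:\n\n" + transcript
        ).text()
        # The summary stands in for everything, including the previous message,
        # which is why "can you make it shorter" style follow-ups get lost
        return model.prompt(new_prompt, system="Conversation so far: " + summary)

Here model would be whatever llm.get_model(...) returns.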

I feel like the chat side of llm should take an opinionated stance on context management → chat mode as pure magic. What's impossible then is knowing what the context size is, because it's near-unknown for a gguf, fully unknown for an OpenAI-compatible REST API, and only readable from a .json file for gptq/awq.

Using something like -1000, -2000 or -3000 tokens as the history → summary cutoff point might produce the right effect for all future models: ChatGPT-like long conversations. Over time it will accumulate hallucinations, but usually not errors. It's a bit of a hack, but the result is magic. The status quo is an error; this alternative at least keeps going and stays fully aware of many messages from the past.

And that gives permission for the non-chat mode to be completely literal: "what you send is what you get", errors and all. Seeing the low-level errors is important when manually testing large context windows with large inputs. The docs could guide users to use chat when they want magic, and non-chat when they want the details.

Together those would cover both ends of the context management spectrum.

Python API users can choose to mimic what chat does in between those two ends. They can use conversation step -1 or -5 or whatever they think is the correct cutoff point for past summarization, depending on their RAG chunk size and the model they know, and they can do that themselves, following the CLI code as a reference. I don't think the Python API needs more than a documented way to point to "conversations from N-3 and before".
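
For example, something along these lines with the current Python API, assuming conversation.responses exposes the list of past Response objects (the -3 cutoff is just the example from above, and what to do with the older text is up to the caller):

    import llm

    model = llm.get_model("gpt-3.5-turbo")
    conversation = model.conversation()
    # ... after several conversation.prompt(...) calls ...

    older = conversation.responses[:-3]   # candidates for summarization
    recent = conversation.responses[-3:]  # kept verbatim
    older_text = "\n".join(r.text() for r in older)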

That's from the perspective of using chat, non-chat and the Python API with a lot of models. What a great little tool this is.