mit-han-lab / streaming-llm

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
https://arxiv.org/abs/2309.17453
MIT License

How to generate longer token streams? #27

Open GenTxt opened 9 months ago

GenTxt commented 9 months ago

Have everything running on Python 3.10 under Ubuntu 22.04 with 2x 24 GB GPUs. Tested the original and revised versions of 'mt_bench.jsonl' and the output is good with a 70B 4-bit GPTQ model.

Trying to increase the number of tokens streamed, but it appears fixed for each generation.

Edited 'run_streaming_llama.py', line 61:

def streaming_inference(model, tokenizer, prompts, kv_cache=None, max_gen_len=10000):

but the output length is similar to the default of 2000.

Edited 'recent_size=512' in kv_cache.py to similarly large values, but the output length remains the same.

Would appreciate options and/or edits required to generate 10000+ tokens

Cheers
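
For reference, the two settings being edited above are the max_gen_len default in run_streaming_llama.py and the recent-window size of the KV cache. Below is a minimal sketch of the cache side, assuming the repo's streaming_llm package is importable; the class and parameter names follow streaming_llm/kv_cache.py, but the values are illustrative rather than recommendations.

# Sketch: build the rolling KV cache with a larger recent window.
# StartRecentKVCache keeps 'start_size' attention-sink tokens at the front of the
# cache plus the last 'recent_size' tokens; everything in between is evicted.
from streaming_llm.kv_cache import StartRecentKVCache

kv_cache = StartRecentKVCache(
    start_size=4,        # attention-sink tokens (the paper uses 4)
    recent_size=4096,    # sliding window; the class default in kv_cache.py is 512
)

Passing max_gen_len explicitly at the call site (e.g., streaming_inference(model, tokenizer, prompts, kv_cache, max_gen_len=10000)) avoids editing the default in place; note that it only raises the cap on generated tokens, so generation can still stop earlier for other reasons.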

iamhappytoo commented 9 months ago

I have some related questions here. Thanks in advance for reading and answering them :)

Q1: To confirm, if the context window of a given model is not expanded by StreamingLLM (as in FAQ #2), does that mean the upper limit on the length of a response to an individual prompt equals the context length of the model being used, even with StreamingLLM? For example, longchat-7b (context length 16k) seems to generate much longer responses than llama-2-7b-chat-hf (context length 4096), which hangs at around 1000+ tokens.

Q2: If the above understanding is correct (i.e., the response length for a single prompt is still limited by the context length even with StreamingLLM), would it be helpful to use multiple follow-up questions to prolong the response / split the input? E.g., could StreamingLLM be a good tool for generating long code that exceeds the context length across multiple prompts?

Q3: This question relates to the long-term memory of inputs (FAQ 3): when extending the output with follow-up questions, up to how many follow-up questions is the first prompt still taken into account when generating the response (given that long-term memory is not applicable, as in FAQ 3)?

Q4: Similarly, if the input is split into several follow-up questions, would StreamingLLM be capable of achieving something like RAG, e.g., considering earlier prompts when generating the latest response? If so, how many earlier prompts would it consider?

Q5: Are there thresholds to consider when choosing the number of follow-up prompts? I noticed that mt_bench.jsonl mostly has two turns. Does that mean StreamingLLM mostly remembers the (n-1)th prompt as the earliest input when generating the nth response? Thank you so much for answering these questions!

Guangxuan-Xiao commented 9 months ago

Hello,

You're observing this because our text generation function terminates once an EOS (end-of-sequence) token is produced. You can see this behavior in the code here: run_streaming_llama.py, Line 54. Given the nature of our questions, the model doesn't always need to produce long answers.

For generating longer texts, I recommend referring to our perplexity evaluation code, located here: eval_long_ppl.py.

Guangxuan
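
The behavior described above corresponds to a greedy decoding loop that breaks as soon as the model emits its end-of-sequence token, no matter how large max_gen_len is. A simplified, generic sketch of such a loop follows; this is not the repo's exact code (the real loop is in run_streaming_llama.py), and the stop_on_eos flag is a hypothetical addition showing one way to force longer continuations.

# Generic greedy decoding loop illustrating the early stop on EOS.
import torch

@torch.no_grad()
def greedy_generate(model, tokenizer, input_ids, max_gen_len=2000, stop_on_eos=True):
    past_key_values = None
    generated = []
    for _ in range(max_gen_len):
        outputs = model(input_ids=input_ids,
                        past_key_values=past_key_values,
                        use_cache=True)
        past_key_values = outputs.past_key_values
        next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        if stop_on_eos and next_token.item() == tokenizer.eos_token_id:
            break  # this break, not max_gen_len, is what usually ends a response
        generated.append(next_token.item())
        input_ids = next_token
    return tokenizer.decode(generated)

Setting stop_on_eos=False (or otherwise suppressing the EOS token) forces generation to continue up to max_gen_len, though quality may degrade once the model has passed its natural stopping point.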

iamhappytoo commented 9 months ago

Hello Guangxuan,

Thank you so much for the helpful answer! I see, that makes sense. The hangs occurred when I set max_gen_len too large (10000, with recent_size = 2000); after changing max_gen_len to a smaller value (1000), less than recent_size (2000), the hangs no longer occur.

Also, while using StreamingLLM I find that the long-term memory seems to be determined by the ratio of recent_size to max_gen_len: with recent_size = 4000 and max_gen_len = 1000, the number of chat turns it can remember appears to be 4000/1000 - 1 = 3, due to the cache eviction design. This is a very helpful feature for me. I tried several (>2) follow-up questions and it works great at remembering the previous context when generating the current response. Thank you!

Assuming there is enough space for the cache to support a very large recent_size, so that the number of chat turns kept in the cache can keep growing, do you think there is any upper limit on StreamingLLM's ability to use the cached chat history as "long-term memory" when generating the current response?

Many thanks!

Zhaoxin
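
To illustrate the eviction behavior Zhaoxin describes above, here is a minimal sketch of a start-plus-recent cache policy, modeled loosely on StartRecentKVCache in streaming_llm/kv_cache.py. The helper below is hypothetical, and a [batch, heads, seq_len, head_dim] layout is assumed for the cached keys and values.

# Hypothetical helper showing why older turns drop out of StreamingLLM's memory:
# only the first 'start_size' (attention-sink) tokens and the last 'recent_size'
# tokens survive each eviction; everything in between is discarded.
import torch

def evict_middle(past_key_values, start_size=4, recent_size=2000, seq_dim=2):
    seq_len = past_key_values[0][0].size(seq_dim)
    if seq_len <= start_size + recent_size:
        return past_key_values  # cache still fits; nothing to evict

    def keep(t):
        sinks = t.narrow(seq_dim, 0, start_size)                        # attention sinks
        recent = t.narrow(seq_dim, seq_len - recent_size, recent_size)  # newest tokens
        return torch.cat([sinks, recent], dim=seq_dim)

    return [(keep(k), keep(v)) for k, v in past_key_values]

With recent_size = 4000 and roughly 1000 tokens per turn, about three earlier turns fit in the window alongside the current one, matching the 4000/1000 - 1 estimate above; anything older than the recent window (apart from the sink tokens) has been evicted and cannot influence the response, no matter how long the stream runs.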