mit-han-lab / streaming-llm

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
https://arxiv.org/abs/2309.17453
MIT License

Is the placement of "kv_cache.evict_for_space" wrong? #48

Closed DavideHe closed 9 months ago

DavideHe commented 9 months ago

As your paper describes, StreamingLLM works during generation, but "kv_cache.evict_for_space" is called in "streaming_inference", not in "greedy_generate".

if "kv_cache.evict_for_space" in "streaming_inference" , is there different with dense_attention like position extrapolation ?

DavideHe commented 9 months ago

I think the code "past_key_values = kv_cache(past_key_values)" should be added at the end of the "for _ in range(max_gen_len - 1):" loop.
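
A minimal sketch of that suggestion, assuming the loop structure of greedy_generate in streaming_llm/utils.py; the optional kv_cache argument is added here only for illustration and is not part of the original signature:

```python
import torch


@torch.no_grad()
def greedy_generate(model, tokenizer, input_ids, past_key_values, max_gen_len, kv_cache=None):
    # Encode the prompt and pick the first generated token.
    outputs = model(input_ids=input_ids, past_key_values=past_key_values, use_cache=True)
    past_key_values = outputs.past_key_values
    pred_token_idx = outputs.logits[:, -1, :].argmax(dim=-1).unsqueeze(1)
    generated_ids = [pred_token_idx.item()]

    for _ in range(max_gen_len - 1):
        outputs = model(input_ids=pred_token_idx, past_key_values=past_key_values, use_cache=True)
        past_key_values = outputs.past_key_values
        pred_token_idx = outputs.logits[:, -1, :].argmax(dim=-1).unsqueeze(1)
        generated_ids.append(pred_token_idx.item())

        # Proposed token-by-token eviction: keep the attention-sink tokens plus
        # the recent window after every generated token, so the cache never
        # grows past its budget (kv_cache is the cache object returned by
        # enable_streaming_llm).
        if kv_cache is not None:
            past_key_values = kv_cache(past_key_values)

    print(tokenizer.decode(generated_ids, skip_special_tokens=True))
    return past_key_values
```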

Guangxuan-Xiao commented 9 months ago

I appreciate your interest!

When implementing StreamingLLM, there are two primary ways to evict tokens from the kv_cache:

  1. Token-by-Token Eviction: This approach aligns with how we evaluate perplexity, as detailed in our paper. The eviction can be done as past_key_values = kv_cache(past_key_values).

  2. Batched Token Eviction: Alternatively, you can choose to evict an entire chunk of tokens in a single step by calling kv_cache.evict_for_space. This method is beneficial as it enables the batched encoding of prompts, which in turn can lead to faster inference.

The placement of kv_cache.evict_for_space in the streaming_inference rather than greedy_generate is intentional and suited for the aforementioned batched token eviction method.
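
For reference, a minimal sketch of the batched-eviction pattern, modeled on streaming_inference in the repo's example script; the evict_for_space call signature and variable names here are assumptions for illustration, and greedy_generate is the sketch from the comment above:

```python
def streaming_inference(model, tokenizer, prompts, kv_cache, max_gen_len=1000):
    past_key_values = None
    for prompt in prompts:
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
        seq_len = input_ids.shape[1]

        # Batched token eviction: before encoding the whole prompt as one
        # batch, free enough cache space for the prompt plus the tokens we
        # are about to generate, in a single call.
        if kv_cache is not None:
            space_needed = seq_len + max_gen_len
            past_key_values = kv_cache.evict_for_space(past_key_values, space_needed)

        past_key_values = greedy_generate(
            model, tokenizer, input_ids, past_key_values, max_gen_len=max_gen_len
        )
```

Evicting once per prompt lets the prompt be encoded in a single forward pass instead of token by token, which is why this placement is faster in practice.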

Guangxuan