Closed · DavideHe closed this 9 months ago
I found that the line `past_key_values = kv_cache(past_key_values)` should be added at the end of the `for _ in range(max_gen_len - 1):` block in `greedy_generate`.
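For reference, the suggested change would sit at the end of each decoding step, roughly like this — a paraphrased sketch of the repo's `greedy_generate` loop, with `model`, `pred_token_idx`, `past_key_values`, and `kv_cache` assumed from the surrounding function, not the exact code:

```python
for _ in range(max_gen_len - 1):
    outputs = model(
        input_ids=pred_token_idx,
        past_key_values=past_key_values,
        use_cache=True,
    )
    past_key_values = outputs.past_key_values
    pred_token_idx = outputs.logits[:, -1, :].argmax(dim=-1).unsqueeze(1)
    # Suggested addition: evict old cache entries after every generated token.
    past_key_values = kv_cache(past_key_values)
```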
I appreciate your interest!
When implementing StreamingLLM, there are two primary methods to evict tokens from the `kv_cache`:
1. **Token-by-Token Eviction:** This approach aligns with how we evaluate perplexity, as detailed in our paper. The eviction can be done as `past_key_values = kv_cache(past_key_values)` after each decoding step, exactly as in the sketch above.
2. **Batched Token Eviction:** Alternatively, you can evict an entire chunk of tokens in a single step by calling `kv_cache.evict_for_space`. This method is beneficial as it enables the batched encoding of prompts, which in turn can lead to faster inference (see the sketch after this list).
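As a rough illustration of the batched variant — assuming `kv_cache` is a `StartRecentKVCache`-style object whose `evict_for_space` takes the current cache plus the number of incoming positions, with `tokenizer`, `model`, `prompt`, and `max_gen_len` taken from the surrounding code; paraphrased, not the repo's exact code:

```python
# Encode the next user prompt and measure how much cache room it needs.
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
seq_len = input_ids.shape[1]

# Make room for the whole incoming prompt plus everything we are about
# to generate, in a single batched eviction rather than token by token.
space_needed = seq_len + max_gen_len
past_key_values = kv_cache.evict_for_space(past_key_values, space_needed)
```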
The placement of `kv_cache.evict_for_space` in `streaming_inference` rather than `greedy_generate` is intentional and suited to the aforementioned batched token eviction method.
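Concretely, that placement means eviction runs once per user turn, before the prompt is encoded and decoding begins — a paraphrased outline of `streaming_inference` under the same assumptions as above:

```python
past_key_values = None
for prompt in prompts:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    if kv_cache is not None:
        # One batched eviction per turn; the full prompt is then encoded
        # in a single forward pass inside greedy_generate.
        space_needed = input_ids.shape[1] + max_gen_len
        past_key_values = kv_cache.evict_for_space(past_key_values, space_needed)
    past_key_values = greedy_generate(
        model, tokenizer, input_ids, past_key_values, max_gen_len
    )
```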
Guangxuan
As your paper describes, StreamingLLM should work during generation, but `kv_cache.evict_for_space` is in `streaming_inference`, not in `greedy_generate`.
If `kv_cache.evict_for_space` is only in `streaming_inference`, is there any difference from `dense_attention` regarding issues like position extrapolation?