mit-han-lab / streaming-llm

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
https://arxiv.org/abs/2309.17453
MIT License

Is the placement of "kv_cache.evict_for_space" wrong? #48

Closed DavideHe closed 9 months ago

DavideHe commented 9 months ago

As your paper describes, StreamingLLM works during generation, but "kv_cache.evict_for_space" is called in "streaming_inference", not in "greedy_generate".

if "kv_cache.evict_for_space" in "streaming_inference" , is there different with dense_attention like position extrapolation ?

DavideHe commented 9 months ago

I think the code "past_key_values = kv_cache(past_key_values)" should be added at the end of the "for _ in range(max_gen_len - 1):" loop.
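
A minimal sketch of that suggestion, assuming the loop structure of greedy_generate in streaming_llm/utils.py; the optional kv_cache argument is added here only for illustration and is not part of the original signature:

```python
import torch


@torch.no_grad()
def greedy_generate(model, tokenizer, input_ids, past_key_values, max_gen_len, kv_cache=None):
    # Encode the prompt and pick the first generated token.
    outputs = model(input_ids=input_ids, past_key_values=past_key_values, use_cache=True)
    past_key_values = outputs.past_key_values
    pred_token_idx = outputs.logits[:, -1, :].argmax(dim=-1).unsqueeze(1)
    generated_ids = [pred_token_idx.item()]

    for _ in range(max_gen_len - 1):
        outputs = model(input_ids=pred_token_idx, past_key_values=past_key_values, use_cache=True)
        past_key_values = outputs.past_key_values
        pred_token_idx = outputs.logits[:, -1, :].argmax(dim=-1).unsqueeze(1)
        generated_ids.append(pred_token_idx.item())

        # Proposed token-by-token eviction: keep the attention-sink tokens plus
        # the recent window after every generated token, so the cache never
        # grows past its budget (kv_cache is the cache object returned by
        # enable_streaming_llm).
        if kv_cache is not None:
            past_key_values = kv_cache(past_key_values)

    print(tokenizer.decode(generated_ids, skip_special_tokens=True))
    return past_key_values
```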

Guangxuan-Xiao commented 9 months ago

I appreciate your interest!

When implementing StreamingLLM, there are two primary ways to evict tokens from the kv_cache:

  1. Token-by-Token Eviction: This approach aligns with how we evaluate perplexity, as detailed in our paper. The eviction can be done as past_key_values = kv_cache(past_key_values).

  2. Batched Token Eviction: Alternatively, you can choose to evict an entire chunk of tokens in a single step by calling kv_cache.evict_for_space. This method is beneficial as it enables the batched encoding of prompts, which in turn can lead to faster inference.

The placement of kv_cache.evict_for_space in the streaming_inference rather than greedy_generate is intentional and suited for the aforementioned batched token eviction method.
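
For reference, a minimal sketch of the batched-eviction pattern, modeled on streaming_inference in the repo's example script; the evict_for_space call signature and variable names here are assumptions for illustration, and greedy_generate is the sketch from the comment above:

```python
def streaming_inference(model, tokenizer, prompts, kv_cache, max_gen_len=1000):
    past_key_values = None
    for prompt in prompts:
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
        seq_len = input_ids.shape[1]

        # Batched token eviction: before encoding the whole prompt as one
        # batch, free enough cache space for the prompt plus the tokens we
        # are about to generate, in a single call.
        if kv_cache is not None:
            space_needed = seq_len + max_gen_len
            past_key_values = kv_cache.evict_for_space(past_key_values, space_needed)

        past_key_values = greedy_generate(
            model, tokenizer, input_ids, past_key_values, max_gen_len=max_gen_len
        )
```

Evicting once per prompt lets the prompt be encoded in a single forward pass instead of token by token, which is why this placement is faster in practice.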

Guangxuan