[EMNLP'23, ACL'24] To speed up LLM inference and enhance LLMs' perception of key information, compress the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
I was evaluating how well (Long)LLMLingua is able to achieve the requested compression rate (focusing on the `rate` parameter, not `target_token`) and came to these conclusions:
- For smaller prompts (< 150 tokens), barely any compression is achieved, if any at all.
- The requested compression rate is matched most closely for prompts of around 2,000 tokens.
- For longer prompts (> 5,000 tokens), the requested rate is overshot (or undershot).
More detailed results are below.
My question is: am I doing something wrong when invoking LLMLingua, or is this behaviour normal?
I adhered to the usage examples in README.md:
Code snippet
```python
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="NousResearch/Llama-2-7b-hf",  # or "openai-community/gpt2"
    device_map="balanced",
)
...
# `longllmlingua` is a boolean flag set elsewhere in my script.
def compress(prompt, rate, question=""):
    if longllmlingua:
        # LongLLMLingua: question-aware, coarse-to-fine compression
        res = compressor.compress_prompt(
            [prompt],
            question=question,
            rate=rate,
            condition_in_question="after_condition",
            reorder_context="sort",
            dynamic_context_compression_ratio=0.3,
            condition_compare=True,
            rank_method="longllmlingua",
        )
    else:
        # Plain LLMLingua, only the requested rate
        res = compressor.compress_prompt(prompt, rate=rate)
    return res
```
I tested with the default Llama 2 7B as well as with GPT-2. It seems that the deviation is overall smaller with the smaller model than with the bigger one.
(Prompt lengths measured using the GPT-3.5 tokenizer)
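For reference, the achieved rate per run can be measured with something like the following (a minimal sketch: it assumes the compressed text is returned under `res["compressed_prompt"]` and uses tiktoken's `cl100k_base` encoding, i.e. the GPT-3.5 tokenizer; the 0.5 rate is just an example):

```python
# Sketch: compare the requested vs. the achieved compression rate,
# counting tokens with the GPT-3.5 tokenizer (tiktoken's cl100k_base).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def achieved_rate(original: str, compressed: str) -> float:
    # rate = compressed tokens / original tokens (lower = more compression)
    return len(enc.encode(compressed)) / len(enc.encode(original))

res = compress(prompt, rate=0.5)  # `compress` from the snippet above
print("requested 0.5, achieved", achieved_rate(prompt, res["compressed_prompt"]))
```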
LLMLingua with Llama 2
![Image](https://github.com/user-attachments/assets/68f6e291-088d-4c5b-a38a-19744d43faac)
LLMLingua with GPT-2
![Image](https://github.com/user-attachments/assets/875c159f-2a2f-4068-add5-0b67ce0faa2c)
LongLLMLingua with Llama 2
![Image](https://github.com/user-attachments/assets/ff9b9a56-fd76-4d58-b9e8-f3703147454f)
LongLLMLingua with GPT-2
![Image](https://github.com/user-attachments/assets/c47fde82-de09-4b67-8eb7-9a03a231e571)
In contrast, LLMLingua-2 adheres to the requested compression rate quite well, only slightly overshooting the requested rate:
LLMLingua-2
![Image](https://github.com/user-attachments/assets/2ec11aed-b05e-40d4-8f20-b41c770acd1e)
The prompts I used are truncated from the longest prompt in the LongBench GovReport task (link).
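For reference, the LLMLingua-2 path is enabled via a different compressor setup, as shown in the README (a minimal sketch; the model name and `use_llmlingua2` flag are as documented there, and the rate value is illustrative):

```python
# Sketch of the LLMLingua-2 setup from the README; the same rate sweep applies.
from llmlingua import PromptCompressor

llmlingua2 = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,  # switch PromptCompressor to the LLMLingua-2 model
)
res = llmlingua2.compress_prompt(prompt, rate=0.5)  # illustrative rate
```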