microsoft / LLMLingua

[EMNLP'23, ACL'24] To speed up LLM inference and enhance LLMs' perception of key information, compress the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
https://llmlingua.com/
MIT License

[Question]: Achieved compression rate with (Long)LLMLingua not meeting expectations? #195

Open cornzz opened 9 hours ago

cornzz commented 9 hours ago

Describe the issue

I was evaluating how well (Long)LLMLingua is able to achieve the requested compression rate (focusing on the `rate` parameter, not `target_tokens`) and came to these conclusions:

More detailed results are below. My question is: am I doing something wrong when invoking LLMLingua, or is this behaviour normal? I adhered to the usage examples in README.md:

Code snippet

```python
compressor = PromptCompressor(
    model_name="NousResearch/Llama-2-7b-hf",  # or "openai-community/gpt2"
    device_map="balanced",
)

...

def compress(prompt, rate, question=""):
    if longllmlingua:
        res = compressor.compress_prompt(
            [prompt],
            question=question,
            rate=rate,
            condition_in_question="after_condition",
            reorder_context="sort",
            dynamic_context_compression_ratio=0.3,
            condition_compare=True,
            rank_method="longllmlingua",
        )
    else:
        res = compressor.compress_prompt(prompt, rate=rate)
    return res
```

I tested with the default Llama 2 7B as well as with GPT-2. It seems that the overall deviation is smaller with the smaller model than with the bigger one.

(Prompt lengths measured using the GPT-3.5 tokenizer)
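For context, here is a minimal sketch of how the deviation figures below were computed. The `achieved_rate` and `deviation` helpers are my own, and I am assuming `rate` denotes the fraction of tokens retained (e.g. `rate=0.5` means the compressed prompt should be about half the original token count), as in the LLMLingua usage examples:

```python
def achieved_rate(original_tokens: int, compressed_tokens: int) -> float:
    """Fraction of tokens retained after compression.

    Token counts were obtained with the GPT-3.5 tokenizer; any tokenizer
    works as long as the same one is used for both prompts.
    """
    return compressed_tokens / original_tokens


def deviation(requested_rate: float, original_tokens: int, compressed_tokens: int) -> float:
    """Difference between achieved and requested rate.

    Positive values mean the compressed prompt is longer than requested
    (undershooting the compression), negative values mean it is shorter.
    """
    return achieved_rate(original_tokens, compressed_tokens) - requested_rate
```

For example, a 1000-token prompt compressed to 300 tokens at `rate=0.25` gives a deviation of +0.05, i.e. 5 percentage points more tokens kept than requested.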

LLMLingua with Llama 2 ![Image](https://github.com/user-attachments/assets/68f6e291-088d-4c5b-a38a-19744d43faac)
LLMLingua with GPT-2 ![Image](https://github.com/user-attachments/assets/875c159f-2a2f-4068-add5-0b67ce0faa2c)
LongLLMLingua with Llama 2 ![Image](https://github.com/user-attachments/assets/ff9b9a56-fd76-4d58-b9e8-f3703147454f)
LongLLMLingua with GPT-2 ![Image](https://github.com/user-attachments/assets/c47fde82-de09-4b67-8eb7-9a03a231e571)

In contrast, LLMLingua-2 adheres to the requested compression rate quite well, only slightly overshooting the requested rate:

LLMLingua-2 ![Image](https://github.com/user-attachments/assets/2ec11aed-b05e-40d4-8f20-b41c770acd1e)

The prompts I used are truncated from the longest prompt in the LongBench GovReport task (link).

cornzz commented 7 hours ago

-- Moved to separate issue: https://github.com/microsoft/LLMLingua/issues/196 --