microsoft / LLMLingua

To speed up LLM inference and enhance the LLM's perception of key information, LLMLingua compresses the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
https://llmlingua.com/
MIT License

[Question]: LongLLMLingua vs. LLMLingua2 for chatbot history compression #113

Closed · DomStan closed this issue 6 months ago

DomStan commented 6 months ago

Describe the issue

Hey guys, thanks a lot for your great work on prompt compression, really amazing results!

I have a question about a chatbot-history compression use case. Could you share your intuition on which method might work better for it:

  1. LongLLMLingua using the user's last query as the question, with re-ranking/ordering turned off, and treating each chat message as a separate document
  2. Just using LLMLingua-2 to compress the chat history, with the option of fine-tuning the embedding model on a chat-based dataset

Thank you!

iofu728 commented 6 months ago

Hi @DomStan, thanks for your interest in and support of LLMLingua. This depends on the overhead you can tolerate.

Generally speaking, for chat scenarios: if the topic is relatively fixed and low latency is a hard requirement, you might opt for Solution 2 (LLMLingua-2). If the topic varies significantly and you can accept the overhead of online, question-aware compression, then Solution 1 (LongLLMLingua) might be more suitable and potentially offers higher performance. Ultimately, though, this will depend on your specific scenario.
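For reference, the two setups might be wired up roughly like this. This is a sketch assuming the README-style `PromptCompressor` / `compress_prompt` API; the chat history, helper function, model choice, and parameter values are illustrative assumptions, not recommendations:

```python
# Sketch of the two options discussed above. The helper and demo values are
# hypothetical; the LLMLingua calls follow the README-style API.

def history_to_docs(history):
    """Turn (role, text) chat turns into per-message documents, plus the
    last message to use as the LongLLMLingua question (assumes the last
    turn is the user's query)."""
    docs = [f"{role}: {text}" for role, text in history]
    return docs, docs[-1]


def demo():  # requires `pip install llmlingua` plus model downloads
    from llmlingua import PromptCompressor

    history = [
        ("user", "How do I reset my password?"),
        ("assistant", "Go to Settings > Security and click 'Reset password'."),
        ("user", "What if I no longer have access to my email?"),
    ]
    docs, question = history_to_docs(history)

    # Option 1: LongLLMLingua -- question-aware compression, each chat
    # message treated as a separate document; reorder_context="original"
    # keeps the chat order instead of re-ranking documents.
    compressor = PromptCompressor()
    out1 = compressor.compress_prompt(
        docs,
        question=question,
        rank_method="longllmlingua",
        reorder_context="original",
        target_token=200,
    )

    # Option 2: LLMLingua-2 -- question-agnostic token classification,
    # applied to the whole history at a fixed compression rate.
    compressor2 = PromptCompressor(
        model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
        use_llmlingua2=True,
    )
    out2 = compressor2.compress_prompt("\n".join(docs), rate=0.33)

    return out1["compressed_prompt"], out2["compressed_prompt"]


if __name__ == "__main__":
    for prompt in demo():
        print(prompt, "\n---")
```

The compressor calls are kept inside `demo()` so the message-formatting helper can be used without the models loaded; in Option 1 the per-message document split is what lets the question-aware ranking drop whole irrelevant turns.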

Best wishes,