microsoft / LLMLingua

To speed up LLM inference and enhance the model's perception of key information, LLMLingua compresses the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
https://llmlingua.com/
MIT License

Params to use for compressing Dialogues #68

Closed (vikram71198 closed this issue 6 months ago)

vikram71198 commented 6 months ago

Hi,

Thanks for this amazing piece of work. I was trying to use this framework to compress a prompt whose context is a dialogue between two people, and I'm trying to compress the dialogue alone, leaving the instruction & question uncompressed.

So far, even with low compression ratios like 0.1-0.15, I'm seeing significant deviation in outputs between the compressed prompt and the original, uncompressed prompt. In fact, the compressed prompt that comes out also tends to be unintelligible in quite a few places. I was using the same params as you do here, although I'm not entirely sure what context_budget does exactly.
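
For reference, here's roughly how I'm calling it (a minimal sketch; the instruction, question, and dialogue strings are placeholders for my actual data):

```python
from llmlingua import PromptCompressor

llm_lingua = PromptCompressor()  # default model from the README

instruction = "Read the dialogue and answer the question."       # placeholder
question = "How satisfied is the customer with the resolution?"  # placeholder
dialogue = (
    "Agent: Hello, how can I help you today?\n"
    "Customer: My order arrived damaged.\n"
    "Agent: I'm sorry to hear that, let me arrange a replacement."
)

result = llm_lingua.compress_prompt(
    dialogue,                 # the whole dialogue passed as one str
    instruction=instruction,  # left uncompressed
    question=question,        # left uncompressed
    ratio=0.1,                # one of the ratios I mentioned above
)
print(result["compressed_prompt"])
```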

Also, I currently pass the dialogue in as a str. Would it make any difference if I segmented it line by line and passed it in as a List[str]?

The dialogue has speaker roles like Agent: & Customer: that sometimes get dropped after compression. Is there a way I can make sure certain tokens are never dropped? I'm guessing that's what force_context_ids is for? Does this param take input_ids after tokenizing? I'm confused.

Do you have any suggestions on what would be a good/optimal param setting to compress dialogues?

iofu728 commented 6 months ago

Hi @vikram71198, thank you for your interest in LLMLingua.

  1. I suggest you divide the dialogue into segments and pass them in as the context (i.e., a List[str]). This will activate the coarse-level compression.
  2. Regarding the need to retain speaker information, we are planning to support this feature in the coming weeks.
  3. For now, you could manually extract the speaker names using regular expressions, replace them with blank characters, and use keep_split to retain the "\n\n" separators between dialogue turns. After obtaining the compressed prompt, you can split it on "\n\n" and restore the speaker information; see the sketch after this list.
  4. Given your scenario, which seems akin to an Online Meeting, you might find this example helpful.
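
Putting points 1-3 together, a rough sketch of what this could look like (the regex, speaker labels, and restore logic are illustrative, not part of the library):

```python
import re

from llmlingua import PromptCompressor

llm_lingua = PromptCompressor()

dialogue = (
    "Agent: Hello, how can I help you today?\n\n"
    "Customer: My order arrived damaged and I'd like a replacement.\n\n"
    "Agent: I'm sorry to hear that. A replacement is on its way."
)

# 1. Divide the dialogue into turns so coarse-level compression is activated.
turns = dialogue.split("\n\n")

# 3. Strip the speaker labels before compressing, remembering their order.
speaker_pattern = re.compile(r"^(Agent|Customer):\s*")
speakers = [speaker_pattern.match(t).group(1) for t in turns]
stripped_turns = [speaker_pattern.sub("", t) for t in turns]

result = llm_lingua.compress_prompt(
    stripped_turns,   # List[str] activates coarse-level compression
    ratio=0.3,
    keep_split=True,  # retain the "\n\n" separators between turns
)

# Split the compressed prompt on "\n\n" and restore the speaker labels.
# Caveat: if the coarse level drops a whole turn, this zip misaligns;
# a robust version would match retained turns back to speakers by content.
compressed_turns = result["compressed_prompt"].split("\n\n")
restored = "\n\n".join(
    f"{spk}: {turn}" for spk, turn in zip(speakers, compressed_turns)
)
print(restored)
```
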
vikram71198 commented 6 months ago

The example given in your notebook is very helpful for QA tasks, where the answer does not have to be aggregated across multiple portions of the context & is instead present in a single, definite text span within it.

What I mean is: given a conversation between an agent & a customer at a customer service center, if we wanted to figure out how satisfied the customer is with their issue resolution, the compression technique outlined in that notebook would not really work, because that answer has to be aggregated over multiple portions of the dialogue.

This is just a suspicion; I have yet to test it out.

iofu728 commented 6 months ago

Hi @vikram71198,

Yes, scenarios similar to RAG, where the answer appears directly in the text, are highly suitable for our method. However, we have also tested more complex scenarios, such as multi-hop QA and other tasks requiring global information. Our method employs a coarse-to-fine approach: the coarse level eliminates irrelevant documents/segments, and compression is then performed at a finer granularity within what remains. This mechanism allows us to perform well even on tasks that require global information.
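
To make that concrete, both levels can be toggled through compress_prompt flags; a minimal sketch, assuming the use_context_level_filter / use_token_level_filter parameters behave as in the current release:

```python
from llmlingua import PromptCompressor

llm_lingua = PromptCompressor()

# Several segments: the answer (overall satisfaction) is spread across
# them, so no single segment contains it on its own.
segments = [
    "Agent: Thanks for calling, how can I help?",
    "Customer: My refund still hasn't shown up after two weeks.",
    "Agent: I've escalated it; you should see it within three days.",
    "Customer: Alright, I appreciate the quick help.",
]

result = llm_lingua.compress_prompt(
    segments,
    question="How satisfied is the customer overall?",
    ratio=0.4,
    use_context_level_filter=True,  # coarse: rank and drop whole segments
    use_token_level_filter=True,    # fine: compress tokens in the survivors
)
print(result["compressed_prompt"])
```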