microsoft / LLMLingua

To speed up LLM inference and enhance the LLM's perception of key information, compress the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
https://llmlingua.com/
MIT License

[Question]: LLMLingua-2 Sample-Wise Dynamic Compression Ratio #174

Open cornzz opened 1 month ago

cornzz commented 1 month ago

Describe the issue

Hi,

I have two questions:

  1. Appendix L of the LLMLingua-2 paper describes letting the compressor adjust the compression rate for different samples, but I cannot find any documentation about this in the git repo, and looking at compress_prompt_llmlingua2() it seems like it is not possible? Also, I don't quite understand from the explanation in Appendix L how this dynamic compression is supposed to work. Where can I find more details?

  2. What is the use_context_level_filter parameter for?

iofu728 commented 3 weeks ago

Hi @cornzz, thanks for your interest in LLMLingua.

  1. First, you can find detailed documentation at this link.
  2. In Appendix L, the dynamic compression ratio (DCR) is actually determined by using the compressor predictor's output as an indicator to allocate the compression ratio across samples. However, this feature hasn't been added to the library yet. [ToDo] @pzs19
  3. The "use_context_level_filter" controls whether to apply coarse-level prompt compression.
cornzz commented 3 weeks ago

@iofu728 thanks a lot for your response!

Regarding 3: is this what is referred to in the last paragraph of section 4.2 in the paper?

> our approach can be readily integrated into the coarse-to-fine framework proposed in LLMLingua (Jiang et al., 2023a), allowing for a higher compression ratio of ∼15x for tasks involving multiple demonstrations or documents.
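
For reference, a sketch of what that coarse-to-fine call could look like with the LLMLingua-2 compressor. The PromptCompressor setup follows the repo's documented usage; use_context_level_filter is the parameter discussed above, and the demonstration strings are placeholders:

```python
from llmlingua import PromptCompressor

# LLMLingua-2 compressor, set up as in the repo's documented example.
llm_lingua = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

# Multiple demonstrations / documents passed as a list of contexts.
demonstrations = ["<demonstration 1>", "<demonstration 2>", "<demonstration 3>"]

results = llm_lingua.compress_prompt(
    demonstrations,
    rate=0.33,                      # overall token budget
    use_context_level_filter=True,  # coarse level: drop low-value demonstrations first
    force_tokens=["\n", "?"],       # as in the documented LLMLingua-2 example
)
print(results["compressed_prompt"])
```
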