microsoft / LLMLingua

To speed up LLM inference and sharpen the model's perception of key information, LLMLingua compresses the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
https://llmlingua.com/
MIT License

Understanding the interplay between `ratio` and `iterative_size` #61

Closed: acnagle closed this issue 6 months ago

acnagle commented 7 months ago

Thank you for the interesting work and for making the code easily accessible. I have some confusion about the relationship between the `ratio` and `iterative_size` parameters.

In the case I am interested in, there is a single demonstration that I want to compress using only the token-level compression approach. I've noticed that, in general, the final ratio between the compressed and original lengths can vary quite a bit for large enough `ratio` values. However, when I make the `iterative_size` parameter small, e.g. 10, the final compressed ratio is more faithful to the value specified for the `ratio` parameter.

I'm confused as to why this is the case. From the paper, my understanding was that the threshold $\gamma_j$ for segment $s_j$ (whose length is set by the `iterative_size` parameter) was based primarily on the `ratio` parameter, meaning that, regardless of the `iterative_size`, LLMLingua would always prune a `ratio` fraction of the tokens in that segment.

Any clarification on this would be helpful, including a pointer to where $\gamma_j$ is computed in the code.

iofu728 commented 7 months ago

Hi @acnagle, thank you for your support of LLMLingua.

This is a great question, and I believe other users may have similar queries.

The actual compression ratio does indeed depend on `iterative_size`. Specifically, $\gamma_j$ is computed from the global perplexity (PPL) distribution, by taking the quantile corresponding to the compression ratio; see https://github.com/microsoft/LLMLingua/blob/main/llmlingua/prompt_compressor.py#L621.
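
For intuition, here is a minimal sketch of that quantile step. This is not the actual LLMLingua code, and it assumes `ratio` denotes the fraction of tokens to drop:

```python
import numpy as np

def estimate_threshold(token_ppls: np.ndarray, ratio: float) -> float:
    """Illustrative stand-in for the quantile step: gamma_j is the score
    below which a `ratio` fraction of the tokens fall, so dropping every
    token under the threshold removes roughly that fraction."""
    # np.quantile(x, q) returns the value below which a fraction q of the
    # samples lie; tokens scoring under it are treated as "predictable".
    return float(np.quantile(token_ppls, ratio))
```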

However, since it's challenging to directly observe the true global PPL distribution after compression, it's also difficult to obtain a truly accurate quantile. We therefore update the PPL distribution estimate per segment. A smaller `iterative_size` yields more sampling points for this estimate, so the realized compression ratio tracks the requested one more closely. This is particularly evident when the original prompt is short.
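
To make the segment-wise update concrete, here is a hypothetical sketch (again, not the library's implementation): the pool of scores observed so far stands in for the unobservable post-compression global distribution, and the threshold is re-estimated once per segment, so a smaller `iterative_size` means more frequent updates:

```python
import numpy as np

def iterative_compress(token_ppls, ratio, iterative_size):
    """Keep the indices of tokens whose PPL clears a per-segment threshold.

    Hypothetical helper: `observed` accumulates the scores seen so far and
    approximates the global PPL distribution, which cannot be measured
    directly once earlier segments have been compressed.
    """
    kept, observed = [], []
    for start in range(0, len(token_ppls), iterative_size):
        segment = token_ppls[start:start + iterative_size]
        observed.extend(segment)                 # refine the estimate
        gamma_j = np.quantile(observed, ratio)   # re-estimated threshold
        for idx, ppl in enumerate(segment, start=start):
            if ppl >= gamma_j:                   # keep informative tokens
                kept.append(idx)
    return kept

# With iterative_size=10 the threshold is updated 10x as often as with
# iterative_size=100, so the realized ratio tracks `ratio` more closely.
```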

I believe that by improving `get_estimate_threshold_base_distribution`, the estimate of the PPL distribution can be made more accurate.

I hope this answers your question, and thank you again for your support.

acnagle commented 6 months ago

Hello again, I was reviewing the code and realized that I might not have understood your explanation as well as I thought. Just to make sure I understand: by using a smaller `iterative_size`, we are able to get a better approximation of $p(\tilde{s}_j)$ according to Eq. 5 in the paper, and with a better approximation, the chosen $\gamma_j$ quantile will be more faithful to the compression ratio that we passed in. Is this correct?

You also mentioned in your comment that there might be a way to improve `get_estimate_threshold_base_distribution`. Do you have any suggestions in this direction?

iofu728 commented 6 months ago

Hi @acnagle,

Yes, the purpose of iterative compression is to minimize the approximation loss in Eq. 5. This approximation can be improved in two ways:

  1. By explicitly learning from compressed results through training; we will release work based on this perspective next month.
  2. By fitting a curve to compensate for the gap between the requested and realized ratios (see the sketch below).
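
As an illustration of the second direction (the numbers here are entirely made up; real calibration pairs would have to be measured on actual prompts), one could fit a curve mapping requested to achieved ratios and invert it:

```python
import numpy as np

# Hypothetical calibration pairs: the ratio passed in vs. the ratio observed.
requested = np.array([0.2, 0.3, 0.4, 0.5, 0.6, 0.7])
achieved = np.array([0.26, 0.37, 0.46, 0.57, 0.66, 0.74])

# Fit achieved -> requested, so we can answer: "what should I request
# in order to actually get this ratio?"
coeffs = np.polyfit(achieved, requested, deg=2)

def ratio_to_request(target_ratio: float) -> float:
    """Return the `ratio` argument expected to yield ~target_ratio."""
    return float(np.polyval(coeffs, target_ratio))
```
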
acnagle commented 6 months ago

Thank you for your response. I'm looking forward to the follow-up work!