microsoft / LLMLingua

To speed up LLM inference and enhance the LLM's perception of key information, compress the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
https://llmlingua.com/
MIT License

[Question]: How does the token-level question-aware compression work? #141

Closed: acnagle closed this 2 months ago

acnagle commented 5 months ago

Describe the issue

I saw #103 asked a similar question, but I'm not sure I understand how this works with respect to Equation 5 from the first LLMLingua paper. If I have a query with condition_in_question=after and condition_compare=True, then my understanding is that the probability of a compressed segment will not actually be conditioned on the query, since the context (the thing we are compressing) comes before the query. I know this is probably not a problem in the code itself, but I don't fully understand the implementation, and I'm trying to connect what the paper says with what the code does.
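For reference, here is roughly the kind of call I mean (an illustrative sketch based on the README example; `context` and `question` are placeholders for my own inputs):

```python
from llmlingua import PromptCompressor

llm_lingua = PromptCompressor()

compressed = llm_lingua.compress_prompt(
    context,                        # list of context strings to compress
    question=question,              # the query I want to condition on
    ratio=0.5,                      # keep roughly half of the tokens
    condition_in_question="after",  # the question sits after the context
    condition_compare=True,         # token level: contrastive perplexity
)
```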

I see that Equation 3 in the LongLLMLingua paper uses the contrastive perplexity to score each token, but are we still first segmenting the context and then pruning tokens from the context in a similar way to the original LLMLingua paper? Based on Equation 3, I'm confused about how to properly condition on the query. So, regardless of what condition_in_question is, do we put the query before the rest of the context in the LLM's input in order to compute the perplexity of each token x_i?

Any help is greatly appreciated!

iofu728 commented 4 months ago

Hi @acnagle, thank you for your question. At the segment level, we only use the condition_in_question parameter, which specifies whether the question is positioned before or after the context. At the token level, we only use the condition_compare parameter, which chooses between perplexity and conditional perplexity.
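Schematically, the two switches act like this (a simplified paraphrase for illustration, not the actual LLMLingua code; `ppl` is a stand-in for a causal-LM perplexity helper):

```python
def ppl(text, prefix):
    """Hypothetical helper: per-token perplexity of `text` given `prefix`
    under a small causal LM. Stubbed here for illustration."""
    ...

def segment_score(segment, question, condition_in_question):
    # Segment level: condition_in_question only controls where the question
    # sits in the scoring input, i.e., what the LM conditions on.
    if condition_in_question == "before":
        return ppl(segment, question)  # context tokens see the question
    return ppl(segment, "")            # "after": scored without the question

def token_score(token, prefix, question, condition_compare):
    # Token level: condition_compare alone decides plain vs. conditional
    # (contrastive) perplexity; condition_in_question has no effect here.
    if condition_compare:
        # Contrastive perplexity: how much the question shifts the token's
        # perplexity (sign convention here is illustrative; see Eq. (3)
        # in the LongLLMLingua paper for the exact definition).
        return ppl(token, question + prefix) - ppl(token, prefix)
    return ppl(token, prefix)
```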

Therefore, with condition_in_question=after and condition_compare=True, the segment probability is not actually conditioned on the question, i.e., $P(\text{context} \mid \text{question})$ is not what gets computed, because the question appears after the context in the scoring input.

Yes, we still segment the context and use a method similar to Equation (3). Also, the condition_in_question parameter does not take effect at the token level; the token-level behavior is controlled solely by condition_compare.
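Putting it together, the token-level pruning is roughly in this spirit (again a simplified sketch reusing the hypothetical token_score above, not the real implementation, which prunes iteratively against a compression-ratio-based threshold):

```python
def compress(context_tokens, question, keep_ratio=0.5, seg_len=200):
    # Split the context into fixed-length segments, score every token with
    # contrastive perplexity, and keep the highest-scoring tokens per segment.
    kept = []
    for start in range(0, len(context_tokens), seg_len):
        seg = context_tokens[start:start + seg_len]
        scores = [token_score(tok, seg[:i], question, condition_compare=True)
                  for i, tok in enumerate(seg)]
        k = max(1, int(len(seg) * keep_ratio))
        top = sorted(range(len(seg)), key=scores.__getitem__, reverse=True)[:k]
        kept.extend(seg[i] for i in sorted(top))  # preserve original order
    return kept
```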