[Question]: How does the token-level question-aware compression work?

microsoft / LLMLingua

To speed up LLMs' inference and enhance LLM's perceive of key information, compress the prompt and KV-Cache, which achieves up to 20x compression with minimal performance loss.

MIT License

4.48k stars 251 forks source link

Describe the issue

I saw #103 asked a similar question, but I'm not sure I understand how this works with respect to Equation 5 from the first LLMLingua paper. If I have a query with condition_in_question=after and condition_compare=True, then my understanding is that this means that the probability of a compressed segment will not be actually conditioned on the query since the context (which is the thing we are compressing) comes before the query appears. I know this is probably not actually a problem in the code implementation, but I don't fully understand the implementation and I'm trying to connect between what the paper says and what the code is doing.

I see that Equation 3 in the LongLLMLingua uses the contrastive complexity to score each token, but are we still first segmenting the context and then pruning tokens from the context in a similar way as the original LLMLingua paper? Based on Equation 3, I'm confused how to properly condition on the query. So, regardless of what condition_in_question is, do we put the query the LLM context before the rest of the context in order to compute the ppl on x_i?

Any help is greatly appreciated!

microsoft / LLMLingua

[Question]: How does the token-level question-aware compression work? #141

Describe the issue