Is your feature request related to a problem? Please describe.
Lingua2 struggles to achieve ideal compression when given a fixed discard ratio or target length, because the ideal compression ratio, the one that preserves all informative tokens and discards all redundant ones, varies from text to text.
I think setting a probability threshold instead of a ratio or target length could solve this problem: a token would be discarded if the probability of its 'discard' label exceeds the threshold. A sketch of the idea follows below.
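For illustration, here is a minimal sketch of threshold-based filtering. It is not Lingua2's actual API; the function name, tokens, and probabilities are all hypothetical, and it only assumes that a per-token 'discard' probability is already available from the classifier:

```python
from typing import List

def compress_by_threshold(tokens: List[str],
                          discard_probs: List[float],
                          threshold: float = 0.5) -> List[str]:
    """Keep a token only if its 'discard' probability does not exceed the
    threshold, so the effective compression ratio adapts to each text."""
    return [tok for tok, p in zip(tokens, discard_probs) if p <= threshold]

# Hypothetical example: the compression ratio falls out of the content
# instead of being fixed up front.
tokens = ["Please", "kindly", "note", "that", "the", "meeting", "is", "at", "3pm"]
probs  = [0.30, 0.92, 0.20, 0.85, 0.80, 0.05, 0.60, 0.10, 0.02]
print(compress_by_threshold(tokens, probs, threshold=0.5))
# -> ['Please', 'note', 'meeting', 'at', '3pm']
```

With this interface, a verbose text would naturally be compressed more aggressively than a dense one, since the number of tokens above the threshold is determined by the text itself.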
Describe the solution you'd like
No response
Additional context
No response