Enhancing quality - Recovery settings

microsoft / LLMLingua

To speed up LLMs' inference and enhance LLM's perceive of key information, compress the prompt and KV-Cache, which achieves up to 20x compression with minimal performance loss.

MIT License

4.49k stars 251 forks source link

As mentioned in the paper, key concepts might get omitted either corrupted by the compression, in a way that the GPT can't process the compressed prompt.

You mention also there is an approach to optimize around this issue; could you share details on the corresponding configuration options in the Python implementation?

In the attached image, I've tested the GPT confidence degradation according to compression effects on the _qaspere subset of the LongBench benchmark.

fig_scatter_plots_pcompr_confidence

Wrong answers/no answer possible:

Regular GPT-4: %45.36 e.g. without prompt compression (GPT-4 seems to "give up" frequently on longer queries)
Compressed prompt by LLM Lingua, target_token=200: 63.93%
Compressed prompt by LLM Lingua, target_token=400: 60.66%

microsoft / LLMLingua

Enhancing quality - Recovery settings #89