microsoft / LLMLingua

To speed up LLM inference and enhance the LLM's perception of key information, LLMLingua compresses the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
https://llmlingua.com/
MIT License

use other quant formats #40

Open zba opened 8 months ago

zba commented 8 months ago

Would it be hard to use exl2 for the same purpose? Or an OpenAI-compatible API?

iofu728 commented 8 months ago

Hi @zba,

Thank you for your interest in and support of LLMLingua. I believe there are no blocking issues with using the exl2 format. You can try replacing the model-loading code in the LLMLingua `PromptCompressor` with ExLlamaV2.
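
For reference, here is a rough sketch (using the Hugging Face API that LLMLingua is built on) of the quantity the compressor needs from a backend: per-token log probabilities of the prompt under a causal LM. An ExLlamaV2 backend would only need to reproduce this output. The model name below is just an illustration, not a requirement:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model; any small causal LM can serve as the compressor model.
model_name = "NousResearch/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

@torch.no_grad()
def prompt_token_logprobs(text: str) -> list[float]:
    """Return the log probability of each prompt token given its prefix."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    logits = model(ids).logits                      # (1, seq_len, vocab)
    # Logits at position i predict token i+1, so shift by one.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    return logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)[0].tolist()
```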

For the OpenAI format, you can use the latest API to obtain log probabilities and set max_tokens to 0, which returns the log probabilities for the prompt portion.
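
For example, a hedged sketch against the legacy completions endpoint (the model name is illustrative, and echo+logprobs support varies by model):

```python
from openai import OpenAI

client = OpenAI()

# With max_tokens=0 the model generates nothing, but echo=True returns the
# prompt itself along with per-token logprobs, which is what LLMLingua
# needs to score prompt tokens.
resp = client.completions.create(
    model="davinci-002",   # assumption: a completions-capable model
    prompt="Compress this prompt, please.",
    max_tokens=0,          # generate nothing...
    echo=True,             # ...but echo the prompt back
    logprobs=0,            # request token logprobs (0 = no alternatives)
)

lp = resp.choices[0].logprobs
# Note: the first token has no conditioning prefix, so its logprob is None.
for token, logprob in zip(lp.tokens, lp.token_logprobs):
    print(repr(token), logprob)
```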