microsoft / LLMLingua

To speed up LLMs' inference and enhance their perception of key information, LLMLingua compresses the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
https://llmlingua.com/
MIT License
4.48k stars 251 forks

[Question]: LLMLingua requires too much GPU memory and takes a long time to compress long text (e.g., 16k tokens). How can it run alongside the LLM? #147

Open dingjingzhen opened 4 months ago

dingjingzhen commented 4 months ago

Describe the bug

LLMLingua requires too much GPU memory, and it takes a long time to compress long text (e.g., 16k tokens). How can it run alongside the LLM at the same time?

Steps to reproduce

No response

Expected Behavior

No response

Logs

No response

Additional Information

No response

iofu728 commented 4 months ago

Hi @dingjingzhen, thanks for supporting LLMLingua. Could you provide more details about how you are using it and your environment?

The LLMLingua series relies on a smaller model, such as a BERT-level model or llama-7b, to act as a compressor, which has low overhead compared to larger models like GPT-4. To achieve low latency, it is recommended to run it on a GPU comparable to a V100.
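
For example, here is a minimal sketch of pointing the compressor at the lighter BERT-level LLMLingua-2 checkpoint rather than a 7B model; the argument names follow the README, but treat the exact values as illustrative and check the docs for your version:

```python
from llmlingua import PromptCompressor

# Use the BERT-level LLMLingua-2 checkpoint as the compressor; it needs far
# less GPU memory than a llama-7b compressor. Exact arguments may vary by
# version -- see the README.
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
    device_map="cuda",
)

long_prompt = "..."  # e.g. a 16k-token context to be compressed

result = compressor.compress_prompt(long_prompt, rate=0.33)  # keep ~1/3 of tokens
print(result["compressed_prompt"])
```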

dingjingzhen commented 4 months ago

> Hi @dingjingzhen, thanks for supporting LLMLingua. Could you provide more details about how you are using it and your environment?
>
> The LLMLingua series relies on a smaller model, such as a BERT-level model or llama-7b, to act as a compressor, which has low overhead compared to larger models like GPT-4. To achieve low latency, it is recommended to run it on a GPU comparable to a V100.

Since my use case is summarization over a fixed text, I can compress it offline in advance. In that case, which compressor model gives the best compression quality, ignoring latency and GPU memory? The target model is Qwen1.5 32B with a 16k context.
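
A minimal sketch of that offline workflow, assuming the library's default llama-2-7b compressor and hypothetical file names (illustrative only, not a maintainer recommendation):

```python
import json

from llmlingua import PromptCompressor

# Offline step: compress the fixed text once. Latency and GPU memory are
# not a concern here, so a 7B compressor is affordable.
compressor = PromptCompressor(
    model_name="NousResearch/Llama-2-7b-hf",  # the library's default compressor
    device_map="cuda",
)

with open("fixed_text.txt") as f:  # hypothetical input file
    text = f.read()

result = compressor.compress_prompt(text, rate=0.25)

with open("compressed_cache.json", "w") as f:  # hypothetical cache file
    json.dump(result, f)

# Online step: load compressed_cache.json and feed result["compressed_prompt"]
# to the target LLM (e.g. Qwen1.5 32B) without loading the compressor at all.
```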