Open dingjingzhen opened 4 months ago
Hi @dingjingzhen, thanks for supporting LLMLingua. Could you provide more details about how you are using it and your environment?
The LLMLingua series relies on a smaller model, such as BERT-level or llama-7b, to act as a compressor, which offers low overhead compared to larger models like GPT-4. To achieve low latency, it is recommended to use it on a GPU similar to the V100.
Since my requirement is to summarize fixed text, I can run the compression offline in advance. In that case, which model compresses best, if latency and GPU memory are not a concern? My setup is qwen1.5 32B with a 16K context.
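Since the source texts are fixed, the compression can be done once offline and the result cached on disk; at query time only the cached compressed prompt is sent to the LLM, so the compressor never competes with the LLM for GPU memory. A minimal sketch of that caching layer — note that `compress` below is a hypothetical stand-in for the real compressor call (e.g. LLMLingua's `PromptCompressor.compress_prompt`), not the library's API:

```python
import hashlib
import json
from pathlib import Path


def compress(text: str) -> str:
    # Hypothetical stand-in for the actual compressor, e.g. something like
    # PromptCompressor().compress_prompt(text, ...)["compressed_prompt"].
    # The identity function is used here purely for illustration.
    return text


def compress_cached(text: str, cache_dir: Path = Path("compressed_cache")) -> str:
    """Compress `text` once; later calls reuse the on-disk result."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Key the cache entry by a content hash of the fixed source text.
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    cache_file = cache_dir / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())["compressed"]
    compressed = compress(text)
    cache_file.write_text(json.dumps({"compressed": compressed}))
    return compressed
```

With this split, the expensive compression pass can run on a large GPU (or overnight on a smaller one), and the serving path only reads the cached result.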
Describe the bug
LLMLingua requires too much GPU memory, and compressing long text (e.g. 16k tokens) takes a lot of time. How can I run it and the LLM at the same time?
Steps to reproduce
No response
Expected Behavior
No response
Logs
No response
Additional Information
No response