microsoft / LLMLingua

To speed up LLM inference and enhance the model's perception of key information, LLMLingua compresses the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
https://llmlingua.com/
MIT License

Speed Up Compression #64

Open pathquester opened 6 months ago

pathquester commented 6 months ago

First of all, thank you for this fantastic project. I was wondering if there are any parameters that help with the speed of compression. I am currently using TheBloke/Llama-2-7b-Chat-GPTQ, but it seems slow with the default parameters, even for text that is not particularly long.

iofu728 commented 6 months ago

Hi @pathquester, thank you for your support of LLMLingua.

In the current implementation, the latency of quantization models is not significantly different from that of full-precision models; it might even be slightly higher.

In the future, we plan to support more efficient inference engines (#51). If you are limited by hardware resources, you might consider trying smaller models, such as ahxt/LiteLlama-460M-1T or phi-2 (which we will support soon).
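For reference, here is a minimal sketch of swapping in a smaller compression model. The prompt text, `target_token` value, and `device_map` are placeholders, and the argument names follow the PromptCompressor interface documented in the README, which may differ slightly across versions:

```python
from llmlingua import PromptCompressor

# Use a smaller base model to cut compression latency
# (ahxt/LiteLlama-460M-1T is the model suggested above).
llm_lingua = PromptCompressor(
    model_name="ahxt/LiteLlama-460M-1T",
    device_map="cuda",  # or "cpu" if no GPU is available
)

prompt = "..."  # your long context here
result = llm_lingua.compress_prompt(
    prompt,
    instruction="",
    question="",
    target_token=200,  # rough token budget for the compressed prompt
)
print(result["compressed_prompt"])
```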

iofu728 commented 6 months ago

Hi @pathquester,

Thanks to the efforts of the community, phi-2 is now available for use in LLMLingua.

Before using it, please update your transformers to the GitHub version by running pip install -U git+https://github.com/huggingface/transformers.git.
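A quick sketch of loading phi-2 as the compression model once transformers is updated; the `microsoft/phi-2` model id and the `device_map` value are assumptions, so adjust them to your setup:

```python
from llmlingua import PromptCompressor

# phi-2 as the smaller compression model (requires the GitHub version
# of transformers installed with the command above).
llm_lingua = PromptCompressor(model_name="microsoft/phi-2", device_map="cuda")
```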

pathquester commented 6 months ago

Thank you so much for the help! Is the GPTQ version also supported now?

iofu728 commented 6 months ago

Yeah, you can also try the GPTQ version, e.g., TheBloke/phi-2-dpo-GPTQ.
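A minimal sketch of loading the quantized checkpoint; the `model_config={"revision": "main"}` pattern and the optimum/auto-gptq requirement follow the README's GPTQ example for TheBloke/Llama-2-7b-Chat-GPTQ, so treat them as assumptions for this particular checkpoint:

```python
# GPTQ checkpoints additionally need: pip install optimum auto-gptq
from llmlingua import PromptCompressor

# Quantized phi-2 as the compression model.
llm_lingua = PromptCompressor(
    model_name="TheBloke/phi-2-dpo-GPTQ",
    model_config={"revision": "main"},  # revision is an assumption; check the model card
)
```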

pathquester commented 6 months ago

Will try it soon. One clarification: what is the general quality of the results from a tiny model such as GPT-2 small for this? Are the results decent and comparable to what is shown in your examples?

iofu728 commented 6 months ago

Hi @pathquester, based on our experience, even GPT2-small can achieve satisfactory results with moderate compression rates.
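As an illustration, here is a sketch of running GPT-2 small at a moderate compression rate. The `gpt2` model id and the `ratio` argument are assumptions based on the documented compress_prompt interface (the parameter name may differ across versions), and the returned keys follow the README's description of the output dict:

```python
from llmlingua import PromptCompressor

# GPT-2 small as the compression model; workable at moderate compression rates.
llm_lingua = PromptCompressor(model_name="gpt2", device_map="cuda")

prompt = "..."  # your long context here
result = llm_lingua.compress_prompt(
    prompt,
    ratio=0.5,  # moderate compression: keep roughly half of the tokens
)
print(result["origin_tokens"], "->", result["compressed_tokens"], "tokens")
```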