microsoft / LLMLingua

[EMNLP'23, ACL'24] To speed up LLMs' inference and enhance LLMs' perception of key information, compress the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
https://llmlingua.com/
MIT License

[Question]: Reproduce end2end latency results of LLMLingua-2 #193

Open cornzz opened 1 month ago

cornzz commented 1 month ago

Describe the issue

@pzs19
I would like to reproduce and expand the end2end latency benchmark results of the LLMLingua-2 paper and was therefore wondering if you could provide more details on your experiment setup? Specifically:

Thanks a lot!
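
For context, the kind of end-to-end measurement I have in mind is roughly the sketch below (a minimal sketch only: the compressor checkpoint is the one from the README, and the compression rate and target-LLM call are placeholders rather than the paper's settings, which is exactly what I am asking about):

```python
import time

from llmlingua import PromptCompressor

# LLMLingua-2 compressor; checkpoint taken from the README, not necessarily the paper's setup
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

def end_to_end_latency(context: str, call_llm) -> dict:
    """Time prompt compression and the downstream LLM call separately."""
    t0 = time.perf_counter()
    result = compressor.compress_prompt(context, rate=0.33)  # rate is a placeholder
    t1 = time.perf_counter()
    call_llm(result["compressed_prompt"])  # call_llm: any function that queries the target LLM
    t2 = time.perf_counter()
    return {"compression_s": t1 - t0, "generation_s": t2 - t1, "total_s": t2 - t0}
```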

pzs19 commented 2 weeks ago

Thank you for raising the questions. Here is a point-to-point response:

cornzz commented 2 weeks ago

Thank you very much! 🙂

cornzz commented 2 weeks ago

@pzs19 @iofu728 Sorry, a follow-up question: which LLM was used for compression in the end-to-end latency benchmark of the original LLMLingua paper? Under "Implementation Details" it says

In our experiments, we utilize either Alpaca-7B or GPT2-Alpaca as the small pre-trained language model M_s for compression.

However, as far as I can see, it is not specified which of those two models was used for the end-to-end latency benchmark. Actually, it is not specified which compressor was used for the other benchmarks (GSM8K etc.) either, so that would be another question.
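
To clarify what I mean by "which compressor": in the released library the small compression model M_s is chosen when constructing PromptCompressor, e.g. as in the sketch below. The checkpoint names here are stand-ins, since which checkpoints correspond to Alpaca-7B and GPT2-Alpaca in the paper is precisely the question:

```python
from llmlingua import PromptCompressor

# The compression model is selected via model_name; the checkpoints below are
# placeholders at the 7B and GPT-2 scales, not the paper's actual compressors.
compressor_7b = PromptCompressor(model_name="NousResearch/Llama-2-7b-hf")
compressor_small = PromptCompressor(model_name="gpt2")
```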