[Question]: Reproduce end2end latency results of LLMLingua-2

microsoft / LLMLingua

[EMNLP'23, ACL'24] To speed up LLMs' inference and enhance LLM's perceive of key information, compress the prompt and KV-Cache, which achieves up to 20x compression with minimal performance loss.

https://llmlingua.com/

MIT License

4.68k stars 262 forks source link

[Question]: Reproduce end2end latency results of LLMLingua-2 #193

Open cornzz opened 1 month ago

cornzz commented 1 month ago

Describe the issue

@pzs19
I would like to reproduce and expand the end2end latency benchmark results of the LLMLingua-2 paper and was therefore wondering if you could provide more details on your experiment setup? Specifically:

Which target LLM was evaluated (and how was it set up, was vLLM or similar used?)
For the result in Table 5, which prompt length was used, what was the prompt?
Whats the definition of end2end latency? From the beginning of compression until the generation of the first token or until the full response is generated?
What was max_token set to, and did you enforce the generation of a minimum number of tokens?

Thanks a lot!

pzs19 commented 2 weeks ago

Thank you for raising the questions. There is point to point response:

The target LLM is GPT-3.5-Turbo-0613, so vllm is not used.
The latency experiment is conducted on the summarization task of MeetingBank, the prompt follows the main experiment.
End2end latency counts from the beginning of compression until the full response is generated.
We set the "max_token" to 400, following the main experiment.

cornzz commented 2 weeks ago

Thank you very much! 🙂

cornzz commented 2 weeks ago

@pzs19 @iofu728 sorry, a follow up question: which LLM was used for compression in the end-to-end latency benchmark of the original LLMLingua paper? Under "Implementation Details" it says

In our experiments, we utilize either Alpaca-7B4 or GPT2-Alpaca as the small pre-trained language model M𝑠 for compression.

however, as far as I can see, it is not specified which of those two models was used for the end-to-end latency benchmark. Actually it is not specified which compressor was used for the other benchmarks (gsm8k etc.) either, so that would be another question.