microsoft / LLMLingua

[EMNLP'23, ACL'24] To speed up LLM inference and enhance the LLM's perception of key information, LLMLingua compresses the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
https://llmlingua.com/
MIT License

[Question]: Reproduce LLMLingua-2 on the LongBench SingleDoc dataset #146

Open 56wangyun opened 6 months ago

56wangyun commented 6 months ago

Describe the issue

We referred to your code at https://github.com/microsoft/LLMLingua/blob/main/experiments/llmlingua2/evaluation/compress.py and https://github.com/microsoft/LLMLingua/blob/main/experiments/llmlingua2/evaluation/eval_longbench.py

- target token: 2000
- compression model: llmlingua-2-bert-base-multilingual-cased-meetingbank
- LLM model: Mistral-7B-Instruct-v0.1 (from https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1/tree/main)
- LongBench SingleDoc tasks: qasper, multifieldqa_en, narrativeqa
- Hardware platform: 1 Nvidia A100-80GB

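The compression step was roughly the following; this is a minimal sketch assuming the standard `PromptCompressor` API from the llmlingua package (the actual compress.py iterates over the LongBench data, and the `force_tokens` values here are only illustrative):

```python
# Sketch of the LLMLingua-2 compression call used for the 2000-token setting.
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    use_llmlingua2=True,  # enable the LLMLingua-2 token-classification compressor
)

context = "..."  # hypothetical placeholder for one LongBench document

result = compressor.compress_prompt(
    context,
    target_token=2000,                   # 2000-token constraint, as in Table 4
    force_tokens=["\n", ".", "?", "!"],  # illustrative; keep structural tokens
)
compressed_prompt = result["compressed_prompt"]
```
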
The results differ from those reported in the paper (Table 4, LLMLingua-2-small, LongBench-SingleDoc, 2000-token constraint). The compressed prompt evaluation scores are: {'qasper': 32.27, 'multifieldqa_en': 33.04, 'narrativeqa': 8.84}, average score 24.7 (25.3 in the paper).

The uncompressed prompt evaluation scores are: {"multifieldqa_en": 37.07, "qasper": 33.83, "narrativeqa": 19.89}, average score 30.3 (24.5 in the paper).

What are the experiment settings in the paper, and what might account for the difference in the evaluation results? Thank you for your reply.

iofu728 commented 6 months ago

Hi @56wangyun, thanks for your support with LLMLingua-2.

In general, you should be able to reproduce the results of Table 4 by following the steps in eval_longbench.py and compress.py. Could you provide more details, including the codebase for Mistral inference, as well as the coding environment? This information would help ensure accurate replication of the results.

pzs19 commented 6 months ago

Hi @56wangyun, thanks for providing the detailed information.

I believe the difference in results may indeed be attributed to the use of different Mistral models. As mentioned in the "Mistral-7B as the Target LLM" part of the Experiment section, we utilized the "mistral-7B-v0.1" model (available at https://github.com/mistralai/mistral-src) rather than the "mistral-7B-instruct-v0.1" model as the target model. Hope this information aids in replicating the results.
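For anyone replicating this, a minimal sketch of swapping in the base model is below. It assumes a standard Hugging Face transformers setup (the paper's evaluation may instead use the mistral-src reference code linked above), and `compressed_prompt` / `question` are hypothetical names standing in for one compressed LongBench example:

```python
# Sketch: use the base Mistral-7B-v0.1 as the target LLM, not the Instruct variant.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"  # base model used in the paper,
                                        # not "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Hypothetical prompt assembly; the real scripts use the LongBench task templates.
prompt = compressed_prompt + "\n\nQuestion: " + question + "\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
answer = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
```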