56wangyun opened this issue 6 months ago
Hi @56wangyun, thanks for your support with LLMLingua-2.
In general, you should be able to reproduce the results of Table 4 by following the steps in eval_longbench.py and compress.py. Could you provide more details, including the codebase used for Mistral inference as well as the coding environment? This information would help ensure accurate replication of the results.
Hi @56wangyun, thanks for providing the detailed information.
I believe the difference in results may indeed be attributed to the use of different Mistral models. As mentioned in the "Mistral-7B as the Target LLM" part of the Experiment section, we used the "mistral-7B-v0.1" model (available at https://github.com/mistralai/mistral-src), not the "mistral-7B-instruct-v0.1" model, as the target model. I hope this information helps in replicating the results.
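Swapping the target model only changes how the checkpoint is loaded. Below is a hedged sketch using the Hugging Face transformers API, assuming the Hugging Face checkpoint mistralai/Mistral-7B-v0.1; the generation settings are illustrative, not the exact ones used for the paper's evaluation.

```python
# Hedged sketch: load the base Mistral-7B-v0.1 checkpoint (not the Instruct variant).
# Generation settings are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"  # base model, as used for Table 4 (assumed HF mirror)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

compressed_prompt = "..."  # output of the compression step
inputs = tokenizer(compressed_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
answer = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer)
```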
Describe the issue
We referred to your code at https://github.com/microsoft/LLMLingua/blob/main/experiments/llmlingua2/evaluation/compress.py and https://github.com/microsoft/LLMLingua/blob/main/experiments/llmlingua2/evaluation/eval_longbench.py, with the following settings:

- target token: 2000
- compression model: llmlingua-2-bert-base-multilingual-cased-meetingbank
- LLM: Mistral-7B-Instruct-v0.1 (from https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1/tree/main)
- LongBench single-doc tasks: qasper, multifieldqa_en, narrativeqa
- Hardware platform: 1x NVIDIA A100-80GB
The results differ from those in the paper (Table 4, LLMLingua-2-small, LongBench SingleDoc, 2,000-token constraint).
The compressed prompt evaluation scores are {'qasper': 32.27, 'multifieldqa_en': 33.04, 'narrativeqa': 8.84}, average 24.7 (25.3 in the paper).
The uncompressed prompt evaluation scores are {'multifieldqa_en': 37.07, 'qasper': 33.83, 'narrativeqa': 19.89}, average 30.3 (24.5 in the paper).
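The averages above are plain means over the three tasks, computed from the scores in this issue:

```python
# Quick check of the reported averages (scores copied from this issue).
compressed = {"qasper": 32.27, "multifieldqa_en": 33.04, "narrativeqa": 8.84}
uncompressed = {"multifieldqa_en": 37.07, "qasper": 33.83, "narrativeqa": 19.89}

print(round(sum(compressed.values()) / len(compressed), 1))      # 24.7
print(round(sum(uncompressed.values()) / len(uncompressed), 1))  # 30.3
```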
What are the experiment settings used in the paper, and what could explain the difference in the evaluation results? Thank you for your reply.