microsoft / LLMLingua

To speed up LLM inference and enhance LLMs' perception of key information, LLMLingua compresses the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
https://llmlingua.com/
MIT License

[Question]: Reproduce LLMLingua-2 results with Mistral-7B #155

Open xvyaward opened 1 month ago

xvyaward commented 1 month ago

Describe the issue

First of all, thank you for your great contributions.

I have a question similar to issue #146: I cannot reproduce the Table 4 results from the LLMLingua-2 paper.

Compression model: microsoft/llmlingua-2-xlm-roberta-large-meetingbank (downloaded from HF)
LLM: mistralai/Mistral-7B-v0.1 (also downloaded from HF, not an instruction-tuned model)
Hardware platform: 1x NVIDIA A100-80GB
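
For context, the library-level equivalent of my compression step looks roughly like this. This is only a sketch: I actually run the repo's compress.py script (see step 2 below), and the snippet just mirrors the same model and settings; `original_prompt` is a placeholder for a MeetingBank transcript.

```python
# Sketch of the compression setup; the real run uses compress.py (step 2 below).
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

original_prompt = "..."  # placeholder: a MeetingBank transcript

result = compressor.compress_prompt(
    original_prompt,
    rate=0.33,                           # as in --compression_rate 0.33
    force_tokens=["\n", "?", "!", "."],  # as in --force_tokens
)
compressed_prompt = result["compressed_prompt"]
```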

Here are some results from the paper and my reproduced scores:

| | MeetingBank QA | MeetingBank Summary | LongBench 2000-token avg. | narrativeqa | multifieldqa_en | multifieldqa_zh | qasper |
|---|---|---|---|---|---|---|---|
| LLMLingua-2 (paper) | 76.22 | 30.18 | 26.8 | – | – | – | – |
| Original prompt (paper) | 66.95 | 26.26 | 24.5 | – | – | – | – |
| LLMLingua-2 (reproduced) | 73.59 | 29.95 | 25.65 | 10.07 | 36.61 | 26.47 | 29.46 |
| Original prompt (reproduced) | 66.05 | 26.89 | 26.47 | 10.05 | 38.7 | 31.46 | 25.67 |

I'm not sure whether multifieldqa_zh should be included when calculating the average of the LongBench single-doc QA scores, but even excluding it the average is still inconsistent (for LLMLingua-2, (10.07 + 36.61 + 29.46) / 3 ≈ 25.38, vs. 26.8 in the paper).

Here is the example process that I followed for MeetingBank QA evaluation.

  1. I made meetingbank_test_3qa_pairs_summary_formated.json by modifying format_data.py.
  2. Made compressed_prompt using
    python compress.py --load_origin_from ../../../results/meetingbank/origin/meetingbank_test_3qa_pairs_summary_formated.json \
    --model_name microsoft/llmlingua-2-xlm-roberta-large-meetingbank \
    --compression_rate 0.33 \
    --force_tokens "\n,?,!,." \
    --save_path ../../../results/meetingbank/llmlingua2/compression_ratio33_meetingbank_test_3qa_pairs_summary_formated.json
  3. Evaluated with
    python eval_meetingbank_qa_local_llm.py --load_prompt_from ../../../results/meetingbank/llmlingua2/compression_ratio33_meetingbank_test_3qa_pairs_summary_formated.json \
    --load_key compressed_prompt \
    --model_name_or_path mistralai/Mistral-7B-v0.1 \
    --save_path ../../../results/meetingbank/llmlingua2/mistral_7b/answer_ratio33_meetingbank_test_3qa_pairs_summary_formated.json

    I modified eval_meetingbank_qa.py into eval_meetingbank_qa_local_llm.py so that it uses vLLM with a local HF Mistral-7B model (roughly as in the sketch below). If there is no problem with the reproduction process above, would it be possible to share the code you used for evaluation with Mistral-7B? Thank you for reading.
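
For reference, the vLLM call in my modified script is shaped roughly like this; the QA prompt template and generation settings here are simplified placeholders, not necessarily the exact values in my script.

```python
# Simplified sketch of the vLLM-based evaluation in eval_meetingbank_qa_local_llm.py;
# the prompt template and sampling settings are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-v0.1", dtype="bfloat16")
sampling_params = SamplingParams(temperature=0.0, max_tokens=100)

def answer_question(compressed_prompt: str, question: str) -> str:
    prompt = f"{compressed_prompt}\n\nQuestion: {question}\nAnswer:"
    output = llm.generate([prompt], sampling_params)[0]
    return output.outputs[0].text.strip()
```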

iofu728 commented 1 month ago

Hi @xvyaward, thanks for your support of LLMLingua-2 and for sharing such detailed results. These results look quite good and are generally similar to ours. Could you confirm which specific metric you are most concerned about, i.e., which one did not meet your expectations?

pzs19 commented 1 month ago

Hi @xvyaward, thanks for your interest and the very detailed description.

  1. multifieldqa_zh should be excluded here. We evaluated the performance of LLMLingua-2 on Chinese in a separate experiment; please refer to Table 9 of our paper for those results.

  2. Could you please share more details on how you run inference with the Mistral model? The sampling parameters and evaluation strategy can affect the overall scores, for example the temperature and whether the answer is truncated at the first "\n" (see the sketch below for the kind of settings that matter).

For our experiments, we used the official Mistral GitHub repo for inference and downloaded the model from mistralcdn.
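
As an illustration only (this is not our evaluation code, and the parameter values are just examples), these are the kinds of choices that can move the QA scores:

```python
# Illustrative only: generic HF Transformers inference showing two choices that
# often shift QA scores -- greedy vs. sampled decoding, and newline truncation.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "<compressed context>\n\nQuestion: <question>\nAnswer:"  # placeholder
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=False,  # greedy decoding, i.e. temperature effectively 0
)
answer = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# Whether (and how) the prediction is truncated also matters:
answer = answer.split("\n")[0].strip()
```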

I hope these explanations help.