xvyaward opened this issue 6 months ago
Hi @xvyaward, thanks for your support of LLMLingua-2 and for sharing the detailed results. These results look quite good and are generally close to ours. Could you confirm which specific metric you are most concerned about that did not meet your expectations?
Hi @xvyaward, thanks for your interest and the very detailed description.
The multifieldqa_zh task should be excluded here. As for Chinese, we evaluated the performance of LLMLingua-2 on Chinese in a separate experiment; please refer to Table 9 of our paper for the results.
Could you please share more information on how you run the Mistral model for inference? The sampling parameters and evaluation strategies, such as the temperature and whether to truncate the answer when "\n" appears, can have an impact on the overall performance.
As for our experiment, we use the official GitHub repo of Mistral for inference and download the model from mistralcdn.
Hope these explanations can help you.
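For concreteness, "truncate the answer when "\n" appears" refers to a post-processing step of the following kind; this is only a minimal illustration of that choice, not the authors' exact evaluation code:

```python
# Minimal illustration of the evaluation-side choice mentioned above:
# keep only the generated text up to the first newline.
def postprocess_answer(generated_text: str) -> str:
    return generated_text.split("\n")[0].strip()

print(postprocess_answer("The council approved the item.\nQuestion: ..."))
# -> "The council approved the item."
```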
Describe the issue
First of all, thank you for your great contributions.
I have a similar question to issue 146: I cannot reproduce the Table 4 results in the LLMLingua-2 paper.
- Compression model: microsoft/llmlingua-2-xlm-roberta-large-meetingbank (downloaded from HF)
- LLM: mistralai/Mistral-7B-v0.1 (also downloaded from HF, not an instruction-tuned model)
- Hardware platform: 1 NVIDIA A100-80GB
Here are some results from the paper and my reproduced scores:
|  | MeetingBank QA | MeetingBank Summary | LongBench SingleDoc avg. (2000 tokens) | narrativeqa | multifieldqa_en | multifieldqa_zh | qasper |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLMLingua-2 (paper) | 76.22 | 30.18 | 26.8 | – | – | – | – |
| Original prompt (paper) | 66.95 | 26.26 | 24.5 | – | – | – | – |
| LLMLingua-2 (reproduced) | 73.59 | 29.95 | 25.65 | 10.07 | 36.61 | 26.47 | 29.46 |
| Original prompt (reproduced) | 66.05 | 26.89 | 26.47 | 10.05 | 38.7 | 31.46 | 25.67 |

I'm not sure whether I should include multifieldqa_zh when calculating the average of the LongBench SingleDoc QA scores, but even excluding it gives an inconsistent average (including it: (10.07 + 36.61 + 26.47 + 29.46) / 4 = 25.65; excluding it: (10.07 + 36.61 + 29.46) / 3 ≈ 25.38; the paper reports 26.8).

Here is the example process that I followed for the MeetingBank QA evaluation:
- I made meetingbank_test_3qa_pairs_summary_formated.json by modifying format_data.py.
- Made the compressed_prompt entries using:
```
python compress.py --load_origin_from ../../../results/meetingbank/origin/meetingbank_test_3qa_pairs_summary_formated.json \
    --model_name microsoft/llmlingua-2-xlm-roberta-large-meetingbank \
    --compression_rate 0.33 \
    --force_tokens "\n,?,!,." \
    --save_path ../../../results/meetingbank/llmlingua2/compression_ratio33_meetingbank_test_3qa_pairs_summary_formated.json
```
- Evaluated with:
```
python eval_meetingbank_qa_local_llm.py --load_prompt_from ../../../results/meetingbank/llmlingua2/compression_ratio33_meetingbank_test_3qa_pairs_summary_formated.json \
    --load_key compressed_prompt \
    --model_name_or_path mistralai/Mistral-7B-v0.1 \
    --save_path ../../../results/meetingbank/llmlingua2/mistral_7b/answer_ratio33_meetingbank_test_3qa_pairs_summary_formated.json
```
I modified eval_meetingbank_qa.py into eval_meetingbank_qa_local_llm.py to use vLLM with the local HF Mistral-7B model. If there is no problem with the reproduction process, would it be possible to share the code you used for evaluation with Mistral-7B? Thank you for reading.
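For context, the vLLM-side modification amounts to something like the following sketch. It is illustrative only, not the actual eval_meetingbank_qa_local_llm.py; the sampling settings and the JSON field names other than compressed_prompt are assumptions:

```python
# Illustrative sketch of a vLLM-based answer-generation loop for the compressed prompts.
# Sampling settings (greedy decoding, 100 new tokens, stop at "\n") and the field names
# other than "compressed_prompt" are assumptions, not necessarily what the repo uses.
import json
from vllm import LLM, SamplingParams

TEMPLATE = ("Write a high-quality answer for the given question using the provided "
            "meeting transcript (which may be compressed).\n{transcript}\nQuestion:{question}\nAnswer:")

with open("compression_ratio33_meetingbank_test_3qa_pairs_summary_formated.json") as f:
    samples = json.load(f)

# Assumed structure: each sample carries a compressed transcript and its QA pairs.
prompts = [TEMPLATE.format(transcript=s["compressed_prompt"], question=qa["question"])
           for s in samples for qa in s["QA_pairs"]]

llm = LLM(model="mistralai/Mistral-7B-v0.1")
params = SamplingParams(temperature=0.0, max_tokens=100, stop=["\n"])
answers = [o.outputs[0].text.strip() for o in llm.generate(prompts, params)]
```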
Thanks for sharing your issue. May I know how to modify `format_data.py` to obtain `meetingbank_test_3qa_pairs_summary_formated.json`? I am not sure how to conduct this step.
Edit: TL;DR - the model (downloaded from mistralcdn and run with the mistral-inference library) starts generating nonsense (see below) at sequence lengths of 2000-2300 tokens, which is far below the theoretical context window. Using the Hugging Face version, I roughly reproduce your results, but even then it starts doing the same at ~4k tokens.
My main question: which exact revision of the inference library did you use, and what `n_max_tokens` value was used to truncate the prompts?
Hi @pzs19, I am also currently trying to reproduce the results using Mistral 7B v0.1 downloaded from mistralcdn, using the mistralai/mistral-inference repository. My results are currently not even close to the results from the paper (less than 40), so I assume I must be doing something wrong.
I am running on an A100 40GB. This is my code, roughly:
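A minimal sketch of this setup with the mistral-inference library is below. Paths and generation settings are placeholders, and the exact classes and `generate` signature vary between library revisions, which is exactly the open question here:

```python
# Sketch of text completion with the mistral-inference library. Class/function names
# follow recent releases and may differ in older revisions of the library.
# Paths and generation settings are placeholders.
from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

MODEL_DIR = "/path/to/mistral-7B-v0.1"  # checkpoint downloaded from mistralcdn

mistral_tokenizer = MistralTokenizer.from_file(f"{MODEL_DIR}/tokenizer.model")
tok = mistral_tokenizer.instruct_tokenizer.tokenizer  # raw tokenizer for plain text completion
model = Transformer.from_folder(MODEL_DIR)

transcript, question = "<compressed transcript>", "<question>"  # placeholders
prompt = ("Write a high-quality answer for the given question using the provided "
          f"meeting transcript (which may be compressed).\n{transcript}\nQuestion:{question}\nAnswer:")

tokens = tok.encode(prompt, bos=True, eos=False)
out_tokens, _ = generate([tokens], model, max_tokens=100, temperature=0.0)  # eos_id not set
answer = tok.decode(out_tokens[0]).split("\n")[0].strip()
print(answer)
```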
I am not setting `eos_id`, as the provided evaluation script takes care of truncating the answer. I am also using the prompt template provided in `eval_meetingbank_qa.py`: `"Write a high-quality answer for the given question using the provided meeting transcript (which may be compressed).\n{transcript}\nQuestion:{question}\nAnswer:"`
A problem I ran into is that the model often replies with a bunch of newline characters and nothing else. I believe this happens when the input prompt exceeds a certain length:
Could you provide more detailed information on how you performed inference using Mistral 7B? For example, what `n_max_token` parameter was used, and which version of the `mistral-inference` library?
Edit: I investigated a bit further; it seems the response quality starts to deteriorate heavily around a prompt length of 2000 tokens, and beyond a length of around 2300 tokens the answers consist exclusively of newline characters. I don't understand why; the context window is supposedly 8k?
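If the degradation is length-related, one way to check is to count prompt tokens and cap the transcript at a budget before formatting the prompt. The sketch below is only a rough workaround under that assumption; keeping the head of the transcript is an arbitrary illustrative choice, not necessarily how the authors' n_max_token handling works:

```python
# Rough length check / workaround: trim the transcript so the full prompt stays under a
# token budget. Head-of-transcript truncation here is an illustrative choice only.
from transformers import AutoTokenizer

TEMPLATE = ("Write a high-quality answer for the given question using the provided "
            "meeting transcript (which may be compressed).\n{transcript}\nQuestion:{question}\nAnswer:")
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

def build_prompt(transcript: str, question: str, budget: int = 2000) -> str:
    """Return the formatted prompt, trimming the transcript so the prompt fits the budget."""
    prompt = TEMPLATE.format(transcript=transcript, question=question)
    ids = tok(prompt)["input_ids"]
    if len(ids) <= budget:
        return prompt
    overflow = len(ids) - budget
    t_ids = tok(transcript, add_special_tokens=False)["input_ids"]
    trimmed = tok.decode(t_ids[: max(0, len(t_ids) - overflow)])
    return TEMPLATE.format(transcript=trimmed, question=question)
```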
Edit 2: Is the prompt template from `eval_meetingbank_qa.py` exactly the one used in your evaluation? I am asking because it is missing a space after `Answer:`, which sometimes causes the model to ignore the question, while it does answer the question if a space is added, making me wonder whether this affects the benchmark score considerably.
Edit 3: I suspect I am either using the mistral-inference library incorrectly, or it has been changed considerably since you used it for your evaluation. If I use the same model through Hugging Face it performs very well, with the results matching @xvyaward's results... Any information on your setup would be very helpful, especially the version / revision of the mistral-inference repo.
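For reference, loading the HF checkpoint with transformers looks roughly like the sketch below (the generation settings, greedy decoding and 100 new tokens, are assumptions):

```python
# Minimal sketch of running the HF checkpoint with transformers for comparison.
# Generation settings (greedy, 100 new tokens) are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "..."  # same template as in eval_meetingbank_qa.py
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=100, do_sample=False)
answer = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
answer = answer.split("\n")[0].strip()
```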