Open · xvyaward opened this issue 1 month ago
Hi @xvyaward, thanks for your support of LLMLingua-2 and for sharing these detailed results. They look quite good and are generally similar to ours. Could you confirm which specific metric did not meet your expectations?
Hi @xvyaward, thanks for your interest and the very detailed description.
The multifieldqa_zh subset should be excluded here. We evaluated LLMLingua-2 on Chinese in a separate experiment; please see Table 9 of our paper for those results.
Could you please share more details on how you run inference with the Mistral model? Sampling parameters and evaluation choices, such as the temperature and whether the answer is truncated when "\n" appears, can affect the overall performance.
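For illustration, in a vLLM setup like yours these two choices would correspond to something along these lines (the values here are assumptions for the sketch, not our exact settings):

```python
from vllm import SamplingParams

# Illustrative values only: temperature=0.0 means greedy decoding,
# and stop=["\n"] cuts the generation at the first newline.
params = SamplingParams(temperature=0.0, max_tokens=32, stop=["\n"])
```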
In our experiments, we use the official Mistral GitHub repo for inference and download the model weights from mistralcdn.
I hope these explanations help.
Describe the issue
First of all, thank you for your great contributions.
I have a question similar to issue #146: I cannot reproduce the Table 4 results from the LLMLingua-2 paper.
Compression model: microsoft/llmlingua-2-xlm-roberta-large-meetingbank (downloaded from HF)
LLM: mistralai/Mistral-7B-v0.1 (also downloaded from HF; not an instruction-tuned model)
Hardware platform: 1× NVIDIA A100-80GB
Here are some results from the paper and my reproduced scores:
I'm not sure whether multifieldqa_zh should be included when computing the average of the LongBench single-doc QA scores, but even excluding it, my average is inconsistent with the paper.
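For clarity, this is how I compute the excluded-subset average (the 0.0 scores below are placeholders, not my actual numbers):

```python
# LongBench single-document QA subsets; 0.0 values are placeholders.
scores = {
    "narrativeqa": 0.0,
    "qasper": 0.0,
    "multifieldqa_en": 0.0,
    "multifieldqa_zh": 0.0,
}

# Average with the Chinese subset excluded.
en_scores = [v for k, v in scores.items() if k != "multifieldqa_zh"]
print(sum(en_scores) / len(en_scores))
```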
Here is the process I followed for the MeetingBank QA evaluation, as an example.
I modified eval_meetingbank_qa.py into eval_meetingbank_qa_local_llm.py so that it uses vLLM with a local HF Mistral-7B model, as sketched below. If there is no problem with my reproduction process, would it be possible to share the code you used for evaluation with Mistral-7B? Thank you for reading.
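For context, the vLLM inference path is roughly the following sketch (not my full script; prompt construction and scoring are omitted, and the sampling values are illustrative):

```python
from vllm import LLM, SamplingParams

# Load the local HF checkpoint with vLLM on a single A100-80GB.
llm = LLM(model="mistralai/Mistral-7B-v0.1", dtype="bfloat16")

# Greedy decoding; each answer is cut at the first newline.
sampling = SamplingParams(temperature=0.0, max_tokens=64, stop=["\n"])

prompts = ["..."]  # QA prompts built from the compressed contexts
outputs = llm.generate(prompts, sampling)
answers = [out.outputs[0].text.strip() for out in outputs]
```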