microsoft / LLMLingua

To speed up LLMs' inference and enhance LLMs' perception of key information, compress the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
https://llmlingua.com/
MIT License

[Question]: Fail to reproduce llmlingua on meetingbank #171

Open jzhang538 opened 1 month ago

jzhang538 commented 1 month ago

Describe the issue

Thanks for the interesting work. I tried to reproduce the results of llmlingua on the meetingbank QA dataset with Mistral-7B as the target LLM.

The small LLM I use is https://huggingface.co/NousResearch/Llama-2-7b-hf

However, my results are much lower than those reported in Table 4 of the LLMLingua-2 paper (around 20 versus the reported 50.45). Here is my implementation:

```python
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name=args.model_name,
    model_config={},
    use_llmlingua2=False,
)

iterative_size = 200
comp_dict = compressor.compress_prompt(
    context=origin,
    instruction="",
    question="",
    rate=args.compression_rate,
    iterative_size=iterative_size,
    context_budget="*2.0",
)
```

I'm wondering if there is any issue with my implementation?

pzs19 commented 1 month ago

Hi, @jzhang538, thank you for raising the question!

I think there are two possible reasons for this gap. The first is the LLMLingua parameters, such as iterative_size or context_budget. The second is the evaluation. Note that we do not use the instruct version of Mistral in our experiments; the base model may generate lengthy responses and even append new questions to its answer, which drags the score down. So it is necessary to truncate the responses at an appropriate place.
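For illustration, here is a minimal sketch of the kind of truncation this refers to, assuming the unwanted continuation starts at markers such as a follow-up "Question:" line; the helper name and marker list are illustrative guesses, not the evaluation script actually used in the paper:

```python
def truncate_response(response: str, markers=("\nQuestion:", "\nQ:", "\nUser:")) -> str:
    """Cut the model output at the first unwanted continuation marker.

    The marker list is an assumption; adjust it to whatever patterns the
    base Mistral model tends to append after the actual answer.
    """
    cut = len(response)
    for m in markers:
        idx = response.find(m)
        if idx != -1:
            cut = min(cut, idx)
    return response[:cut].strip()

# Usage: score only the truncated answer.
raw = "The council approved the budget.\nQuestion: What was discussed next?"
print(truncate_response(raw))  # -> "The council approved the budget."
```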