microsoft / LLMLingua

To speed up LLM inference and enhance the LLM's perception of key information, LLMLingua compresses the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
https://llmlingua.com/
MIT License

[Question]: LongBench BM25 reproduce #161

Open JUNE515 opened 4 months ago

JUNE515 commented 4 months ago

Describe the issue

I'm interested in your LongLLMLingua results on LongBench. I reproduced the LongBench BM25 2,000-token constraint using ChatGPT, but unlike your paper's results, my performance is too high: the trec task score is 72.5, and most of the other tasks are also high. I would like to know how you produced the BM25 result. Below are the parameters I used to reproduce BM25; I'd appreciate it if you could tell me which ones differ. I use the same split and parameters for the other tasks (only q_format and first_inst change according to the original LongBench config).

Thank you

first_inst="Please determine the type of the question below. Here are some examples of questions." q_format="{input}" question= q_format.format(input=input) instruction=first_inst contexts_list = df['ctxs'][i].split("\n") contexts_list = [ "\n".join(contexts_list[ii : ii + 4]) for ii in range(0, len(contexts_list), 4) ] compressed_prompt = llm_lingua.compress_prompt( contexts_list, instruction=instruction, question=question, target_token=1800, condition_compare=True, condition_in_question="after", rank_method="bm25", use_sentence_level_filter=False, use_token_level_filter=False, context_budget="+100", dynamic_context_compression_ratio=0.4, # enable dynamic_context_compression_ratio )

iofu728 commented 4 months ago

Hi @JUNE515, thanks for your support in LLMLingua. I checked the parameters you used and found that your actual compression rate might be relatively low. You can refer to the following code:

compressed_prompt = llm_lingua.compress_prompt(
    contexts_list,
    "",
    question,
    target_token=2048,
    use_sentence_level_filter=True,
    condition_in_question="none",
    reorder_context=False,
    dynamic_context_compression_ratio=0,
    condition_compare=False,
    concate_question=False,
    context_budget="+0",
    use_demonstrate_level_filter=True,
    use_token_level_filter=False,
    rank_method="bm25",
    token_budget_ratio=1.0
)
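For context, both snippets above assume an existing llm_lingua compressor object. A minimal setup sketch follows; the base model name and device are illustrative assumptions, not values stated in this thread:

from llmlingua import PromptCompressor

# Illustrative setup; model_name and device_map are assumptions, not
# values confirmed by the maintainers in this thread.
llm_lingua = PromptCompressor(
    model_name="NousResearch/Llama-2-7b-hf",
    device_map="cuda",
)

result = llm_lingua.compress_prompt(
    contexts_list,   # list of context chunks, as in the snippets above
    "",              # instruction
    question,
    target_token=2048,
    rank_method="bm25",
)
print(result["compressed_prompt"])  # the compressed prompt text

The returned dictionary also includes token counts (e.g. origin_tokens and compressed_tokens), which is the easiest way to check the actual compression rate mentioned above.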
JUNE515 commented 4 months ago

Thanks for your response @iofu728.

I have one more question. I also reproduced the LongBench LongLLMLingua 2,000-token constraint using ChatGPT, but I get 22.0 on the summarization tasks (5.4 lower), 65.1 on the few-shot tasks (4.2 lower), and 49.4 on the code tasks (7.2 lower). My results seem low even though I used the same split method and context parameters.

I applied the same split method and parameters as the repobench-p example in your code.ipynb. I would like to know how you produced the LongLLMLingua result.

Thank you

iofu728 commented 4 months ago

Hi @JUNE515, thanks for your support.

You can reference the LongBench script at https://github.com/microsoft/LLMLingua/blob/main/experiments/llmlingua2/evaluation/eval_longbench.py. Our experiments run in completion mode. For more details, you can refer to https://github.com/microsoft/LLMLingua/blob/main/Transparency_FAQ.md#how-to-reproduce-the-result-in-llmlingua-series-work.
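In case "completion mode" is unclear: the compressed prompt is sent to the model as a single text completion rather than as a chat conversation. A rough sketch with the OpenAI Python client is shown below; the model choice, max_tokens, and variable names are assumptions for illustration, not the exact code in eval_longbench.py:

from openai import OpenAI

# Rough illustration of a completion-mode request; use the evaluation
# script linked above to actually reproduce the paper numbers.
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.completions.create(
    model="gpt-3.5-turbo-instruct",       # assumed completion-capable model
    prompt=result["compressed_prompt"],   # compressed prompt from LLMLingua
    max_tokens=64,
    temperature=0.0,
)
answer = response.choices[0].text.strip()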