microsoft / LLMLingua

Compresses the prompt and KV-Cache to speed up LLM inference and improve the model's perception of key information, achieving up to 20x compression with minimal performance loss.
https://llmlingua.com/
MIT License
4.42k stars 241 forks

[Question]: How to get `meetingbank_test_3qa_pairs_summary_formated.json`? #170

Open mzf666 opened 1 month ago

mzf666 commented 1 month ago

Describe the issue

When I try to run the script experiments/llmlingua2/evaluation/scripts/compress.sh, the code for constructing ../../../results/meetingbank_short/origin/meetingbank_test_3qa_pairs_summary_formated.json seems to be missing. Similarly, I cannot find the construction code for ../../../results/longbench/origin/longbench_test_single_doc_qa_formated.json, ../../../results/zero_scrolls/origin/zero_scrolls_validation.json, or ../../../results/gsm8k/origin/gsm8k_cot_example_all_in_one.json.

May I know how to construct these JSON-formatted data files? Thanks for your consideration!

pzs19 commented 1 month ago

Hi, @mzf666, thank you for raising the question.

We have published meetingbank_test_3qa_pairs_summary_formated.json on Hugging Face. For LongBench, you can refer to the format_data scripts and the LongBench repo.

cornzz commented 2 weeks ago

@mzf666 I figured out how to get the dataset into the appropriate format for compress.sh:

from datasets import load_dataset
import json
import os

# Paths are relative to the current working directory.
out_dir = "results/meetingbank_short/origin"
out_path = os.path.join(out_dir, "meetingbank_test_3qa_pairs_summary_formated.json")

os.makedirs(out_dir, exist_ok=True)
if not os.path.exists(out_path):
    # Fetch the test split and dump it as a JSON list of records.
    meeting_bank_comp = load_dataset("microsoft/MeetingBank-QA-Summary", split="test")
    # Use a context manager so the file is flushed and closed after writing.
    with open(out_path, "w") as f:
        json.dump(meeting_bank_comp.to_list(), f)
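
To check the dump before running compress.sh, a small sanity check along these lines might help. This is only a sketch: `summarize_json_dump` is a hypothetical helper, not part of LLMLingua. It assumes only that the file holds a JSON list of records, which is what `to_list()` produces, and it makes no assumption about the field names.

```python
import json


def summarize_json_dump(path):
    """Return (record_count, sorted field names of the first record)."""
    with open(path) as f:
        records = json.load(f)
    # The dump written from Dataset.to_list() should be a top-level JSON list.
    if not isinstance(records, list):
        raise ValueError(f"expected a JSON list, got {type(records).__name__}")
    first = records[0] if records else {}
    # Report field names only when the first record is a JSON object.
    keys = sorted(first) if isinstance(first, dict) else []
    return len(records), keys
```

Calling `summarize_json_dump("results/meetingbank_short/origin/meetingbank_test_3qa_pairs_summary_formated.json")` should return the number of records and the field names of the first one, which can then be compared against what the evaluation scripts expect.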