Unable to Reproduce Results for LongBench

ilil96 commented 2 months ago

Hello,

I ran the code provided for LongBench using the Llama-3-8B-Instruct model but couldn't reproduce the results reported in Table 8 of your paper. Specifically, the full precision baseline model's score for Qasper in my run is 32.11, while the reported score is 44.24.

I used the following command to run the model: python pred_long_bench.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct --k_bits 16 --v_bits 16

Is there anything I might be missing?

jy-yuan commented 2 months ago

Hi,

For Llama-3 Instruct models, please add the prompt template as shown here. We've updated the code in pred_long_bench.py accordingly. Please give it a try, and feel free to ask if you have any questions!

Thanks!

henryzhongsc commented 2 months ago

The meta-llama/Meta-Llama-3-8B-Instruct results in KIVI's Table 8 are actually borrowed from our recent KV cache compression benchmark paper (https://arxiv.org/abs/2407.01527), and we have just open-sourced its codebase at https://github.com/henryzhongsc/longctx_bench. The KIVI paper was done prior to the release of Llama 3, and we might not have included all the necessary supports for these new models — like the template issue above — in KIVI's public codebase.

In any case, I can confirm that our Table 8 results in KIVI are reproducible via the two scripts (Llama 3 baseline, Llama 3 with KIVI-2). For your convenience, here's the task summary:

Llama-3-8B-Instruct Baseline

{
    "individual_dataset_result": {
        "narrativeqa": 21.71,
        "qasper": 44.24,
        "multifieldqa_en": 44.54,
        "hotpotqa": 46.82,
        "2wikimqa": 36.42,
        "musique": 21.49,
        "gov_report": 30.04,
        "qmsum": 22.57,
        "multi_news": 27.86,
        "trec": 74.5,
        "triviaqa": 90.23,
        "samsum": 42.63,
        "passage_retrieval_en": 67.0,
        "lcc": 57.04,
        "repobench-p": 51.12,
        "passage_count": 7.0
    },
    "task_average_result": {
        "single_doc_qa": 36.83,
        "multi_doc_qa": 34.91,
        "summarization": 26.82,
        "few_shots": 69.12,
        "synthetic": 67.0,
        "code": 54.08
    },
    "LB_average_result": 45.21
}

Llama-3-8B-Instruct with KIVI-2bit

{
    "individual_dataset_result": {
        "narrativeqa": 21.35,
        "qasper": 43.15,
        "multifieldqa_en": 44.23,
        "hotpotqa": 46.79,
        "2wikimqa": 37.05,
        "musique": 20.56,
        "gov_report": 29.77,
        "qmsum": 22.1,
        "multi_news": 27.48,
        "trec": 74.5,
        "triviaqa": 90.54,
        "samsum": 42.48,
        "passage_retrieval_en": 67.5,
        "lcc": 50.84,
        "repobench-p": 46.67,
        "passage_count": 7.0
    },
    "task_average_result": {
        "single_doc_qa": 36.24,
        "multi_doc_qa": 34.8,
        "summarization": 26.45,
        "few_shots": 69.17,
        "synthetic": 67.5,
        "code": 48.76
    },
    "LB_average_result": 44.33
}

(Note we excluded passage_count from average results as this is very much an outlier.)

jy-yuan / KIVI

Unable to Reproduce Results for LongBench #27