Open ilil96 opened 2 months ago
Hi,
For Llama-3 Instruct models, please add the prompt template as shown here. We've updated the code in pred_long_bench.py accordingly. Please give it a try, and feel free to ask if you have any questions!
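For illustration, the change essentially wraps the raw LongBench prompt with the Llama-3 Instruct chat template before generation. Below is only a minimal sketch assuming the Hugging Face `tokenizer.apply_chat_template` API; it is not the exact code in pred_long_bench.py, so please refer to the updated script for the details:

```python
# Rough sketch (assumes the Hugging Face transformers API); the actual logic
# lives in pred_long_bench.py and may differ in detail.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

def build_llama3_prompt(raw_prompt: str) -> str:
    # Wrap the LongBench prompt in the Llama-3 Instruct chat template so the
    # model sees the special header/EOT tokens it was instruction-tuned with.
    messages = [{"role": "user", "content": raw_prompt}]
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
```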
Thanks!
The meta-llama/Meta-Llama-3-8B-Instruct results in KIVI's Table 8 are actually borrowed from our recent KV cache compression benchmark paper (https://arxiv.org/abs/2407.01527), and we have just open-sourced its codebase at https://github.com/henryzhongsc/longctx_bench. The KIVI paper was done prior to the release of Llama 3, so we may not have included all the necessary support for these newer models (such as the template issue above) in KIVI's public codebase.
In any case, I can confirm that our Table 8 results in KIVI are reproducible via the two scripts (Llama 3 baseline, Llama 3 with KIVI-2). For your convenience, here's the task summary:
Llama-3-8B-Instruct Baseline
{
"individual_dataset_result": {
"narrativeqa": 21.71,
"qasper": 44.24,
"multifieldqa_en": 44.54,
"hotpotqa": 46.82,
"2wikimqa": 36.42,
"musique": 21.49,
"gov_report": 30.04,
"qmsum": 22.57,
"multi_news": 27.86,
"trec": 74.5,
"triviaqa": 90.23,
"samsum": 42.63,
"passage_retrieval_en": 67.0,
"lcc": 57.04,
"repobench-p": 51.12,
"passage_count": 7.0
},
"task_average_result": {
"single_doc_qa": 36.83,
"multi_doc_qa": 34.91,
"summarization": 26.82,
"few_shots": 69.12,
"synthetic": 67.0,
"code": 54.08
},
"LB_average_result": 45.21
}
Llama-3-8B-Instruct with KIVI-2bit
{
"individual_dataset_result": {
"narrativeqa": 21.35,
"qasper": 43.15,
"multifieldqa_en": 44.23,
"hotpotqa": 46.79,
"2wikimqa": 37.05,
"musique": 20.56,
"gov_report": 29.77,
"qmsum": 22.1,
"multi_news": 27.48,
"trec": 74.5,
"triviaqa": 90.54,
"samsum": 42.48,
"passage_retrieval_en": 67.5,
"lcc": 50.84,
"repobench-p": 46.67,
"passage_count": 7.0
},
"task_average_result": {
"single_doc_qa": 36.24,
"multi_doc_qa": 34.8,
"summarization": 26.45,
"few_shots": 69.17,
"synthetic": 67.5,
"code": 48.76
},
"LB_average_result": 44.33
}
(Note that we excluded passage_count from the average results, as it is very much an outlier.)
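In case it is useful, here is a small sketch (not taken from the longctx_bench code itself) of how the task averages above can be recomputed from the individual dataset scores, assuming the standard LongBench category grouping and dropping passage_count:

```python
# Sketch only: recompute the task averages reported above from the
# individual dataset scores. Grouping follows the standard LongBench
# categories; "passage_count" is excluded from all averages, as noted above.

TASK_GROUPS = {
    "single_doc_qa": ["narrativeqa", "qasper", "multifieldqa_en"],
    "multi_doc_qa": ["hotpotqa", "2wikimqa", "musique"],
    "summarization": ["gov_report", "qmsum", "multi_news"],
    "few_shots": ["trec", "triviaqa", "samsum"],
    "synthetic": ["passage_retrieval_en"],  # passage_count excluded
    "code": ["lcc", "repobench-p"],
}

def summarize(individual: dict) -> dict:
    # Per-task averages over the datasets in each group.
    task_avg = {
        task: round(sum(individual[d] for d in datasets) / len(datasets), 2)
        for task, datasets in TASK_GROUPS.items()
    }
    # Overall LongBench average over all included datasets.
    used = [d for datasets in TASK_GROUPS.values() for d in datasets]
    lb_avg = round(sum(individual[d] for d in used) / len(used), 2)
    return {"task_average_result": task_avg, "LB_average_result": lb_avg}
```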
Hello,
I ran the code provided for LongBench using the Llama-3-8B-Instruct model but couldn't reproduce the results reported in Table 8 of your paper. Specifically, the full-precision baseline's score for Qasper in my run is 32.11, while the reported score is 44.24.
I used the following command to run the model:
python pred_long_bench.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct --k_bits 16 --v_bits 16
Is there anything I might be missing?