Closed ilur98 closed 1 month ago
Thank you for your interest! Are you using our LongBench evaluation protocol? We follow the implementation of the official LongBench repo. For Llama-2-7b, the context length is 4k. Because LLMs perform poorly when the serving length exceeds the pretraining length, any prompt longer than 4k is truncated to its first 2K tokens plus its last 2K tokens.
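The middle-truncation step described above can be sketched as follows (a minimal illustration of the protocol, not the repo's exact code; the function name and defaults are assumptions):

```python
def truncate_middle(token_ids, max_len=4096):
    """LongBench-style truncation: if the tokenized prompt exceeds the
    context window, keep the first half and last half of the budget
    and drop the middle tokens."""
    if len(token_ids) <= max_len:
        return token_ids
    half = max_len // 2
    return token_ids[:half] + token_ids[-half:]
```

For a 10k-token qmsum prompt this keeps tokens 0-2047 and the final 2048 tokens, so the model never sees input longer than its 4k pretraining length.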
----- Update ----- We have also reproduced your results using our script. The reason Llama-2-7b-hf fails on these two datasets is the addition of the [INST] token to the prompt. We confirmed this because Llama-2-7b-hf with an FP16 KV cache also gives poor results. After removing the [INST] tokens, our results are below:
| Llama-2-7b-hf | qasper | qmsum |
|---|---|---|
| FP16 (w/ INST token) | 5.58 | - |
| FP16 (w/o INST token) | 9.53 | 21.33 |
| 4bits_group32_residual128 (w/o INST token) | 9.35 | 21.34 |
| 2bits_group32_residual128 (w/o INST token) | 9.24 | 20.77 |
We have updated the code to fix this bug. Let me know if you still cannot reproduce the above results.
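To make the bug concrete, here is a hypothetical sketch of the prompt-construction difference (function and parameter names are illustrative, not from the KIVI codebase): the `[INST] ... [/INST]` wrapper belongs to Llama-2 *chat* models, so applying it to the base Llama-2-7b-hf model degrades LongBench scores.

```python
def build_prompt(task_input, use_inst_tokens):
    """Wrap the LongBench prompt for chat models; pass it through raw for
    base models such as Llama-2-7b-hf, which were not trained with [INST]."""
    if use_inst_tokens:
        return f"[INST] {task_input} [/INST]"  # appropriate for -chat variants only
    return task_input  # base model: raw prompt, matching the fixed script
```

With the fix, the base model is evaluated on the raw prompt, which recovers the qasper/qmsum numbers in the table above.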
Thank you for clearing up my confusion.
This is nice work! When I tried to benchmark Llama-2-7b with KIVI (Llama-2-7b-hf_4096_4bits_group32_residual128), I found that the scores on the tasks "qmsum" and "qasper" are much lower than those reported in the paper. I also noticed that the sequence length of these two tasks is much longer than 4096; for example, the average length of "qmsum" is about 10k. This may be why performance on these two tasks is so poor. What should I do to obtain the results in the paper? Below are the results I get:

```json
{
  "samsum": 41.59,
  "multi_news": 1.32,
  "qmsum": 7.07,
  "repobench-p": 59.97,
  "trec": 66.0,
  "triviaqa": 87.72,
  "qasper": 5.96,
  "lcc": 66.76
}
```