Closed ilur98 closed 1 month ago
Thank you for your interest! Are you using our LongBench evaluation protocol? We follow the implementation of the official LongBench repo. For Llama-2-7b, the context length is 4k. Because LLMs perform poorly when the serving length exceeds the pretraining length, any prompt longer than 4k is truncated to its first 2K tokens plus its last 2K tokens.
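The middle-truncation step described above can be sketched as follows (a minimal illustration of the protocol, not the repo's exact code; the function name and defaults are assumptions):

```python
def truncate_middle(token_ids, max_len=4096):
    """LongBench-style truncation: if the tokenized prompt exceeds the
    context window, keep the first half and last half of the budget
    and drop the middle tokens."""
    if len(token_ids) <= max_len:
        return token_ids
    half = max_len // 2
    return token_ids[:half] + token_ids[-half:]
```

For a 10k-token qmsum prompt this keeps tokens 0-2047 and the final 2048 tokens, so the model never sees input longer than its 4k pretraining length.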
----- Update ----- We have also reproduced your results using our script. The reason Llama-2-7b-hf fails on these two datasets is the addition of the [INST] token to the prompt. We confirmed this because Llama-2-7b-hf with an FP16 KV cache also gives poor results. After removing the [INST] tokens, our results are below:
| Llama-2-7b-hf | qasper | qmsum |
|---|---|---|
| FP16 (w/ INST token) | 5.58 | - |
| FP16 (w/o INST token) | 9.53 | 21.33 |
| 4bits_group32_residual128 (w/o INST token) | 9.35 | 21.34 |
| 2bits_group32_residual128 (w/o INST token) | 9.24 | 20.77 |
We have updated the code to fix this bug. Let me know if you still cannot reproduce the above results.
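To make the bug concrete, here is a hypothetical sketch of the prompt-construction difference (function and parameter names are illustrative, not from the KIVI codebase): the `[INST] ... [/INST]` wrapper belongs to Llama-2 *chat* models, so applying it to the base Llama-2-7b-hf model degrades LongBench scores.

```python
def build_prompt(task_input, use_inst_tokens):
    """Wrap the LongBench prompt for chat models; pass it through raw for
    base models such as Llama-2-7b-hf, which were not trained with [INST]."""
    if use_inst_tokens:
        return f"[INST] {task_input} [/INST]"  # appropriate for -chat variants only
    return task_input  # base model: raw prompt, matching the fixed script
```

With the fix, the base model is evaluated on the raw prompt, which recovers the qasper/qmsum numbers in the table above.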
Thank you for clearing up my confusion.
This is nice work! When I tried to benchmark Llama-2-7b with KIVI (Llama-2-7b-hf_4096_4bits_group32_residual128), I found that the scores on the tasks "qmsum" and "qasper" are much lower than those reported in the paper. I also noticed that the sequence length of these two tasks is much longer than 4096; for example, the average length of "qmsum" is about 10k. This may be why performance on these two tasks is so poor. What should I do to obtain the results in the paper? Below are the results I get:

```json
{
  "samsum": 41.59,
  "multi_news": 1.32,
  "qmsum": 7.07,
  "repobench-p": 59.97,
  "trec": 66.0,
  "triviaqa": 87.72,
  "qasper": 5.96,
  "lcc": 66.76
}
```