open-compass / opencompass

OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
https://opencompass.org.cn/
Apache License 2.0

[Bug] Large score gaps on some subsets of the long-text evaluation datasets #1061

Open bullw opened 4 months ago

bullw commented 4 months ago

Prerequisite

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

OpenCompass 0.2.3, transformers 4.35.2, GPU: A100

Reproduces the problem - code/configuration sample

None

Reproduces the problem - command or script

python run.py --datasets longbench leval \
              --hf-path /code/open_model/chatglm2-6b-32k \
              --model-kwargs device_map='auto' \
              --max-seq-len 32768 \
              --batch-size 1 \
              --max-out-len 512 \
              --num-gpus 1 \
              --max-partition-size 5000 \
              --max-workers-per-gpu 3 \
              --engine torch
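
For reference, the same run can be expressed as a Python config instead of CLI flags. This is a minimal sketch: the dataset import paths (.datasets.longbench.longbench, .datasets.leval.leval) and the HuggingFaceCausalLM fields follow the configs/ layout of OpenCompass 0.2.x and may differ in other versions.

from mmengine.config import read_base
from opencompass.models import HuggingFaceCausalLM

with read_base():
    # Assumed aggregate configs collecting every LongBench / L-Eval subset;
    # check configs/datasets/ in your checkout for the exact module names.
    from .datasets.longbench.longbench import longbench_datasets
    from .datasets.leval.leval import leval_datasets

datasets = longbench_datasets + leval_datasets

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr='chatglm2-6b-32k',
        path='/code/open_model/chatglm2-6b-32k',
        tokenizer_path='/code/open_model/chatglm2-6b-32k',
        model_kwargs=dict(device_map='auto', trust_remote_code=True),
        tokenizer_kwargs=dict(trust_remote_code=True),
        max_seq_len=32768,
        max_out_len=512,
        batch_size=1,
        run_cfg=dict(num_gpus=1),
    )
]

Saved under configs/ (for example as configs/eval_chatglm2_6b_32k_long.py, a hypothetical file name), it would be launched with: python run.py configs/eval_chatglm2_6b_32k_long.py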

Reproduces the problem - error message

I reproduced most of the scores using the official configuration, but a few subsets show a larger score gap. Please help me analyze this.

There is a big difference between the scores below and those on the leaderboard:

I have the following questions:

  1. The scores for the four datasets - Coursera, NarrativeQA, NQ, and Topic Retrieval - evaluated with L-Eval show significant differences of -9.3, -9.42, -14.29, and 8 points respectively. Could you please tell me the reasons for these errors? What is an acceptable level of error? How can I reduce the error to make the results more accurate? (One way to re-read the per-subset scores from my run is sketched after this list.)

  2. The scores for the four datasets - NarrativeQA, TREC, LSHT (zh), and Topic Retrieval - evaluated with LongBench show significant differences of -10.94, 31.04, and 7.17 points. Could you please tell me the reasons for these errors? What is an acceptable level of error? How can I reduce the error to make the results more accurate?

  3. For the TREC and LSHT (zh) subsets of the LongBench dataset, I found that the corresponding scores on the LongBench leaderboard (https://github.com/THUDM/LongBench/blob/main/README.md) are not significantly different from my results. Should I rely on the OpenCompass leaderboard or the LongBench leaderboard?
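
As mentioned in item 1, one way to double-check the per-subset numbers is to read them back from the summary CSV that the run writes. A minimal sketch, assuming the default outputs/default/<timestamp>/summary layout (the directory names vary per run and may differ across OpenCompass versions):

import csv
from pathlib import Path

# Assumption: the run used the default work dir, so the newest run directory
# sits under outputs/default/ and contains summary/summary_<timestamp>.csv.
summary_dir = sorted(Path('outputs/default').glob('*/summary'))[-1]
summary_csv = sorted(summary_dir.glob('summary_*.csv'))[-1]

with open(summary_csv, newline='') as f:
    for row in csv.DictReader(f):
        # One row per dataset/subset; print it so each metric can be compared
        # against the leaderboard value by hand.
        print(row)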

Other information

No response

findalexli commented 1 month ago

Following this.

lxy0727 commented 2 weeks ago

same issue