open-compass / opencompass

OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
https://opencompass.org.cn/
Apache License 2.0

[Bug] Large score gaps on some subsets of the long-text evaluation datasets #1061

Open bullw opened 4 months ago

bullw commented 4 months ago

Prerequisite

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

OpenCompass 0.2.3, transformers 4.35.2, GPU: A100

Reproduces the problem - code/configuration sample

None

Reproduces the problem - command or script

python run.py --datasets longbench leval \
              --hf-path /code/open_model/chatglm2-6b-32k \
              --model-kwargs device_map='auto' \
              --max-seq-len 32768 \
              --batch-size 1 \
              --max-out-len 512 \
              --num-gpus 1 \
              --max-partition-size 5000 \
              --max-workers-per-gpu 3 \
              --engine torch
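
For reference, the same run can be expressed as a Python config instead of CLI flags. This is a minimal sketch: the dataset import paths (.datasets.longbench.longbench, .datasets.leval.leval) and the HuggingFaceCausalLM fields follow the configs/ layout of OpenCompass 0.2.x and may differ in other versions.

from mmengine.config import read_base
from opencompass.models import HuggingFaceCausalLM

with read_base():
    # Assumed aggregate configs collecting every LongBench / L-Eval subset;
    # check configs/datasets/ in your checkout for the exact module names.
    from .datasets.longbench.longbench import longbench_datasets
    from .datasets.leval.leval import leval_datasets

datasets = longbench_datasets + leval_datasets

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr='chatglm2-6b-32k',
        path='/code/open_model/chatglm2-6b-32k',
        tokenizer_path='/code/open_model/chatglm2-6b-32k',
        model_kwargs=dict(device_map='auto', trust_remote_code=True),
        tokenizer_kwargs=dict(trust_remote_code=True),
        max_seq_len=32768,
        max_out_len=512,
        batch_size=1,
        run_cfg=dict(num_gpus=1),
    )
]

Saved under configs/ (for example as configs/eval_chatglm2_6b_32k_long.py, a hypothetical file name), it would be launched with: python run.py configs/eval_chatglm2_6b_32k_long.py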

Reproduces the problem - error message

I reproduced most of the scores using the official configuration, but a few subsets show a larger score gap. Please help me analyze this.

There is a big difference between the scores below and those on the leaderboard:

I have the following questions:

  1. The scores for the four datasets - Coursera, NarrativeQA, NQ, and Topic Retrieval - evaluated with L-Eval show significant differences of -9.3, -9.42, -14.29, and 8 points respectively. Could you please tell me the reasons for these errors? What is an acceptable level of error? How can I reduce the error to make the results more accurate? (One way to re-read the per-subset scores from my run is sketched after this list.)

  2. The scores for the four datasets - NarrativeQA, TREC, LSHT (zh), and Topic Retrieval - evaluated with LongBench show significant differences of -10.94, 31.04, and 7.17 points. Could you please tell me the reasons for these errors? What is an acceptable level of error? How can I reduce the error to make the results more accurate?

  3. For the TREC and LSHT (zh) subsets of the LongBench dataset, I found that the corresponding scores on the LongBench leaderboard (https://github.com/THUDM/LongBench/blob/main/README.md) are not significantly different from my results. Should I rely on the OpenCompass leaderboard or the LongBench leaderboard?
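
As mentioned in item 1, one way to double-check the per-subset numbers is to read them back from the summary CSV that the run writes. A minimal sketch, assuming the default outputs/default/<timestamp>/summary layout (the directory names vary per run and may differ across OpenCompass versions):

import csv
from pathlib import Path

# Assumption: the run used the default work dir, so the newest run directory
# sits under outputs/default/ and contains summary/summary_<timestamp>.csv.
summary_dir = sorted(Path('outputs/default').glob('*/summary'))[-1]
summary_csv = sorted(summary_dir.glob('summary_*.csv'))[-1]

with open(summary_csv, newline='') as f:
    for row in csv.DictReader(f):
        # One row per dataset/subset; print it so each metric can be compared
        # against the leaderboard value by hand.
        print(row)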

Other information

No response

findalexli commented 1 month ago

Following this.

lxy0727 commented 2 weeks ago

same issue