open-compass/opencompass

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, Llama2, Qwen, GLM, Claude, etc.) over 100+ datasets.
https://opencompass.org.cn/
Apache License 2.0

[Bug] Long text evaluation parameters are not clear #1035

Open · bullw opened this issue 4 months ago

bullw commented 4 months ago

Prerequisite

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

Python 3.10.1, OpenCompass 0.2.3, vLLM 0.2.3

Reproduces the problem - code/configuration sample

configs/models/chatglm/vllm_chatglm2_6b_32k.py:

```python
from opencompass.models import VLLM

models = [
    dict(
        type=VLLM,
        abbr='chatglm2-6b-32k-vllm',
        path='THUDM/chatglm2-6b-32k',
        max_out_len=512,
        max_seq_len=4096,
        batch_size=32,
        generation_kwargs=dict(temperature=0),
        run_cfg=dict(num_gpus=1, num_procs=1),
    )
]
```

Reproduces the problem - command or script

```bash
python run.py --models vllm_chatglm2_6b_32k --datasets longbench leval
```

Reproduces the problem - error message

My evaluation results differ from the long-text evaluation scores reported in the documentation by roughly 20 points, and I cannot reproduce the documented scores.

  1. Should the `max_seq_len` and `max_out_len` parameters be modified in any way?

Other information

No response

liushz commented 4 months ago

For optimal performance, it is advisable to set the max_seq_len parameter to the highest value feasible, such as 32768 or even higher. As for max_out_len, it typically has a preset default value in the dataset configuration; you can lower it to 256, or simply keep the default.
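For reference, here is a minimal sketch of how the model config above could be adjusted along these lines. The max_seq_len value of 32768 follows the suggestion above; setting max_out_len to 256 (rather than keeping the per-dataset default) is my own assumption, not something prescribed by the docs.

```python
from opencompass.models import VLLM

# Sketch only: raises max_seq_len as suggested above, keeps the other fields
# from the original reproduction config. max_out_len=256 is one of the two
# options mentioned above; omitting it to keep the dataset default also works.
models = [
    dict(
        type=VLLM,
        abbr='chatglm2-6b-32k-vllm',
        path='THUDM/chatglm2-6b-32k',
        max_out_len=256,
        max_seq_len=32768,
        batch_size=32,
        generation_kwargs=dict(temperature=0),
        run_cfg=dict(num_gpus=1, num_procs=1),
    )
]
```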

bullw commented 4 months ago

Thank you very much. I reproduced most of the scores.

I also need to ask about the subsets scored with ROUGE metrics (rouge1, rouge2, rougeL, rougeLsum), where the difference from the reported scores is still very large.

  1. What could be the reason for this?
  2. Which metrics are used for the leaderboard ranking?

(two screenshot attachments)
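As a side note for anyone comparing these numbers: the four ROUGE variants named above can be computed offline with the standalone rouge_score package. This is only an illustrative sketch; OpenCompass may apply its own text post-processing before scoring, so the resulting numbers can differ from the leaderboard.

```python
# Illustrative only: computes rouge1/rouge2/rougeL/rougeLsum with the
# `rouge_score` package. This does not reproduce OpenCompass's internal
# post-processing, so scores may not match its reported results.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(
    ['rouge1', 'rouge2', 'rougeL', 'rougeLsum'], use_stemmer=True)

reference = "The quick brown fox jumps over the lazy dog."
prediction = "A quick brown fox jumped over a lazy dog."

scores = scorer.score(reference, prediction)
for name, s in scores.items():
    print(f"{name}: precision={s.precision:.3f} "
          f"recall={s.recall:.3f} f1={s.fmeasure:.3f}")
```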

bullw commented 4 months ago

@liushz