hiyouga / LLaMA-Factory

Efficiently Fine-Tune 100+ LLMs in WebUI (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Evaluation results of a newly downloaded model are about half of the official scores #2076

Closed: guoyjalihy closed this issue 8 months ago

guoyjalihy commented 8 months ago

Reminder

Reproduction

Running the evaluation script below gives very low results. The model is the latest ChatGLM2-6B downloaded from ModelScope, without any training or modification.

    CUDA_VISIBLE_DEVICES=0 python src/evaluate.py \
        --model_name_or_path /data/mlops/models/chatglm2-6b \
        --save_dir /data/mlops/evaluation/results/chatglm2-6b/cmmlu_2024-01-04-10-00-00 \
        --template vanilla \
        --task mmlu \
        --split test \
        --lang zh \
        --n_shot 5 \
        --batch_size 4

Expected behavior

The expectation is to get results close to the official ones:

    Average: 45.46
    STEM: 40.06
    Social Sciences: 51.61
    Humanities: 41.23
    Other: 51.24

The results actually obtained:

    Average: 24.90
    STEM: 25.68
    Social Sciences: 24.56
    Humanities: 23.91
    Other: 25.50

System Info

What could be the reason for this?

Others

No response

hiyouga commented 8 months ago

The evaluation methods are different; the project does not currently support CoT.
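For illustration, here is a minimal sketch of the difference between the two evaluation styles. The `model_generate` helper, the prompt wording, and the answer-extraction rules are hypothetical placeholders, not LLaMA-Factory or ChatGLM APIs; the point is only that a harness which scores the single letter the model emits right after "Answer:" can report much lower accuracy than one that lets the model reason step by step (CoT) and then extracts the final choice from the generated text.

    import re

    def model_generate(prompt: str, max_new_tokens: int) -> str:
        """Hypothetical stand-in for a real model call (e.g. ChatGLM2-6B).
        Returns canned text so the sketch runs end to end."""
        if "step by step" in prompt:
            return "Paris is the capital of France, so the answer is B."
        return "B"

    def format_options(choices: dict) -> str:
        return "\n".join(f"{k}. {v}" for k, v in choices.items())

    # Style 1: direct answer -- score the first letter the model emits.
    def eval_direct(question: str, choices: dict) -> str:
        prompt = f"{question}\n{format_options(choices)}\nAnswer:"
        output = model_generate(prompt, max_new_tokens=1)
        match = re.search(r"[ABCD]", output)
        return match.group(0) if match else ""

    # Style 2: chain of thought -- let the model reason first, then take
    # the last option letter it mentions as the final answer.
    def eval_cot(question: str, choices: dict) -> str:
        prompt = f"{question}\n{format_options(choices)}\nLet's think step by step."
        output = model_generate(prompt, max_new_tokens=256)
        matches = re.findall(r"[ABCD]", output)
        return matches[-1] if matches else ""

    if __name__ == "__main__":
        q = "What is the capital of France?"
        c = {"A": "Berlin", "B": "Paris", "C": "Rome", "D": "Madrid"}
        print("direct:", eval_direct(q, c))  # B
        print("cot:   ", eval_cot(q, c))     # B

Since the project's evaluator does not use CoT, scores produced this way are not directly comparable to numbers reported from a CoT-based harness.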