hiyouga / LLaMA-Factory

Efficiently Fine-Tune 100+ LLMs in WebUI (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Evaluation results of a newly downloaded model are about half of the official scores #2076

Closed: guoyjalihy closed this issue 8 months ago

guoyjalihy commented 8 months ago

Reminder

Reproduction

Running the evaluation script below gives very low results. The model is the latest ChatGLM2-6B downloaded from ModelScope, without any training or modification.

    CUDA_VISIBLE_DEVICES=0 python src/evaluate.py \
        --model_name_or_path /data/mlops/models/chatglm2-6b \
        --save_dir /data/mlops/evaluation/results/chatglm2-6b/cmmlu_2024-01-04-10-00-00 \
        --template vanilla \
        --task mmlu \
        --split test \
        --lang zh \
        --n_shot 5 \
        --batch_size 4

Expected behavior

The expectation is to get results close to the official ones:

    Average: 45.46
    STEM: 40.06
    Social Sciences: 51.61
    Humanities: 41.23
    Other: 51.24

The results actually obtained:

    Average: 24.90
    STEM: 25.68
    Social Sciences: 24.56
    Humanities: 23.91
    Other: 25.50

System Info

What could be the reason for this?

Others

No response

hiyouga commented 8 months ago

The evaluation methods are different; the project does not currently support CoT.
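For illustration, here is a minimal sketch of the difference between the two evaluation styles. The `model_generate` helper, the prompt wording, and the answer-extraction rules are hypothetical placeholders, not LLaMA-Factory or ChatGLM APIs; the point is only that a harness which scores the single letter the model emits right after "Answer:" can report much lower accuracy than one that lets the model reason step by step (CoT) and then extracts the final choice from the generated text.

    import re

    def model_generate(prompt: str, max_new_tokens: int) -> str:
        """Hypothetical stand-in for a real model call (e.g. ChatGLM2-6B).
        Returns canned text so the sketch runs end to end."""
        if "step by step" in prompt:
            return "Paris is the capital of France, so the answer is B."
        return "B"

    def format_options(choices: dict) -> str:
        return "\n".join(f"{k}. {v}" for k, v in choices.items())

    # Style 1: direct answer -- score the first letter the model emits.
    def eval_direct(question: str, choices: dict) -> str:
        prompt = f"{question}\n{format_options(choices)}\nAnswer:"
        output = model_generate(prompt, max_new_tokens=1)
        match = re.search(r"[ABCD]", output)
        return match.group(0) if match else ""

    # Style 2: chain of thought -- let the model reason first, then take
    # the last option letter it mentions as the final answer.
    def eval_cot(question: str, choices: dict) -> str:
        prompt = f"{question}\n{format_options(choices)}\nLet's think step by step."
        output = model_generate(prompt, max_new_tokens=256)
        matches = re.findall(r"[ABCD]", output)
        return matches[-1] if matches else ""

    if __name__ == "__main__":
        q = "What is the capital of France?"
        c = {"A": "Berlin", "B": "Paris", "C": "Rome", "D": "Madrid"}
        print("direct:", eval_direct(q, c))  # B
        print("cot:   ", eval_cot(q, c))     # B

Since the project's evaluator does not use CoT, scores produced this way are not directly comparable to numbers reported from a CoT-based harness.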