open-compass / opencompass

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.
https://opencompass.org.cn/
Apache License 2.0

CMB + Qwen1.5-72B-Chat got empty answers #1141

Open qy1026 opened 3 months ago

qy1026 commented 3 months ago

Prerequisite

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

python -c "import opencompass.utils;import pprint;pprint.pprint(dict(opencompass.utils.collect_env()))"

Reproduces the problem - code/configuration sample

CUDA_VISIBLE_DEVICES="0,1,2,3" python run.py \
    --datasets cmb_gen_dfb5c4 \
    --hf-path "/Qwen1.5-72B-Chat/" \
    --model-kwargs device_map='auto' trust_remote_code=True \
    --tokenizer-kwargs padding_side='left' truncation='left' trust_remote_code=True \
    --max-out-len 100 \
    --max-seq-len 2048 \
    --batch-size 1 \
    --num-gpus 4

Reproduces the problem - command or script

CUDA_VISIBLE_DEVICES="0,1,2,3" python run.py \
    --datasets cmb_gen_dfb5c4 \
    --hf-path "/Qwen1.5-72B-Chat/" \
    --model-kwargs device_map='auto' trust_remote_code=True \
    --tokenizer-kwargs padding_side='left' truncation='left' trust_remote_code=True \
    --max-out-len 100 \
    --max-seq-len 2048 \
    --batch-size 1 \
    --num-gpus 4

Reproduces the problem - error message

CMB + qwen1.5-72b-chat: test acc = 0.26%. Almost all the predictions in cmb_test_k.json are "" (no answer). Examples:

"21": {
    "origin_prompt": "以下是中国医师考试中规培结业考试的一道多项选择题,不需要做任何分析和解释,直接输出答案选项。\n每一精神症状均有明确定义,并具有以下特点\nA. 症状的出现不受病人意识控制\nB. 症状出现可受病人意识控制\nC. 症状可以通过转移的方法使其消失\nD. 症状内容与周围环境不相称\nE. 症状给病人带来不同程度的功能损害 \n 答案: ",
    "prediction": "",
    "gold": "NULL"
},
"22": {
    "origin_prompt": "以下是中国医师考试中规培结业考试的一道单项选择题,不需要做任何分析和解释,直接输出答案选项。\n关于慢性粒细胞白血病,错误的是\nA. 造血干细胞恶性克隆性疾病\nB. 自然病程仅数月\nC. 分为慢性期、加速期和急变期\nD. 最显著的体征是脾大\nE. 血象白细胞持续增高 \n 答案: ",
    "prediction": "",
    "gold": "NULL"
},
"23": {
    "origin_prompt": "以下是中国医师考试中规培结业考试的一道单项选择题,不需要做任何分析和解释,直接输出答案选项。\n确定颌位关系包括\nA. 定位平面记录\nB. 下颌后退记录\nC. 面下1/3高度记录\nD. 垂直距离和下颌前伸(牙合)记录\nE. 垂直距离和正中关系记录 \n 答案: ",
    "prediction": "",
    "gold": "NULL"
},

By contrast, CMB + qwen1.5-32b-chat behaves normally, with an acc of around 52%.
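To quantify how many predictions are actually empty, a small script along these lines may help. It is only a sketch: it assumes the predictions file is a JSON object keyed by example index, with a "prediction" field per record, as in the excerpt above. The function names and the sample records are hypothetical and not part of OpenCompass.

```python
import json
from pathlib import Path


def load_predictions(path):
    """Load a predictions file (assumed: JSON object keyed by example index)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def empty_prediction_rate(predictions):
    """Return the fraction of records whose "prediction" is missing or blank."""
    if not predictions:
        return 0.0
    empty = sum(
        1
        for record in predictions.values()
        if not str(record.get("prediction") or "").strip()
    )
    return empty / len(predictions)


# Synthetic records for illustration only; they are not taken from the report.
sample = {
    "21": {"prediction": "", "gold": "NULL"},
    "22": {"prediction": "   ", "gold": "NULL"},
    "23": {"prediction": "B", "gold": "NULL"},
}
rate = empty_prediction_rate(sample)
print(f"{rate:.2%} of predictions are empty")
```

Running `empty_prediction_rate(load_predictions(...))` on the 72B and 32B prediction files would show whether the gap comes from empty generations or from wrong answers.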

Other information

No response

bittersweet1999 commented 3 months ago

You can try setting do_sample=True in the model's generation_kwargs and see whether it makes a difference.
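As a rough illustration of that suggestion, the sketch below merges do_sample=True into a model config dict. Only the generation_kwargs key and the do_sample option come from the comment above; the helper with_sampling is hypothetical, and the other keys simply mirror the CLI flags from the report, on the assumption that the model config uses the same names.

```python
from copy import deepcopy


def with_sampling(model_cfg, **extra_generation_kwargs):
    """Return a copy of model_cfg with do_sample=True merged into generation_kwargs."""
    cfg = deepcopy(model_cfg)
    generation_kwargs = dict(cfg.get("generation_kwargs") or {})
    generation_kwargs.update({"do_sample": True, **extra_generation_kwargs})
    cfg["generation_kwargs"] = generation_kwargs
    return cfg


# Assumed config keys, mirroring the command-line flags in the report.
base_cfg = dict(
    path="/Qwen1.5-72B-Chat/",
    max_out_len=100,
    max_seq_len=2048,
    batch_size=1,
)

sampled_cfg = with_sampling(base_cfg)
print(sampled_cfg["generation_kwargs"])
```

The original config is left untouched, so the greedy and sampled runs can be compared side by side.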