hkust-nlp / ceval

Official github repo for C-Eval, a Chinese evaluation suite for foundation models [NeurIPS 2023]
https://cevalbenchmark.com/
MIT License
1.63k stars 78 forks

Problem submitting results #43

Closed 18811449050 closed 1 year ago

18811449050 commented 1 year ago

Hello, I uploaded my model's test-set results file in the format of create_sample.json, but the log contains a large number of error messages, shown below. What causes these two kinds of mismatch?

Mismatched Questions:
Subject 'chinese_language_and_literature' has mismatched questions. Missing counts: [149]. Extra counts: [0].
Subject 'clinical_medicine' has mismatched questions. Missing counts: [143]. Extra counts: [0].
Subject 'sports_science' has mismatched questions. Missing counts: [124]. Extra counts: [0].
Subject 'civil_servant' has mismatched questions. Missing counts: [371]. Extra counts: [0].
Subject 'veterinary_medicine' has mismatched questions. Missing counts: [153]. Extra counts: [0].
Subject 'middle_school_chemistry' has mismatched questions. Missing counts: [126]. Extra counts: [0].
Subject 'middle_school_history' has mismatched questions. Missing counts: [151]. Extra counts: [0].
Subject 'middle_school_geography' has mismatched questions. Missing counts: [50]. Extra counts: [0].
Subject 'middle_school_politics' has mismatched questions. Missing counts: [142]. Extra counts: [0].
Subject 'middle_school_mathematics' has mismatched questions. Missing counts: [121]. Extra counts: [0].
Subject 'middle_school_physics' has mismatched questions. Missing counts: [119]. Extra counts: [0].
Subject 'middle_school_biology' has mismatched questions. Missing counts: [133]. Extra counts: [0].
Subject 'physician' has mismatched questions. Missing counts: [386]. Extra counts: [0].
Subject 'basic_medicine' has mismatched questions. Missing counts: [120]. Extra counts: [0].
Subject 'modern_chinese_history' has mismatched questions. Missing counts: [155]. Extra counts: [0].
Subject 'college_chemistry' has mismatched questions. Missing counts: [167]. Extra counts: [0].
Subject 'college_physics' has mismatched questions. Missing counts: [124]. Extra counts: [0].
Subject 'college_economics' has mismatched questions. Missing counts: [440]. Extra counts: [0].
Subject 'college_programming' has mismatched questions. Missing counts: [286]. Extra counts: [0].

Invalid Answer Option:
Subject 'chinese_language_and_literature' has an invalid answer option.
Subject 'clinical_medicine' has an invalid answer option.
Subject 'sports_science' has an invalid answer option.
Subject 'civil_servant' has an invalid answer option.
Subject 'veterinary_medicine' has an invalid answer option.
Subject 'middle_school_chemistry' has an invalid answer option.
Subject 'middle_school_history' has an invalid answer option.
Subject 'middle_school_geography' has an invalid answer option.
Subject 'middle_school_politics' has an invalid answer option.
Subject 'middle_school_mathematics' has an invalid answer option.
Subject 'middle_school_physics' has an invalid answer option.
Subject 'middle_school_biology' has an invalid answer option.
Subject 'physician' has an invalid answer option.
Subject 'basic_medicine' has an invalid answer option.
Subject 'modern_chinese_history' has an invalid answer option.
Subject 'college_chemistry' has an invalid answer option.
Subject 'college_physics' has an invalid answer option.
Subject 'college_economics' has an invalid answer option.
Subject 'college_programming' has an invalid answer option.
Subject 'professional_tour_guide' has an invalid answer option.
Subject 'business_administration' has an invalid answer option.
Subject 'ideological_and_moral_cultivation' has an invalid answer option.
Subject 'operating_system' has an invalid answer option.
Subject 'teacher_qualification' has an invalid answer option.
Subject 'education_science' has an invalid answer option.
Subject 'plant_protection' has an invalid answer option.
Subject 'probability_and_statistics' has an invalid answer option.
Subject 'mao_zedong_thought' has an invalid answer option.
Subject 'law' has an invalid answer option.
Subject 'legal_professional' has an invalid answer option.
Subject 'accountant' has an invalid answer option.

zzh068 commented 1 year ago

Hello, your results file seems to be incomplete; the first error message shows the details. Also, some of the answer options are not what is expected; see the second error message for details. “Missing counts” is the number of questions missing for a subject. Please check your uploaded file again~ If you have any questions, please feel free to contact us ^_^

18811449050 commented 1 year ago

submission id:658972a9-2c5f-11ee-bd8b-00163e0166d9

马国良

---- Original message ---- | Sent: July 28, 2023, 12:02 | Subject: Re: [SJTU-LIT/ceval] Problem submitting results (Issue #43) |

Hello, your results file seems to be incomplete; the first error message shows the details. Also, some of the answer options are not what is expected; see the second error message for details. “Missing counts” is the number of questions missing for a subject. Please check your uploaded file again~ ^_^


18811449050 commented 1 year ago

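This is the format of my JSON file (see below). The upload reported errors but still produced results, and every score is fairly low. When uploading, I saw that the website requires at least 50 questions per subject, so I randomly selected 60 questions for each subject (although later, for testing, I forcibly renumbered the ids to consecutive values).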

Subject 'chinese_language_and_literature' has mismatched questions. Missing counts: [149]. Extra counts: [0].

Does this error mean that the item whose original id is 149 must have a result?

{ "accountant": { "0": "D", "1": "B", "2": "B", "3": "C", "4": "B", "5": "B", "6": "D", "7": "B", "8": "B", "9": "B", "10": "B", "11": "B", "12": "A", "13": "D", "14": "B", "15": "A", "16": "A", "17": "C", "18": "B", "19": "A", "20": "C", "21": "A", "22": "A", "23": "C", "24": "B", "25": "B", "26": "A", "27": "C", "28": "B", "29": "B", "30": "D", "31": "B", "32": "B", "33": "B", "34": "D", "35": "B", "36": "C", "37": "D", "38": "D", "39": "C", "40": "C", "41": "B", "42": "A", "43": "D", "44": "A", "45": "B", "46": "A", "47": "B", "48": "A", "49": "B", "50": "B", "51": "B", "52": "B", "53": "B", "54": "ABDC", "55": "C", "56": "B", "57": "C" },

zzh068 commented 1 year ago

It means you are missing 149 records~
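To make the "Missing counts" and "Extra counts" numbers concrete, here is a minimal Python sketch of how such counts could be computed for one subject. The server-side checker is not shown in this thread, so the function name `count_mismatches`, the toy id range, and the set-difference logic are all assumptions for illustration, not the actual C-Eval implementation.

```python
def count_mismatches(answers, expected_ids):
    """Return (missing, extra) counts for one subject.

    answers: dict mapping question id (str) to the submitted option.
    expected_ids: iterable of question ids the test set contains
                  (hypothetical here; the real ids come from the test files).
    """
    submitted = set(answers)
    expected = set(expected_ids)
    missing = len(expected - submitted)  # expected questions with no answer
    extra = len(submitted - expected)    # answers for ids not in the test set
    return missing, extra


# Toy example: a subject with 5 test questions, ids "0".."4" (made-up size).
expected_ids = [str(i) for i in range(5)]
answers = {"0": "A", "1": "B", "7": "C"}

missing, extra = count_mismatches(answers, expected_ids)
print(f"Missing counts: [{missing}]. Extra counts: [{extra}].")
```

Under this reading, the bracketed number is how many questions lack an answer, not the id of one particular question.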

18811449050 commented 1 year ago

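So I can take it that the whole test set has to be run and the corresponding results extracted, right? And if only part of the data is run, the score will be pulled down, is that correct?

Subject 'accountant' has an invalid answer option.

For the error above, does "invalid answer option" appear because of the multiple-choice answers?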

zzh068 commented 1 year ago

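Yes, that is the right way to understand it~ If you submit only part of the data, the score will be lower. The expected option is a single choice, so a multi-letter answer is invalid.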
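A quick local check before uploading can catch answers like "ABDC". The sketch below is only an illustration: `find_invalid_answers` is a hypothetical helper, and the valid option set A to D is an assumption based on the sample answers posted earlier in this thread, not a statement of how the official checker works.

```python
# Assumption: each answer must be exactly one of these single letters.
VALID_OPTIONS = {"A", "B", "C", "D"}


def find_invalid_answers(submission):
    """Return {subject: {question_id: answer}} for every invalid answer.

    submission: dict shaped like the uploaded JSON,
                {subject: {question_id: option}}.
    """
    invalid = {}
    for subject, answers in submission.items():
        bad = {qid: ans for qid, ans in answers.items()
               if ans not in VALID_OPTIONS}
        if bad:
            invalid[subject] = bad
    return invalid


# A small excerpt in the same shape as the file posted above.
submission = {"accountant": {"53": "B", "54": "ABDC", "55": "C"}}
print(find_invalid_answers(submission))
# prints {'accountant': {'54': 'ABDC'}}
```

For a real file, the dictionary could be loaded with `json.load` and passed to the same function; how to reduce a multi-letter model output to a single option is left to the submitter.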

18811449050 commented 1 year ago

OK, thank you.