Has anyone measured the score of gpt-4-1106-preview on the test split? #68

theblackcat102 commented 11 months ago

In my own runs on val with 5-shot prompting, the scores I get are abnormally low (even worse than turbo):

model               subject                accuracy (%)
gpt-4-1106-preview  computer_network       15.78947
gpt-4-1106-preview  operating_system       10.52632
gpt-4-1106-preview  computer_architecture   0.00000
gpt-4-1106-preview  college_programming    32.43243
gpt-4-1106-preview  college_physics        36.84211
gpt-4-1106-preview  college_chemistry       0.00000

@HYZ17 Have you tested gpt-4-turbo on the test split internally? At least on my side, running 3.5-turbo on val gives results quite close to its test scores.

HYZ17 commented 11 months ago

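We ran zero-shot tests on a small subset of subjects and found that gpt-4-turbo's output format varies a lot, and that it sometimes refuses to give an answer. This may be one of the reasons for the low accuracy.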
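If varied output formats and refusals are the cause, a more tolerant answer parser that also reports how many outputs could not be parsed may help separate parsing failures from real mistakes. Below is a minimal sketch, not the parser used by the C-Eval authors: the regex patterns and the helper names `extract_choice` and `score` are hypothetical and would need tuning against the outputs actually observed.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical patterns, tried in order; adjust them to the outputs you see.
_PATTERNS = [
    # Chinese phrasing such as "答案是 C" or "答案：C"
    re.compile(r"答案(?:是|为|：|:)?\s*[\(（]?([ABCD])(?![A-Za-z])"),
    # English phrasing such as "The answer is (D)" or "Answer: B"
    re.compile(r"(?i:answer)\s*(?:(?i:is)|:)?\s*[\(（]?([ABCD])\b"),
    # The whole output is a bare letter such as "B", "(B)" or "B."
    re.compile(r"^\s*[\(（]?([ABCD])[\)）]?[.．。:：、]?\s*$"),
    # The output starts with a letter plus punctuation, such as "A. because ..."
    re.compile(r"^\s*[\(（]?([ABCD])[\)）.．、:：]\s"),
]


def extract_choice(text: str) -> Optional[str]:
    """Return the chosen option letter, or None if no pattern matches."""
    for pattern in _PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None  # treated as a refusal or an unparseable output


def score(outputs: List[str], gold: List[str]) -> Tuple[float, int]:
    """Return (accuracy in percent, number of unparsed outputs)."""
    preds = [extract_choice(o) for o in outputs]
    correct = sum(p == g for p, g in zip(preds, gold))
    unparsed = sum(p is None for p in preds)
    return 100.0 * correct / len(gold), unparsed
```

A high unparsed count next to a low accuracy would suggest the score is being dragged down by formatting or refusals rather than by wrong answers.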

houxiang676 commented 8 months ago

The test split has no answers. What should I do?

HYZ17 commented 8 months ago

Submit your predictions on this website to get a score: https://cevalbenchmark.com/
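For anyone preparing a submission, here is a rough sketch of writing predictions to a JSON file. The layout (subject name mapped to question id mapped to answer letter, with ids assumed to be 0-based and in file order) is an assumption, and `build_submission` and `write_submission` are hypothetical helper names; please check the submission instructions on the website for the exact format before uploading.

```python
import json
from typing import Dict, List


def build_submission(predictions: Dict[str, List[str]]) -> Dict[str, Dict[str, str]]:
    """Map each subject to {question id: answer letter}.

    Assumption: question ids are the 0-based positions in the test file.
    """
    return {
        subject: {str(i): answer for i, answer in enumerate(answers)}
        for subject, answers in predictions.items()
    }


def write_submission(predictions: Dict[str, List[str]], path: str = "submission.json") -> None:
    """Write the assumed submission layout to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(build_submission(predictions), f, ensure_ascii=False, indent=2)
```

For example, `write_submission({"computer_network": ["A", "C"]})` would write a file containing one `computer_network` entry with ids `"0"` and `"1"`.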
