hkust-nlp / ceval

Official github repo for C-Eval, a Chinese evaluation suite for foundation models [NeurIPS 2023]
https://cevalbenchmark.com/
MIT License
1.63k stars, 78 forks

What observations is this conclusion based on? #30

Closed: wwngh1233 closed this issue 1 year ago

wwngh1233 commented 1 year ago

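Each individual subject has only 200-300 questions on average, so a difference has to exceed 5 points on a single subject to count as significant; across all subjects there are 15k questions, so a difference of more than 2 points there can be considered significant.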

jxhe commented 1 year ago

Hmm, this is actually fairly subjective; it is a rule of thumb based on experience and personal impression (maybe don't take it too seriously, sorry :) )
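
For readers who want a rough sanity check on thresholds like the ones quoted in the question, here is a minimal sketch. It is not how the maintainers arrived at their numbers (the reply says those are subjective). It assumes each question is an independent Bernoulli trial and uses the normal approximation to the binomial. The function names `accuracy_margin` and `difference_margin` are hypothetical, and the default accuracy of 0.5 is a worst-case assumption, not a measured value.

```python
import math


def accuracy_margin(n_questions: int, accuracy: float = 0.5, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for an accuracy
    score, in percentage points (normal approximation to the binomial).

    accuracy=0.5 is the worst case, since it maximizes the variance.
    """
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n_questions)
    return 100.0 * z * standard_error


def difference_margin(n_questions: int, accuracy: float = 0.5, z: float = 1.96) -> float:
    """Approximate margin for the gap between two models' accuracies.

    Assumption: the two scores are treated as independent, so their variances
    add. Models are usually scored on the same questions, so this is only a
    rough guide.
    """
    return math.sqrt(2.0) * accuracy_margin(n_questions, accuracy, z)


if __name__ == "__main__":
    # 250 stands in for a single subject, 15000 for the full suite.
    for n in (250, 15000):
        print(n, round(accuracy_margin(n), 2), round(difference_margin(n), 2))
```

Under these assumptions, a single score on 250 questions is uncertain by roughly 6 points, and a score on 15,000 questions by roughly 0.8 points, so the rule of thumb above is at least the right order of magnitude.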