What observations is this conclusion based on?

hkust-nlp / ceval

Official github repo for C-Eval, a Chinese evaluation suite for foundation models [NeurIPS 2023]

https://cevalbenchmark.com/

MIT License

1.63k stars 78 forks

Closed (wwngh1233 closed this issue 1 year ago)

wwngh1233 commented 1 year ago

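A single subject has only 200-300 questions on average, so a gain there has to exceed 5 points to count as significant; all subjects together have 15k questions, and on that total a gain of more than 2 points can be considered significant.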

jxhe commented 1 year ago

Hmm, this is actually fairly subjective; it is a rule of thumb based on experience and personal impression (maybe don't take it too seriously, sorry : )
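
As a rough cross-check of the thresholds quoted above (not how the authors arrived at them), one could treat each question as an independent right/wrong trial and compute the 95% margin of error of an accuracy score under a normal approximation. The sketch below assumes the worst case of 50% accuracy and uses 250 and 15,000 as illustrative question counts taken from the figures in this thread; the function name `accuracy_margin` is hypothetical.

```python
import math


def accuracy_margin(n_questions, accuracy=0.5, z=1.96):
    """Half-width of a normal-approximation confidence interval for an
    accuracy measured on n_questions independent right/wrong questions.

    accuracy=0.5 is the worst case (largest variance); z=1.96 gives ~95%.
    These defaults are assumptions for illustration only.
    """
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n_questions)
    return z * standard_error


# Illustrative sizes: one subject (~250 questions) vs. the whole suite (~15k).
for n in (250, 15000):
    margin_points = 100.0 * accuracy_margin(n)
    print(f"n={n}: about +/- {margin_points:.1f} accuracy points")
```

Under these assumptions the margin is roughly 6.2 points for 250 questions and 0.8 points for 15,000, which is the same order of magnitude as the 5-point and 2-point rules of thumb. Comparing two models adds further uncertainty (up to a factor of about 1.4 if their errors were independent), while a paired comparison on the same questions is usually tighter, so this is only a sanity check.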