hkust-nlp / ceval

Official github repo for C-Eval, a Chinese evaluation suite for foundation models [NeurIPS 2023]
https://cevalbenchmark.com/
MIT License

Why chatglm2-6b score is higher than gpt-4 in your leaderboard? #31

Closed ChristopheZhao closed 1 year ago

ChristopheZhao commented 1 year ago

Hi, I see on your leaderboard that ChatGLM2 has a higher score than GPT-4. After trying ChatGLM2-6B myself, I don't believe that is true, so I wonder whether your scoring method can fairly assess LLM ability. Sorry for my blunt and almost offensive wording, but I think it is misleading: some people will assume the rank-1 model on your leaderboard is the SOTA model so far, which is not actually the case.

jxhe commented 1 year ago
  1. The rank-0 model is not ChatGLM2-6B; it is ChatGLM2. We don't know the model size of ChatGLM2 (presumably much larger than 6B), its weights are not public, and its interface may not yet be open for beta testing either -- so you may not have tried the rank-0 model.
  2. ChatGLM2-6B is also on our leaderboard, and it underperforms ChatGPT.
  3. C-Eval is a Chinese test, and many of the questions require knowledge in a Chinese context, so I don't think it is very surprising that GPT-4 is not that strong on Chinese knowledge. ChatGLM2 actually underperforms GPT-4 on C-Eval Hard -- if you look at the scores on specific subjects, GPT-4 is still much better on complex STEM subjects.
  4. The C-Eval leaderboard, like many other leaderboards, should be viewed with caution: different benchmarks assess different abilities, and ALL the public leaderboards we know of have their own limitations.
  5. We emphasize that we created C-Eval to help model development, not to rank models. We do not take any position on model ranking.
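For reference, leaderboard scores of this kind are typically macro-averaged multiple-choice accuracy: accuracy is computed per subject, then averaged across subjects with equal weight. A minimal sketch of that idea (the subject names, predictions, and answers below are made up for illustration; this is not the official C-Eval scoring code):

```python
# Hypothetical sketch of a C-Eval-style overall score:
# per-subject multiple-choice accuracy, macro-averaged over subjects.

def subject_accuracy(predictions, answers):
    """Fraction of A/B/C/D predictions that match the gold answers."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

def overall_score(per_subject):
    """Macro-average accuracy: every subject weighted equally,
    regardless of how many questions it contains."""
    return sum(per_subject.values()) / len(per_subject)

# Illustrative example with two made-up subjects.
scores = {
    "high_school_physics": subject_accuracy(list("ABCD"), list("ABCC")),  # 3/4
    "chinese_history": subject_accuracy(list("AB"), list("AB")),          # 2/2
}
print(overall_score(scores))  # 0.875
```

Under macro-averaging, a model strong on many knowledge-heavy subjects can outscore one that is stronger only on a few hard STEM subjects, which is consistent with point 3 above.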
ChristopheZhao commented 1 year ago

Sorry for my misunderstanding about the ChatGLM2 model at rank 0. Thanks for the explanation.