Hi, I see on your leaderboard that ChatGLM-6B has a higher score than GPT-4. I don't think that's true after trying ChatGLM2-6B, so I wonder whether your scoring method can fairly assess an LLM's abilities. Sorry for my blunt and almost offensive phrasing, but I think the leaderboard is misleading: some people assume that the rank-1 model on it is the SOTA model so far, which is not actually true.
The top-ranked model is not ChatGLM2-6B; it is ChatGLM2. We don't know the model size of ChatGLM2 (presumably much larger than 6B), its weights are not public, and its interface may not yet be open for beta testing either -- so what you experienced could not have been the top-ranked model.
ChatGLM2-6B is also on our leaderboard, and it underperforms ChatGPT.
C-Eval is a Chinese test, and many of its questions concern knowledge in a Chinese context, so it is not very surprising that GPT-4 is weaker on Chinese knowledge. ChatGLM2 actually underperforms GPT-4 on C-Eval Hard -- if you look at their scores on specific subjects, GPT-4 is still much better on complex STEM subjects.
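To illustrate why subject-level scores matter, here is a minimal sketch (not the official C-Eval scoring code; the subject names and accuracies below are made up) showing how a single leaderboard average can flip the ranking relative to individual subjects:

```python
def macro_average(scores):
    """Average per-subject accuracies into one leaderboard number."""
    return sum(scores.values()) / len(scores)

# Hypothetical per-subject accuracies (%) for two models
model_a = {"chinese_history": 70.0, "advanced_math": 40.0}
model_b = {"chinese_history": 55.0, "advanced_math": 60.0}

print(macro_average(model_a))  # 55.0
print(macro_average(model_b))  # 57.5
```

Model B ranks higher on the aggregate, even though Model A is clearly stronger on Chinese history -- which is why a single leaderboard number should not be read as "better at everything."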
The C-Eval leaderboard, like many other leaderboards, should be viewed with caution: different benchmarks assess different abilities, and ALL the public leaderboards we know of have their own limitations.
We emphasize that we created C-Eval to help model development, not to rank models. We do not take any position on model ranking.