Hi, I see on your leaderboard that ChatGLM-6B has a higher score than GPT-4. I don't think that's true after trying ChatGLM2-6B, so I wonder whether your scoring method can fairly assess an LLM's abilities. Sorry for my blunt and almost offensive phrasing, but I think the leaderboard is misleading: some people assume that the rank-1 model on it is the SOTA model so far, which is not actually true.
The top-ranked model is not ChatGLM2-6B; it is ChatGLM2. We don't know the model size of ChatGLM2 (presumably much larger than 6B), its weights are not public, and its interface may not yet be open for beta testing either -- so what you experienced could not have been the top-ranked model.
ChatGLM2-6B is also on our leaderboard, and it underperforms ChatGPT.
C-Eval is a Chinese test, and many of its questions concern knowledge in a Chinese context, so it is not very surprising that GPT-4 is weaker on Chinese knowledge. ChatGLM2 actually underperforms GPT-4 on C-Eval Hard -- if you look at their scores on specific subjects, GPT-4 is still much better on complex STEM subjects.
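To illustrate why subject-level scores matter, here is a minimal sketch (not the official C-Eval scoring code; the subject names and accuracies below are made up) showing how a single leaderboard average can flip the ranking relative to individual subjects:

```python
def macro_average(scores):
    """Average per-subject accuracies into one leaderboard number."""
    return sum(scores.values()) / len(scores)

# Hypothetical per-subject accuracies (%) for two models
model_a = {"chinese_history": 70.0, "advanced_math": 40.0}
model_b = {"chinese_history": 55.0, "advanced_math": 60.0}

print(macro_average(model_a))  # 55.0
print(macro_average(model_b))  # 57.5
```

Model B ranks higher on the aggregate, even though Model A is clearly stronger on Chinese history -- which is why a single leaderboard number should not be read as "better at everything."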
The C-Eval leaderboard, like many other leaderboards, should be viewed with caution: different benchmarks assess different abilities, and ALL the public leaderboards we know of have their own limitations.
We emphasize that we created C-Eval to help model development, not to rank models. We do not take any position on model ranking.