Clever idea! I know your goal is to use statistics, specifically the average number of guesses each model makes before correctly guessing the soup base, to determine the model's intelligence level. However, there's an issue: what if the model determines that it has correctly guessed the soup base (but actually hasn't)?
Use the same LLM (Claude 3.5 Sonnet) as the referee and swap different LLMs in as the player. The referee holds the soup base, which makes judging easy, so judging accuracy stays the same regardless of which player is being tested.
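Roughly what I have in mind, as a minimal sketch (the model name, prompts, and `call_llm` helper are all my assumptions, not anything from this repo):

```python
# Minimal sketch of the loop, assuming a generic call_llm(model, messages)
# helper (hypothetical; wire it to your own API client). The referee gets
# the soup base and answers only "yes" / "no" / "correct"; the player asks
# yes/no questions until the referee says "correct" or the budget runs out.

REFEREE = "claude-3-5-sonnet"   # fixed referee across all player runs
MAX_TURNS = 30                  # guess budget per puzzle

def call_llm(model: str, messages: list[dict]) -> str:
    """Hypothetical helper; replace with a real API client."""
    raise NotImplementedError

def play_one_puzzle(player_model: str, surface: str, soup_base: str) -> int | None:
    """Return the number of turns the player needed, or None if it failed."""
    referee_sys = (
        "You are the referee of a situation puzzle. The hidden answer is: "
        f"{soup_base}. Reply to each player message with exactly one word: "
        "'yes', 'no', or 'correct' (only if the player has guessed the answer)."
    )
    player_msgs = [
        {"role": "system", "content": (
            "You are playing a situation puzzle. The surface story is: "
            f"{surface}. Each turn, ask one yes/no question or state your full guess."
        )},
        {"role": "user", "content": "Begin. Ask your first question."},
    ]
    for turn in range(1, MAX_TURNS + 1):
        question = call_llm(player_model, player_msgs)
        verdict = call_llm(REFEREE, [
            {"role": "system", "content": referee_sys},
            {"role": "user", "content": question},
        ])
        # The referee, not the player, decides success; this sidesteps
        # the self-judging problem raised above.
        if verdict.strip().lower().startswith("correct"):
            return turn
        player_msgs.append({"role": "assistant", "content": question})
        player_msgs.append({"role": "user", "content": verdict})
    return None  # budget exhausted
```

The key point is that success is declared by the referee, never by the player itself.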
I see. Then performance is not a fixed value (correct / total): every time a new SOTA model appears, all models have to be re-tested.
This method can measure the gap between lower-end models and the SOTA model. However, if you don't know whether the SOTA is GPT-4 or Claude 3.5 Sonnet, choosing which one to use is itself a problem.
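For concreteness, here is how I'd compute that relative distance (a sketch under my own assumptions; the `penalty` parameter and the normalization are mine, not from the repo):

```python
# Sketch of the relative metric (my reading of it, not code from this repo):
# average turns-to-solve, normalized against the current SOTA player, so
# every score shifts whenever the SOTA changes.

def avg_turns(results: list[int | None], penalty: int = 30) -> float:
    """Mean turns per puzzle; a failed puzzle counts as the full budget."""
    return sum(r if r is not None else penalty for r in results) / len(results)

def relative_score(model_results: list[int | None],
                   sota_results: list[int | None]) -> float:
    """< 1.0 means the model solves puzzles in fewer turns than the SOTA."""
    return avg_turns(model_results) / avg_turns(sota_results)
```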
Overall, I think it's a good method for measuring the relative performance of models.
Yes, rank matters. And I think GPT-4, Claude 3.5 Sonnet, or Gemini would all be fine.
The referee knows the soup base and the player doesn't, so even a lower-end referee model is effectively smarter than a higher-end player model.
Build a new benchmark that doesn't need annotated data.