mazzzystar / TurtleBench

TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles.
https://arxiv.org/abs/2410.05262
Apache License 2.0

Using LLMs as players to test which model asks the right questions. #5

Closed: iamsk closed this issue 1 month ago

iamsk commented 3 months ago

Build a new benchmark that doesn't need annotated data.

mazzzystar commented 3 months ago

Clever idea! As I understand it, your goal is to use a statistic, specifically the average number of guesses each model makes before correctly guessing the soup base, to determine the model's intelligence level. However, there's an issue: what if the model decides that it has correctly guessed the soup base when it actually hasn't?

iamsk commented 3 months ago

Use the same LLM (Claude 3.5 Sonnet) as the referee and swap in different LLMs as the player. The referee knows the soup base, which makes judging easier, and its judging accuracy stays the same across players.
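
For concreteness, a minimal sketch of how this protocol could look in code (not the repository's actual implementation). The names `play_one_story`, `average_turns`, `referee`, and `player` are hypothetical; the LLM calls are abstracted as plain callables so the loop is self-contained, and the referee here only sees the latest question rather than the full history, which is a simplification.

```python
from typing import Callable, Dict, List, Optional

Message = Dict[str, str]
LLM = Callable[[List[Message]], str]  # takes chat messages, returns a reply string


def play_one_story(surface: str, soup_base: str,
                   referee: LLM, player: LLM,
                   max_turns: int = 30) -> Optional[int]:
    """Return the number of turns the player needed to guess the soup base,
    or None if it failed within max_turns."""
    history: List[Message] = [
        {"role": "system",
         "content": "You are playing a lateral-thinking puzzle. Ask yes/no "
                    "questions, or state your full guess of the hidden story."},
        {"role": "user", "content": f"Puzzle: {surface}"},
    ]
    for turn in range(1, max_turns + 1):
        question_or_guess = player(history)
        # The referee knows the soup base, so it (not the player) decides
        # whether a guess is actually correct. This avoids the player
        # falsely declaring success.
        verdict = referee([
            {"role": "system",
             "content": "You are the referee. You know the hidden story "
                        f"(soup base): {soup_base}. Answer the player's "
                        "yes/no question with 'Yes', 'No', or 'Irrelevant'. "
                        "If the player states a guess that matches the soup "
                        "base, answer 'Correct'."},
            {"role": "user", "content": question_or_guess},
        ])
        if verdict.strip().lower().startswith("correct"):
            return turn
        history.append({"role": "assistant", "content": question_or_guess})
        history.append({"role": "user", "content": verdict})
    return None


def average_turns(stories: List[Dict[str, str]],
                  referee: LLM, player: LLM) -> float:
    """Score one player model: mean turns over the stories it solved.
    Unsolved stories are simply skipped in this sketch."""
    solved = [t for s in stories
              if (t := play_one_story(s["surface"], s["soup_base"],
                                      referee, player)) is not None]
    return sum(solved) / len(solved) if solved else float("inf")
```

With the referee fixed (e.g. Claude 3.5 Sonnet) and the same story set, a lower `average_turns` would then rank player models relative to each other.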

mazzzystar commented 3 months ago

I see. Then the score is not a fixed value like correct/total; every time a new SOTA model appears, we would have to re-test all the other models.

This method can measure the gap between lower-end models and the SOTA model. However, if you don't know whether the SOTA is GPT-4 or Claude 3.5 Sonnet, choosing which one to use as the referee would also be a problem.

Overall, I think it's a good method for measuring the relative performance of models.

iamsk commented 3 months ago

Yes, rank matters. And I think GPT-4, Claude 3.5 Sonnet, or Gemini would be fine as the referee. The referee knows the soup base while the player doesn't, so even a lower-end referee model is effectively "smarter" than a higher-end player model.