mazzzystar / TurtleBenchmark

Benchmark for LLM Reasoning & Understanding with Challenging Tasks from Real Users.
https://mazzzystar.github.io/2024/08/09/turtle-benchmark-zh/

another implementation method #2

Open waleyGithub opened 1 month ago

waleyGithub commented 1 month ago

This game could be implemented with a workflow supported by Dify or Coze, and then deployed in the real world. Inviting people to participate in the game would let you collect more accurate evaluation data.

mazzzystar commented 1 month ago

Thank you for your interest in the game. I'm pleased to inform you that this game is already available in the real world. We've collected guess data from over 4,000 real users on our website. I appreciate your suggestion about Coze/Dify, though I'm not sure I see a significant difference in using those platforms for this particular application.

If you mean allowing users to judge whether an LLM's judgment is good or bad after they've finished the game, then I think the users' upvote/downvote ratio can tell us something, though it also depends on the story quality.
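As a rough sketch of that idea (purely illustrative code, not part of this repo), a per-story upvote ratio could be computed like this, with a small smoothing prior so stories with only a handful of votes aren't pinned to 0 or 1:

```python
from collections import defaultdict

def upvote_ratio(votes, prior_up=1, prior_down=1):
    """Per-story upvote ratio with a Laplace-style prior.
    `votes` is a list of (story_id, is_upvote) pairs; all names here
    are hypothetical, not taken from the TurtleBenchmark codebase."""
    tally = defaultdict(lambda: [0, 0])  # story_id -> [upvotes, downvotes]
    for story_id, is_upvote in votes:
        tally[story_id][0 if is_upvote else 1] += 1
    return {
        sid: (up + prior_up) / (up + down + prior_up + prior_down)
        for sid, (up, down) in tally.items()
    }

# Example: story_a gets 2 up / 1 down, story_b gets 1 up / 0 down.
votes = [("story_a", True), ("story_a", True), ("story_a", False), ("story_b", True)]
print(upvote_ratio(votes))
```

The prior keeps a single stray downvote on a new story from making it look worse than a well-established one, which matters if the ratio is used to rank or filter stories.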
