mazzzystar / TurtleBenchmark

Benchmark for LLM Reasoning & Understanding with Challenging Tasks from Real Users.
https://mazzzystar.github.io/2024/08/09/turtle-benchmark-zh/

another implementation method #2

Open waleyGithub opened 1 month ago

waleyGithub commented 1 month ago

This game could be implemented with a workflow supported by Dify or Coze, and then deployed in the real world. Inviting people to participate in the game would let you collect more accurate evaluation data.

mazzzystar commented 1 month ago

Thank you for your interest in the game. I'm pleased to inform you that this game is already available in the real world. We've collected guess data from over 4,000 real users on our website. I appreciate your suggestion about Coze/Dify, though I'm not sure I see a significant difference in using those platforms for this particular application.

If you mean allowing users to judge whether an LLM's judgment is good or bad after they've finished the game, then I think the users' upvote/downvote ratio can tell us something, though it also depends on the story quality.
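As a rough sketch of that idea (purely illustrative code, not part of this repo), a per-story upvote ratio could be computed like this, with a small smoothing prior so stories with only a handful of votes aren't pinned to 0 or 1:

```python
from collections import defaultdict

def upvote_ratio(votes, prior_up=1, prior_down=1):
    """Per-story upvote ratio with a Laplace-style prior.
    `votes` is a list of (story_id, is_upvote) pairs; all names here
    are hypothetical, not taken from the TurtleBenchmark codebase."""
    tally = defaultdict(lambda: [0, 0])  # story_id -> [upvotes, downvotes]
    for story_id, is_upvote in votes:
        tally[story_id][0 if is_upvote else 1] += 1
    return {
        sid: (up + prior_up) / (up + down + prior_up + prior_down)
        for sid, (up, down) in tally.items()
    }

# Example: story_a gets 2 up / 1 down, story_b gets 1 up / 0 down.
votes = [("story_a", True), ("story_a", True), ("story_a", False), ("story_b", True)]
print(upvote_ratio(votes))
```

The prior keeps a single stray downvote on a new story from making it look worse than a well-established one, which matters if the ratio is used to rank or filter stories.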
