We should have a design document for the benchmark itself:
List categories of tasks across games (a table).
What are the common tasks across the games (Position tasks, Presence tasks etc.)?
Look at SentEval, GLUE etc. for inspiration. What design decisions did they make?
Should we have control tasks in addition to probing tasks?
What should the nature of each task be? Classification makes more sense because it's more interpretable. Is there a case for regression? Are there edge cases where regression makes more sense?
What would a normalized score across all tasks and all games look like? We should probably have a normalized score for each game, and then a normalized score across all games.
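One possible shape for the two-level score, as a rough sketch only. Everything here is an assumption made for illustration: that every task reports an accuracy alongside a chance or majority-class baseline, that we rescale so the baseline maps to 0 and perfect accuracy maps to 1, and that we macro-average (tasks within a game first, then games). The function names and the numbers in `results` are hypothetical, not measurements.

```python
from statistics import mean


def normalize(accuracy, baseline):
    """Rescale accuracy so the baseline maps to 0 and perfect accuracy to 1.

    Assumption: baseline is a chance or majority-class accuracy in [0, 1).
    Scores below the baseline come out negative; they are not clipped.
    """
    if not 0.0 <= baseline < 1.0:
        raise ValueError("baseline must be in [0, 1)")
    return (accuracy - baseline) / (1.0 - baseline)


def game_score(tasks):
    """Mean normalized score over one game's tasks.

    tasks: dict mapping task name -> (accuracy, baseline).
    """
    return mean(normalize(acc, base) for acc, base in tasks.values())


def benchmark_score(games):
    """Mean of per-game scores, so every game counts equally
    regardless of how many tasks it has."""
    return mean(game_score(tasks) for tasks in games.values())


# Hypothetical numbers, for illustration only.
results = {
    "game_a": {"position": (0.75, 0.5), "presence": (0.9, 0.8)},
    "game_b": {"position": (0.6, 0.2)},
}

per_game = {name: game_score(tasks) for name, tasks in results.items()}
overall = benchmark_score(results)
```

Averaging per game first keeps a game with many tasks from dominating the overall number. Whether that is the behaviour we want, and how regression tasks would fit into the same scale, are still open questions.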