ggp-org / ggp-base

The General Game Playing Base Package

General Game Player Benchmarks #3

Open samschreiber opened 10 years ago

samschreiber commented 10 years ago

Currently I am not aware of any good ways of objectively benchmarking the real play quality of a GGP player, particularly in a reasonable timeframe. The closest approximation that I have is the long-term average score on an online round-robin server, but this metric has several undesirable properties:

  1. It depends on the other players on the online server. If better/worse players join, your score may go down/up, independent of the quality of your player.
  2. It depends on the distribution of games on the online server. If more games which your player performs better/worse on are added, your score will increase/decrease, independent of the quality of your player.

A benchmark should depend only on a single player. It should be possible to compute in a reasonable time frame, which I'm going to arbitrarily define as 100 hours, but faster is much better, since the benchmark should be part of a design-test-iterate cycle.

These benchmarks should be purely a black-box metric: hypothetically, any player should be able to connect to a "benchmarking server", play a series of games, and get a score at the end. Nothing in the benchmark should depend on the internal state of the player. (Teams should, of course, develop their own white-box metrics to supplement this black-box metric, but those aren't general-purpose components suitable for GGP Base.)

The benchmark does not need to be a single number. A set of numbers, each measuring the quality of a particular aspect of play, is perfectly fine. The benchmark also does not need to be "authoritative", in that there can be multiple equally-valid benchmarks with different values. (As a result, individual benchmarks can be composed to form larger, more comprehensive benchmarks.)

The benchmark should be a real number between 0 and 100. Larger numbers should always be strictly better than smaller numbers.

Any suggestions for objective black-box benchmarks will be appreciated. I hope to put together a large, comprehensive benchmark suite, so that different players (and different versions of the same player) can be independently benchmarked. This should help make the player development process slightly more scientific.

Additional thoughts:

Benchmarks don't necessarily need to be deterministic, since the player isn't likely to be deterministic either. On the other hand, all else being equal, deterministic behavior is preferable.

Since the benchmarks are black-box metrics, they must be computed based on having the target player play through an ordinary game (or a series of such games). There are some complex ways to compute a metric based on a game, but examples of some simple ones are:

Each of these tests can be repeated to decrease variance. Reasonable times for these matches are 60 + 15x, in seconds, where 'x' is the number of moves in the match. Tic-Tac-Toe matches will be at most 3.25 minutes each, and Connect Four matches will be at most 13 minutes each. These are well within the 6000-minute (100 hour) total time.
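As a quick sanity check on those figures, here's the arithmetic in a small Java sketch (assuming "x" is the maximum number of moves in the match, and treating Connect Four as a 48-move game, which is what makes the 13-minute figure work out):

```java
final class ClockBudget {
    // Maximum match time for the "60 + 15x" rule, in seconds, where maxMoves
    // is the maximum number of moves in the match.
    static int maxMatchSeconds(int maxMoves) {
        return 60 + 15 * maxMoves;
    }
    // Tic-Tac-Toe:  maxMatchSeconds(9)  = 195 s = 3.25 minutes
    // Connect Four: maxMatchSeconds(48) = 780 s = 13 minutes
}
```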

The example benchmarks are good at determining whether a player has a baseline of reasonable behavior, which is useful to know, but they don't provide a nuanced, fine-grained notion of the quality of the player's play on harder games.

arr28 commented 10 years ago

Another benchmark: Play all (known good) 1-player games in the repository. Benchmark value is total score / number of games played.
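A minimal sketch of that scoring rule, with OnePlayerGame standing in for whatever match runner the benchmark ends up using (it is not an actual GGP Base type):

```java
import java.util.List;

// Average final score across all known-good single-player games.
public final class OnePlayerBenchmark {
    interface OnePlayerGame {
        int playToCompletion();  // the player's final goal value, 0-100
    }

    public static double score(List<OnePlayerGame> games) {
        double total = 0;
        for (OnePlayerGame game : games) {
            total += game.playToCompletion();
        }
        // Benchmark value = total score / number of games played.
        return total / games.size();
    }
}
```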

AlexLandau commented 9 years ago

An idea I had a while back: Find a bunch of states in various games with known correct and incorrect responses (i.e. these moves will lead to a goal value of X, these others will lead to a value of Y < X). For each one, have the player play the game starting from that state. (This can best be achieved by programmatically changing the "init" sentences of the game to describe the desired state.) See if the response is one of the correct moves. The game can then be abandoned, to quickly proceed to the next test.
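To make the init rewiring concrete, here's a rough sketch that operates on the GDL as raw text purely for illustration; a real implementation would presumably work on parsed GDL sentences instead, and the helper names here are hypothetical:

```java
import java.util.List;

// Replace a game's (init ...) sentences with ones describing a chosen
// "critical decision" state. Assumes the rule sheet is plain GDL text.
public final class InitRewriter {

    /** Removes every (init ...) sentence from the rule sheet. */
    static String stripInits(String rules) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < rules.length()) {
            if (rules.startsWith("(init ", i) || rules.startsWith("(init(", i)) {
                int depth = 0;
                do {  // skip to the matching closing parenthesis
                    char c = rules.charAt(i++);
                    if (c == '(') depth++;
                    else if (c == ')') depth--;
                } while (depth > 0);
            } else {
                out.append(rules.charAt(i++));
            }
        }
        return out.toString();
    }

    /** Adds one (init ...) sentence per base proposition true in the target state. */
    public static String rewriteInits(String rules, List<String> targetStateProps) {
        StringBuilder rewritten = new StringBuilder(stripInits(rules));
        for (String prop : targetStateProps) {      // e.g. "(cell 1 1 x)"
            rewritten.append("\n(init ").append(prop).append(")");
        }
        return rewritten.toString();
    }
}
```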

Scoring could be one-point-per-success, or failing tests could be repeated with longer play clocks to allow for partial credit.

It may be possible to collect a good selection of moves automatically from the endgames of matches on Tiltyard. This would require complete searches to check which moves are valid responses, of course. (On a related note, it would be interesting to measure the size of the tree that must be examined after the "last mistake" and compare that to player skill levels and the game involved...)

samschreiber commented 9 years ago

I like this idea.

There are ~340k matches recorded on GGP.org (via the ggp.org/researchers page). This is a large number, even if we exclude all games that aren't two-player zero-sum. We could build a system that reads matches, picks a state a few moves before the end, does a complete search of the remaining game space, and marks down the optimal moves for each role. These could then become test cases, as you describe.
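A sketch of that pipeline, with ArchivedMatch and Solver as hypothetical stand-ins (they aren't GGP Base classes); filtering for tractability and for two-player zero-sum games is omitted:

```java
import java.util.ArrayList;
import java.util.List;

// Turn archived matches into test cases: rewind a few moves from the end,
// solve the remaining game exhaustively, and record the optimal moves.
public final class TestCaseExtractor {
    record TestCase(String gameKey, String role, String state, List<String> optimalMoves) {}

    interface ArchivedMatch {
        String gameKey();
        List<String> roles();
        String stateNMovesBeforeEnd(int n);  // replay the match, stopping n moves early
    }

    interface Solver {
        // Complete search of the remaining game space from 'state', returning
        // the optimal moves for 'role'.
        List<String> optimalMoves(String gameKey, String state, String role);
    }

    public static List<TestCase> extract(List<ArchivedMatch> matches, Solver solver) {
        List<TestCase> cases = new ArrayList<>();
        for (ArchivedMatch match : matches) {
            String state = match.stateNMovesBeforeEnd(4);  // "a few moves before the end"
            for (String role : match.roles()) {
                cases.add(new TestCase(match.gameKey(), role, state,
                        solver.optimalMoves(match.gameKey(), state, role)));
            }
        }
        return cases;
    }
}
```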

arr28 commented 9 years ago

FWIW, Sancho does something similar as part of its own internal test suite. After a run on Tiltyard, we look through any unexpected losses (i.e. losses in games that aren't badly biased) and identify the bad move(s). Then we have an automated process for pulling the records from Tiltyard and converting them into a regression test. (We don't re-wire the init props. Instead, Sancho has a "fast-forward" test mode where we supply it with moves that it must play to bring it into the state in which it made a mistake.) We still need to manually identify either a set of correct moves or a set of incorrect moves. That's because these mistakes often happen in the middle of a game where, humanly, we can identify a move as good/bad, but there's no way we could ever expect to brute-force all the way to the end of the game.

Obviously, the way Sancho does it (a) wouldn't work for a general player because it relies on our fast-forward test mode and (b) requires manual intervention. But I like your automation ideas.

On a practical level, I'd be slightly wary about deliberately abandoning games. I think we saw in IGGPC'14 that even some of the best players don't really cope well with abandoned games.

samschreiber commented 9 years ago

That's neat. I'd definitely like to build this benchmarking system so that it can work with any player, so that we can include it by default in GGP Base -- that way we can use the benchmark to compare vastly different players (e.g. TurboTurtle vs Sancho), and folks building new players can use the existing benchmark tools to test out their new players without needing to make player-specific changes.

Players should be able to handle abandoned games (at least, if they're properly ended with an ABORT command). GGP Base should handle this cleanly without the players needing to do anything special. And if there are buggy situations where this doesn't work, then having the benchmark use it enthusiastically will encourage people to fix those bugs, improving players overall. So in my opinion that approach is fine. :-)

SteveDraper commented 9 years ago

One minor point of caution. I found recently (when working on games with detected heuristics, like Skirmish and Speed Chess in Sancho's case) that I couldn't reproduce some bad play (usually a material-losing sequence) by starting the search at the move in question. It turned out to be an effect of contiguous heuristic-changing sequences (multiple piece captures in an extended exchange, in this case): different search biases emerge when the tree is expanded from a state before the start of the sequence rather than from a position part way through it. I modified the way our 'fast forward' works so that it begins its search expansion a few turns ahead of the to-be-tested turn, while still force-playing the necessary moves (regardless of the tree that is building up) until it reaches the to-be-tested state. Doing this allowed me to recover the original (bad move) behavior from the test cases, and hence enabled me to debug them.
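For what it's worth, the shape of that idea looks something like the sketch below, with hypothetical State/Move/Search types rather than Sancho's actual code: the tree is rooted a few turns before the state under test, but the recorded moves are still forced regardless of what the growing tree would prefer.

```java
import java.util.List;

// Fast-forward to a target state while letting the search "warm up" over the
// last few forced moves, so heuristic-changing sequences (e.g. extended
// exchanges) influence the tree the same way they did in the original match.
public final class FastForward {
    interface State {}
    interface Move {}
    interface Search {
        void expandFrom(State state);          // grow the search tree from this state
        State apply(State state, Move move);   // advance the state by one forced move
    }

    public static State replay(Search search, State initial,
                               List<Move> forcedMoves, int searchLeadIn) {
        int searchStart = Math.max(0, forcedMoves.size() - searchLeadIn);
        State state = initial;
        for (int i = 0; i < forcedMoves.size(); i++) {
            if (i >= searchStart) {
                search.expandFrom(state);                      // start thinking early...
            }
            state = search.apply(state, forcedMoves.get(i));   // ...but force the recorded move
        }
        return state;  // the to-be-tested state
    }
}
```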

Obviously much of this is Sancho-specific, but the general point of caution I'm making is that starting the search from an earlier state sometimes results in better convergence to the right move sequence than starting it part way through!

samschreiber commented 9 years ago

Fascinating. I suppose that ultimately it's not going to be possible to perfectly reproduce bad behavior without playing through the full game, up to the point of the critical decision. That's going to be more expensive, though. I'd like to see how far we can get with single-move tests before expanding to running through multiple states.

I've just checked in https://github.com/ggp-org/ggp-base/commit/868db7a2855eb2afd9c723be37d6240bdcbaf4ef which adds an early prototype of a system that runs single-move tests on a player. It works by producing an "in medias res" version of a game that starts at the given "critical decision" state, starting a match using that game, and checking whether the player's move response matches one of the "known good" move responses. Lastly, it aborts the match so that the player will be ready for more tests.
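In rough outline, each test does something like the following (PlayerUnderTest is a hypothetical stand-in here, not the prototype's actual API, and the 60/15 clocks are arbitrary):

```java
import java.util.Set;

// Run one single-move test: start a match from the in-medias-res game, read
// the player's move at the critical decision, then abort so the player is
// ready for the next test.
public final class SingleMoveTest {
    interface PlayerUnderTest {
        void startMatch(String inMediasResRules, String role,
                        int startClockSeconds, int playClockSeconds);
        String getMoveAtCriticalDecision();  // the player's first (and only tested) move
        void abortMatch();                   // frees the player up for more tests
    }

    public static boolean passes(PlayerUnderTest player, String inMediasResRules,
                                 String role, Set<String> acceptableMoves) {
        player.startMatch(inMediasResRules, role, 60, 15);
        String move = player.getMoveAtCriticalDecision();
        player.abortMatch();
        return acceptableMoves.contains(move);
    }
}
```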

There's still a lot of work to do, particularly around building up a comprehensive suite of test cases that cover critical decisions in large complex games. (Right now it only tests the first move in tic-tac-toe). But the concept is definitely viable.

samschreiber commented 9 years ago

I've just checked in https://github.com/ggp-org/ggp-base/commit/315c10352adcefa67a7c676ca88ab4e2150e4962 which improves on the initial prototype, including adding a collection of test cases vetted using minimax.

samschreiber commented 9 years ago

BTW, @arr28 and @SteveDraper, if you want to contribute any particularly interesting test cases from Sancho to the PlayerTesterCases in GGP Base so that other developers can benefit from them as well, that would be awesome. :-)

Basically it's just a list of test cases of the form (game, role, start clock, play clock, state, acceptable moves).
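In Java terms, each entry is something like this (the field names are illustrative and don't necessarily match the actual PlayerTesterCases):

```java
import java.util.Set;

// One test case: a critical-decision state plus the set of moves that count
// as a correct response for the given role.
public record PlayerTestCase(
        String gameKey,              // e.g. "ticTacToe"
        String role,                 // e.g. "xplayer"
        int startClockSeconds,
        int playClockSeconds,
        Set<String> state,           // the critical-decision state, as GDL sentences
        Set<String> acceptableMoves) {}
```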

arr28 commented 9 years ago

Feel free to convert any of our test cases at https://github.com/SteveDraper/ggp-base/tree/develop/data/tests/suites. There are about a dozen of them (in Tiltyard.*.json). There's no particular rhyme or reason behind them; they're just games where we happened to notice that Sancho made a duff move. (The filename contains the match ID, so you can examine the game in the Tiltyard match database if you like.)

I also started producing a suite to ensure that we played optimally in tic-tac-toe (as defined by this diagram). However, even when Sancho can expand the whole game tree (which it trivially can for TTT), if it sees that the game is a forced draw it doesn't have the necessary logic to give the opponent the maximum opportunity to make a mistake. Therefore, it won't play "optimally" (at least as described by the diagram). As a result, I only added a couple of cases.

samschreiber commented 9 years ago

Great, thanks! I've added a tool for importing Sancho-style tests and imported several in https://github.com/ggp-org/ggp-base/commit/5089a5b998d453044741bbe41f9934be93b16640.