ucfai / knightros-gambit

Knightr0's Gambit is a partnership between the UCF student chapter of the Institute of Electrical and Electronics Engineers (IEEE) and the UCF Artificial Intelligence club to build an automatic chessboard (similar to Harry Potter's Wizard Chess, but with less violence) powered by a custom chess AI.
https://ucfai.github.io/knightros-gambit/

Develop way to evaluate performance of our model #110

Open nashirj opened 2 years ago

nashirj commented 2 years ago

Ideas:

Comparing by win/loss against other agents

Comparing output move distribution

nashirj commented 2 years ago

Note: we want to normalize the similarity metric to always be between 0 and 1.
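One metric that is guaranteed to lie in [0, 1] is Jensen-Shannon divergence with base-2 logs: it is 0 for identical move distributions and 1 for disjoint ones, so `1 - JSD` works directly as a similarity score. A minimal sketch (the move distributions here are hypothetical stand-ins for the two models' policy outputs):

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, bounded in [0, 1].
    p and q map UCI move strings to probabilities."""
    moves = set(p) | set(q)
    m = {mv: 0.5 * (p.get(mv, 0.0) + q.get(mv, 0.0)) for mv in moves}

    def kl(a, b):
        # Kullback-Leibler divergence; skips zero-probability moves.
        return sum(a[mv] * math.log2(a[mv] / b[mv]) for mv in a if a[mv] > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def move_similarity(p, q):
    """Similarity in [0, 1]: 1 for identical distributions, 0 for disjoint."""
    return 1.0 - js_divergence(p, q)
```

For example, two agents that both play `e2e4` with probability 1 get similarity 1.0, while agents with no overlap in their move distributions get 0.0.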

nashirj commented 2 years ago

In our last software meeting, we discussed using a series of puzzles (https://database.lichess.org/#puzzles) to determine the Elo of the agent. Each of these puzzles has a distinct best move for the "player" to make, so we can use them to objectively quantify performance. One caveat is that there may be puzzles with multiple 'mate-in-one' moves, in which case any move leading to checkmate should be counted as a solution. We can select a subset of puzzles and incorporate them into the training loop every n iterations to quantify model improvement/deterioration.
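One way to turn puzzle results into an Elo estimate is to treat each puzzle as an opponent of its listed rating and fit the agent's rating by maximum likelihood under the standard Elo expected-score formula. That modeling assumption, the `(puzzle_rating, solved)` record shape, and the precomputed accepted-move sets are all sketch-level choices, not anything decided in this thread:

```python
import math

def is_solved(agent_move, accepted_moves):
    """A puzzle counts as solved if the agent's move is in the accepted set.
    For mate-in-one puzzles the set would hold every mating move (per the
    caveat above); otherwise it holds only the puzzle's unique best move."""
    return agent_move in accepted_moves

def solve_prob(elo, puzzle_rating):
    # Standard Elo expected-score formula; treating a puzzle as an
    # opponent of its listed rating is a modeling assumption.
    return 1.0 / (1.0 + 10 ** ((puzzle_rating - elo) / 400))

def estimate_elo(results, lo=0.0, hi=4000.0, iters=100):
    """Maximum-likelihood Elo from (puzzle_rating, solved) pairs, found by
    ternary search (the logistic log-likelihood is concave in elo)."""
    def loglik(elo):
        return sum(
            math.log(solve_prob(elo, r) if ok else 1.0 - solve_prob(elo, r))
            for r, ok in results
        )
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if loglik(m1) < loglik(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2
```

As a sanity check, an agent that solves exactly half of the 1500-rated puzzles it sees should be estimated at about 1500.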

nashirj commented 2 years ago

A couple of interesting quotes I read just now:

> Remember that tactics only come about because it's a good position. If you don't know how to play positionally and set up for tactics, they will never show up in your games

> I agree that it's not all about tactics, but even if it was, there's no reason these two ratings should be in sync with each other. They are completely different systems. One is a result of head-to-head competition, the other is a solo endeavor where the "rating" you get assigned is really quite arbitrary.

So maybe we should use puzzles as a first pass, and if the new AI can solve the puzzles, evaluate it with self-play against the previous best model?
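The self-play gate could look like the one AlphaGo Zero used, where a candidate network replaces the current best only if it scores at least 55% in head-to-head games. A minimal sketch; the `play_game` interface and the 55% threshold are assumptions, not something we've settled on:

```python
def evaluate_candidate(play_game, n_games=100, threshold=0.55):
    """Gate a candidate model by head-to-head play against the current best.

    `play_game(candidate_plays_white)` is a hypothetical callback that runs
    one game and returns 1.0 for a candidate win, 0.5 for a draw, 0.0 for a
    loss. Colors alternate to cancel out first-move advantage. Returns True
    if the candidate should replace the current best model."""
    score = sum(play_game(i % 2 == 0) for i in range(n_games))
    return score / n_games >= threshold
```

If the candidate fails the gate, the previous best keeps generating self-play data, which is the design choice AlphaGo Zero made to avoid locking in a regression.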

nashirj commented 2 years ago

Here is how AlphaZero does evaluation:

[Image: alphazero-evaluation]