ucfai / knightros-gambit

Knightr0's Gambit is a partnership between the UCF student chapter of the Institute of Electrical and Electronics Engineers (IEEE) and the UCF Artificial Intelligence club to build an automatic chessboard (similar to Harry Potter's Wizard Chess, but with less violence) powered by a custom chess AI.
https://ucfai.github.io/knightros-gambit/

Develop way to evaluate performance of our model #110

Open nashirj opened 2 years ago

nashirj commented 2 years ago

Ideas:

Comparing by win/loss against other agents

Comparing output move distribution

nashirj commented 2 years ago

Note: we want to normalize the similarity metric to always be between 0 and 1.
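One metric that is guaranteed to lie in [0, 1] is Jensen-Shannon divergence with base-2 logs: it is 0 for identical move distributions and 1 for disjoint ones, so `1 - JSD` works directly as a similarity score. A minimal sketch (the move distributions here are hypothetical stand-ins for the two models' policy outputs):

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, bounded in [0, 1].
    p and q map UCI move strings to probabilities."""
    moves = set(p) | set(q)
    m = {mv: 0.5 * (p.get(mv, 0.0) + q.get(mv, 0.0)) for mv in moves}

    def kl(a, b):
        # Kullback-Leibler divergence; skips zero-probability moves.
        return sum(a[mv] * math.log2(a[mv] / b[mv]) for mv in a if a[mv] > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def move_similarity(p, q):
    """Similarity in [0, 1]: 1 for identical distributions, 0 for disjoint."""
    return 1.0 - js_divergence(p, q)
```

For example, two agents that both play `e2e4` with probability 1 get similarity 1.0, while agents with no overlap in their move distributions get 0.0.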

nashirj commented 2 years ago

In our last software meeting, we discussed using a series of puzzles (https://database.lichess.org/#puzzles) to determine the Elo of the agent. Each of these puzzles has a distinct best move for the "player" to make, so we can use them to objectively quantify performance. One caveat is that there may be puzzles with multiple 'mate-in-one' moves, in which case any move leading to checkmate should be counted as a solution. We can select a subset of puzzles and incorporate them into the training loop every n iterations to quantify model improvement/deterioration.
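One way to turn puzzle results into an Elo estimate is to treat each puzzle as an opponent of its listed rating and fit the agent's rating by maximum likelihood under the standard Elo expected-score formula. That modeling assumption, the `(puzzle_rating, solved)` record shape, and the precomputed accepted-move sets are all sketch-level choices, not anything decided in this thread:

```python
import math

def is_solved(agent_move, accepted_moves):
    """A puzzle counts as solved if the agent's move is in the accepted set.
    For mate-in-one puzzles the set would hold every mating move (per the
    caveat above); otherwise it holds only the puzzle's unique best move."""
    return agent_move in accepted_moves

def solve_prob(elo, puzzle_rating):
    # Standard Elo expected-score formula; treating a puzzle as an
    # opponent of its listed rating is a modeling assumption.
    return 1.0 / (1.0 + 10 ** ((puzzle_rating - elo) / 400))

def estimate_elo(results, lo=0.0, hi=4000.0, iters=100):
    """Maximum-likelihood Elo from (puzzle_rating, solved) pairs, found by
    ternary search (the logistic log-likelihood is concave in elo)."""
    def loglik(elo):
        return sum(
            math.log(solve_prob(elo, r) if ok else 1.0 - solve_prob(elo, r))
            for r, ok in results
        )
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if loglik(m1) < loglik(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2
```

As a sanity check, an agent that solves exactly half of the 1500-rated puzzles it sees should be estimated at about 1500.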

nashirj commented 2 years ago

A couple of interesting quotes I read just now:

> Remember that tactics only come about because it's a good position. If you don't know how to play positionally and set up for tactics, they will never show up in your games

> I agree that it's not all about tactics, but even if it was, there's no reason these two ratings should be in sync with each other. They are completely different systems. One is a result of head-to-head competition, the other is a solo endeavor where the "rating" you get assigned is really quite arbitrary.

So maybe we should use puzzles as a first pass, and if the new AI can solve the puzzles, evaluate it with self-play against the previous best model?
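The self-play gate could look like the one AlphaGo Zero used, where a candidate network replaces the current best only if it scores at least 55% in head-to-head games. A minimal sketch; the `play_game` interface and the 55% threshold are assumptions, not something we've settled on:

```python
def evaluate_candidate(play_game, n_games=100, threshold=0.55):
    """Gate a candidate model by head-to-head play against the current best.

    `play_game(candidate_plays_white)` is a hypothetical callback that runs
    one game and returns 1.0 for a candidate win, 0.5 for a draw, 0.0 for a
    loss. Colors alternate to cancel out first-move advantage. Returns True
    if the candidate should replace the current best model."""
    score = sum(play_game(i % 2 == 0) for i in range(n_games))
    return score / n_games >= threshold
```

If the candidate fails the gate, the previous best keeps generating self-play data, which is the design choice AlphaGo Zero made to avoid locking in a regression.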

nashirj commented 2 years ago

Here is how AlphaZero does evaluation:

[Image: alphazero-evaluation]