gjstein opened this issue 7 years ago (status: Open)
@gjstein I don't think this is quite correct. I'm making the assumption that the game can only be lost due to an action of your opponent. This means that negative reward cannot be accrued on nodes associated with a state in which it is the agent's turn, which has the effect of assigning 0 reward on alternating levels of the tree after each rollout. With the above code, wins along a certain path will be cancelled out by losses along the same path (++ followed by --). This could result in a branch that has a 50% win rate actually showing up as having 0 wins for both players.
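The cancellation is easy to see in a minimal sketch. The `Node` shape and function names below are my own for illustration, not the repository's actual code:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    player_just_moved: int            # 1 or 2: who made the move into this state
    parent: Optional["Node"] = None
    visits: int = 0
    wins: float = 0.0

def backprop_flip_sign(leaf: Node, reward: float) -> None:
    """Scheme as written: flip the reward sign at every level.
    A +1 at one node becomes a -1 at its parent, so wins and losses
    along the same path cancel out (++ followed by --)."""
    node, r = leaf, reward
    while node is not None:
        node.visits += 1
        node.wins += r
        r = -r
        node = node.parent

def backprop_per_player(leaf: Node, winner: int) -> None:
    """Suggested scheme: credit a node only when the player who just
    moved into it is the winner; other levels accrue nothing."""
    node = leaf
    while node is not None:
        node.visits += 1
        if node.player_just_moved == winner:
            node.wins += 1.0
        node = node.parent
```

With the sign-flipping version, every other node along a winning path records a loss, which is what drives a 50/50 branch's totals toward zero; the per-player version leaves those intermediate levels at zero instead.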
This is the plot with the code as is:
With the suggested code, I get this:
It's interesting that you're seeing better decisions in Connect4. I'll need to code it up to see if I can reproduce that.
Hmm interesting. I saw that as well, so I rewrote some of the ways in which the score is attributed. You can see the full changes in my branch here. Play around with that code if you'd like, along with my implementation of Connect Four (which I'm pretty sure is correct...). I can message in detail about my motivation behind the changes tomorrow.
As for the results, my agent performs rather well against yours. Build and run the ./connectfour script: in my tests, my modifications have me winning 100% of the time. I can't quite figure out why yours is behaving oddly, but it fails to block its opponent some of the time, as in this example (your agent is o):
```
- - - - - - -
- - - - - - -
- - x o - - -
- - o x - x -
- o x o o x -
- o x x o x -
IN_PROGRESS
```

```
- - - - - - -
- - - - - - -
- - x o - - -
- - o x - x -
- o x o o x -
o o x x o x -
O_WINS
```
My thinking is that not backpropagating the loss as I suggest above causes the agent to greedily search for winning board positions for itself, while not taking positive action to prevent the opponent from beating it.
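One way to see the blocking failure is through the selection step. If losses are never backpropagated, a move that hands the opponent an immediate win keeps a win total of 0, so its UCB1 score is indistinguishable from a genuinely neutral move's. A rough sketch (the constant and the numbers are illustrative, not from the repository):

```python
import math

def ucb1(wins: float, visits: int, parent_visits: int, c: float = 1.41) -> float:
    """Standard UCB1: exploitation term plus exploration bonus."""
    return wins / visits + c * math.sqrt(math.log(parent_visits) / visits)

# With losses propagated, a move that lets the opponent win accumulates
# negative reward and selection learns to avoid (i.e., block) that line.
losing_move = ucb1(wins=-8.0, visits=10, parent_visits=100)
neutral_move = ucb1(wins=0.0, visits=10, parent_visits=100)

# Without loss propagation, the losing move's win total stays at 0, so
# it scores exactly the same as a neutral move.
unpropagated_losing_move = ucb1(wins=0.0, visits=10, parent_visits=100)
```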
New set of changes looks good; I'll pull it into master. I saw that you had Minimax commented out as a Connect 4 player. I'm guessing you didn't have the 190 million gigs of RAM needed for it to run to completion. :)
You currently have the following code in the `Backpropagation` method:

However, this leaves two cases in which the reward is not properly applied, which causes problems for more sophisticated games. I've added the two additional cases, and the algorithm seems to make more reasonable decisions now (I'm testing on my Connect Four implementation):
Try adding this to your Tic-Tac-Toe code; I'd be curious to see what the performance statistics look like when you do.
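For reference, a backpropagation routine that covers all three terminal outcomes (a win for either player, plus a draw) might look like the sketch below. The `Node` fields, the `winner == 0` draw encoding, and the 0.5 draw credit are assumptions of mine, not the actual code from this repository:

```python
from typing import Optional

class Node:
    def __init__(self, player_just_moved: int, parent: Optional["Node"] = None):
        self.player_just_moved = player_just_moved  # player who moved into this state
        self.parent = parent
        self.visits = 0
        self.wins = 0.0

def backpropagate(leaf: Node, winner: int) -> None:
    """winner: 1 or 2 for a decisive game, 0 for a draw (my encoding)."""
    node = leaf
    while node is not None:
        node.visits += 1
        if winner == 0:
            node.wins += 0.5                      # draw: half credit at every level
        elif node.player_just_moved == winner:
            node.wins += 1.0                      # win for the player who moved here
        # else: loss for the player who moved here -> add nothing (explicit 0)
        node = node.parent
```

The point is that every node is visited and updated on every rollout; a loss contributes an explicit 0 rather than being skipped, so win rates stay meaningful for both players.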